[lammps-users] parallel running problem

Dear Axel,

Glad to hear that; it looks more reasonable than the VMD case. I really look forward to your solution!

By the way, I have some experience that may serve as a reference for you. I encountered such a problem before, and I remember I modified .../mpich/share/machines.LINUX to specify the nodes involved. A parameter "-machinefile machine" also needs to accompany the mpirun command, as: mpirun -machinefile machine -np 2 ./lmp_g++ < in.shear. Then everything ran normally in parallel. However, the same procedure doesn't work now; I will try to figure out what's wrong.
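
For illustration, the whole setup looked roughly like this (the hostnames
below are just placeholders, not the actual node names):

    # contents of the file "machine" (same one-hostname-per-line
    # format as machines.LINUX)
    node01
    node02

    # launch the parallel run across the listed nodes
    mpirun -machinefile machine -np 2 ./lmp_g++ < in.shear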

Best regards,
Damien
2009-07-21

> Dear Axel,

dear damien,

after some effort i managed to track down the problem and,
despite my hunches, the problem is the new lammpsplugin in
the latest VMD beta. it contains an optimization that only
works if all atoms in the very first step are sorted.
this is very common in lammps dumps, but not always
true and not guaranteed in general. removing this optimization
makes it work, at the cost of needing more memory to read
a lammps trajectory. since the full VMD release is imminent,
i don't have the time to implement a better variant
of this optimization (which would apply quite often), but rather
have to disable it, so we will have a working plugin in
the release version.

as a temporary workaround for similar systems, i suggest you
run in serial until you have a first frame written out, load
this frame into VMD, and then immediately save it as a .psf file
(animate write psf out.psf). then load the parallel dump
file with: vmd out.psf -lammpstrj my.dump
that should still work.
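
in commands, the recipe is roughly the following (in.myrun is just a
placeholder for your actual input, my.dump for your parallel dump file):

    # 1. run in serial until a first frame has been written out
    ./lmp_g++ < in.myrun

    # 2. load that frame into VMD and save the topology from the
    #    VMD console:  animate write psf out.psf

    # 3. load the parallel dump file on top of the psf
    vmd out.psf -lammpstrj my.dump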

best regards,
    axel.

Dear all,

I've been using the Fortran version of LAMMPS on single processors for quite some time, and recently we got access to large clusters, so we decided to move to the C++ version and do parallel simulations, hoping to get the work done much faster. Thanks to the well-documented LAMMPS manual and the sample makefiles in the distribution package, I was able to compile the 7 Jul 2009 version without too much trouble.
However, when I tested the parallel version on the Linux cluster, I only got about 20% efficiency (still 80% of the CPU time needed compared to the old serial version) even with 16 processors; it seems already saturated at 2 processors (19% efficiency). Since each node of the cluster has 2 processors, the possible reason I could guess is that communication between nodes is costly.
I believe people here on the list must have more experience with the parallel version of LAMMPS. I know parallel efficiency is very machine-specific, but could you please share the efficiencies on your own clusters, and is the reason I listed above legitimate? I just want to make sure my LAMMPS was built correctly and that I didn't do something dumb.
thanks a lot.

Jihang

> Dear all,

dear jihang,

[...]

> However, when I tested the parallel version on the Linux cluster, I only
> got about 20% efficiency (still 80% of the CPU time needed compared to
> the old serial version) even with 16 processors; it seems already
> saturated at 2 processors (19% efficiency). Since each node of the
> cluster has 2 processors, the possible reason I could guess is that
> communication between nodes is costly.

right. let me guess: you have "only" a gigabit ethernet
network between your nodes? you should be able to see how
much time is spent on communication in the breakdown of
the time spent in the various modules at the end of a job.
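
to give an idea of where to look (the numbers below are made up, only
the layout matters), the end of the log file has a breakdown like this:

    Loop time of 120.5 on 16 procs for 1000 steps with 40000 atoms

    Pair  time (%) = 45.2 (37.5)
    Bond  time (%) = 2.1 (1.7)
    Kspce time (%) = 38.6 (32.0)
    Neigh time (%) = 8.4 (7.0)
    Comm  time (%) = 20.3 (16.8)
    Outpt time (%) = 0.5 (0.4)
    Other time (%) = 5.4 (4.5)

the Comm entry (and, if you use pppm, the Kspce entry) is the one to
watch on a slow network.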

> I believe people here on the list must have more experience with the
> parallel version of LAMMPS. I know parallel efficiency is very
> machine-specific, but could you please share the efficiencies on your
> own clusters, and is the reason I listed above legitimate? I just want
> to make sure my

well, to provide numbers to compare to, you would have to provide
the input that should be used. there are many factors that govern
performance. the most important is communication latency,
and this is where simple TCP/IP communication is a problem,
but CPU performance and available memory bandwidth also matter.

i'm attaching a list of benchmarks that i compiled about two years
ago using the CG-CMM test system from the LAMMPS distribution (and
the replicate command for the larger versions) that shows the scaling
behavior on a variety of machines. after getting to know the code better
and doing some optimizations, i am certain that the scaling could be
improved (e.g. by optimizing the processors grid, since the input
is a slab system). at the very end, you can find a comparison on a
machine where i tested gigabit ethernet and infiniband side by side,
and it is pretty telling about the performance of gigE.

cheers,
   axel.

bench.cgcmm.txt (8.83 KB)

Thanks, Axel,
the table you've provided is very helpful. The speed you got is amazing! I'm only using about 40,000 atoms, and I may need 1 month to get a 1 ns trajectory (I had to use PPPM and a 1 fs timestep, but it's still not even close to what you've had).
I'm not quite familiar with the cluster setup; I'm actually running jobs on the NCSA Mercury cluster. On their website, the listed interconnect network is "Myrinet 2000, Gigabit Ethernet, Fiber Channel", so I could be just testing on nodes connected by Gigabit Ethernet, as you mentioned.
And following your suggestion, I found the communication time at the end of the log file; for the 16-CPU run, the communication time is less than 0.3% of the total time, which made me worried about my setup. You've mentioned optimizing the processors grid for a slab system; could you please give a link for that, so I can learn about it, since my simulation systems happen to be similar to slab systems as well?
I appreciate all your help.

Jihang

> Thanks, Axel,
> the table you've provided is very helpful. The speed you got is amazing!
> I'm only using about 40,000 atoms, and I may need 1 month to get a 1 ns
> trajectory (I had to use PPPM and a 1 fs timestep, but it's still not
> even close to what you've had).

well, one reason for the high speed and good parallelization
is the fact that this input does _not_ use long-range electrostatics
and thus doesn't have to do the FFTs, which don't scale so well.

> I'm not quite familiar with the cluster setup; I'm actually running jobs
> on the NCSA Mercury cluster. On their website, the listed interconnect
> network is "Myrinet 2000, Gigabit Ethernet, Fiber Channel", so I could be
> just testing on nodes connected by Gigabit Ethernet, as you mentioned.

not likely. AFAIK, that cluster has myrinet throughout. i assume you
are using the pre-installed MPI and have not compiled your own? then
you should be ok on that count. more problematic is the fact that most
compute kernels of the c++ version of LAMMPS use some constructs that
itanium processors don't like. have you compared the single-processor
performance of the c++ version with the fortran version?
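
e.g., run the same small input once with each binary on one processor
and compare the total loop times reported at the end of the two runs
(the binary and input names below are placeholders for your own):

    ./lmp_fortran < in.test    # old fortran executable
    ./lmp_g++ < in.test        # new c++ executable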

> And following your suggestion, I found the communication time at the end
> of the log file; for the 16-CPU run, the communication time is less than
> 0.3% of the total time, which made me worried about my setup. You've
> mentioned

you should also look into how much time is spent on the other parts;
you will probably see a high percentage in the Kspce entry.

> optimizing the processors grid for a slab system; could you please give
> a link for that, so I can learn about it, since my simulation systems
> happen to be similar to slab systems as well?

see the "processors" command. LAMMPS by default distributes the number
of processors across domains according to the box size. with a slab
you have a lot of vacuum and if processors are equally distributed
across space, you get a load imbalance (some processors have to do
much more work than others) so for a system with a vaccuum in
z-direction you want to use either one or two processors in z.
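
as a sketch, for a 16-processor job with the vacuum along z:

    # keep the decomposition flat in z, so no processor gets only vacuum
    processors 4 4 1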

cheers,
    axel.

Thanks again, Axel; you always make really good points, and I've learned a lot just by reading your posts on the mailing list.
Here are some more questions:
1) I just noticed that the Itanium processors in the NCSA Mercury cluster run at only 1.3 GHz, compared to the 2.6 GHz ones you are using; that might be a reason too? Plus the issue you mentioned that the Itanium processors don't like some LAMMPS constructs, I think I'll try other sites.
2) I do get a kspace time of ~50% of the total time. Actually, I installed the OpenMPI recommended in the LAMMPS manual in my own directory, since none of the MPI libraries I found there compiled. I meant to ask about that issue before, but didn't bother because it worked eventually. Could this be an issue? If so, which MPI library do you recommend?
3) And about assigning processors in each direction: I always get a complaint about "Bad grid of processors", but I do follow the instruction that the product of Px, Py, and Pz must equal the total # of processors; e.g., I used "processors 4 4 1" for the 16-CPU job. Why am I still getting this error?
I appreciate all your kind advice.

jihang,

> 1) I just noticed that the Itanium processors in the NCSA Mercury
> cluster run at only 1.3 GHz, compared to the 2.6 GHz ones you are using;
> that might be a reason

you cannot compare clock rates across completely different architectures.
remember the intel pentium4 processor? it had very high clock rates,
but AMD opteron processors with much lower clock rates would run faster.

> too? Plus the issue you mentioned that the Itanium processors don't like
> some LAMMPS constructs, I think I'll try other sites.

yes, but those "just" make the code run slower, which should actually
help to get better scaling. nevertheless, the processors in mercury are
probably better suited for quantum chemistry calculations and similarly
memory-bandwidth-hungry applications than for classical MD.

> 2) I do get a kspace time of ~50% of the total time. Actually, I
> installed the OpenMPI recommended in the LAMMPS manual in my own
> directory, since none of the

ok. that means that you are most likely not using the GM library
and the fast interconnect.

you can validate this by running: ompi_info | grep MCA\ btl
it has to show a "gm" component to support myrinet. by default
it would only show the components: "self", "sm", "tcp".
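
for illustration, on a myrinet-enabled installation the output would
look roughly like this (the version strings below are placeholders):

    $ ompi_info | grep MCA\ btl
    MCA btl: gm (MCA v1.0, API v1.0, Component v1.2)
    MCA btl: self (MCA v1.0, API v1.0, Component v1.2)
    MCA btl: sm (MCA v1.0, API v1.0, Component v1.2)
    MCA btl: tcp (MCA v1.0, API v1.0, Component v1.2)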

if gm is missing, that means you indeed use only GigE. the fact that
Kspce takes 50% of your time is another hint.

> MPI libraries I found there compiled. I meant to ask about that issue
> before, but didn't bother because it worked eventually. Could this be an
> issue? If so, which MPI library do you recommend?

i don't know that machine, but this is what NCSA has consulting staff
for. those people get paid to help you with _exactly_ those problems.
they'll be happy to assist you.

> 3) And about assigning processors in each direction: I always get a
> complaint about "Bad grid of processors", but I do follow the instruction
> that the product of Px, Py, and Pz must equal the total # of processors;
> e.g., I used "processors 4 4 1" for the 16-CPU job. Why am I still
> getting this error?

you are probably not running correctly. that part of
lammps is simple and works correctly, so the mistake
has to be in your job script.

cheers,
   axel.

Hi, Axel, thanks again for your kind advice.
I think I'll move on to other clusters and see what happens there. I couldn't even launch the "ompi_info" command; does that indicate that something is wrong with my MPI library?
BTW, I could never get the 'processors' input straight; it drives me crazy. Even when I use "processors 1 1 2" and "mpirun -np 2 ...", which I believe should mean I'm using 2 processors in total with Px, Py, Pz of 1, 1, 2, respectively, I still get an error message complaining "Bad grid of processors". Might this be another indication that I've messed up the MPI installation?

> Hi, Axel, thanks again for your kind advice.
> I think I'll move on to other clusters and see what happens there. I
> couldn't even launch the "ompi_info" command; does that indicate that
> something is wrong with my MPI library?

it should be part of openmpi and in your path if you have installed
openmpi correctly.

> BTW, I could never get the 'processors' input straight; it drives me
> crazy. Even when I use "processors 1 1 2" and "mpirun -np 2 ...", which
> I believe should mean I'm using 2 processors in total with Px, Py, Pz of
> 1, 1, 2, respectively, I still get an error message complaining "Bad
> grid of processors". Might this be another indication that I've messed
> up the MPI installation?

or you are using the wrong mpirun command. try: which mpirun
it should point to the mpirun in your openmpi installation directory.
the fact that you get this error message consistently looks like
you are running two serial calculations at the same time. check your
output: does every line come out twice?
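
i.e., something like this (the installation prefix and the input name
below are placeholders for your own):

    $ which mpirun
    /usr/bin/mpirun    # <- a pre-installed mpi, not necessarily yours

    # call the matching mpirun explicitly via its full path instead:
    $HOME/openmpi/bin/mpirun -np 2 ./lmp_g++ < in.myrun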

cheers,
   axel.

Thanks so much, Axel.
I indeed used the wrong MPI library: 'which mpirun' was pointing to an MPI library pre-installed on the cluster, not the one I compiled LAMMPS with. After I used the complete path to the right mpirun, the code doesn't complain about 'Bad grid of processors' any more. It was just a stupid mistake; sorry for bothering you and others on the list. Hopefully this will be a good lesson for newcomers who are starting parallel runs.
Now I'll have to test the code again with the right setup and see what the parallel performance on this cluster really is.
BTW, I didn't see each line twice with the wrong setup, except for the line "LAMMPS (7 Jul 2009)"; is this the line you were referring to?
Anyway, thanks again, and have a great day!

cheers,
Jihang