[lammps-users] Poor performance with multiple partitions

Dear LAMMPS developers,

I am using LAMMPS to evaluate atomic forces according to the ReaxFF force field for 20 predetermined sets of atomic positions. In the main input file (crca_20201212_114224_000_0000.in) I have a loop which repeatedly includes other files (crca_20201212_114224_000_0000_{1…20}.in) containing the atomic positions. With a single processor, this takes 8.609s on my laptop running Debian stretch. However, if I instead use 2 partitions of 1 processor each, and use a uloop variable so that the 20 atomic configurations are split between the 2 partitions, the runtime increases to 11.924s instead of decreasing as I expected it to. Do you have any idea why this happened, and how I might improve the performance as I use more processors? I have attached the input files for the 1-process run in mwefast.tar.gz and the 2-process run in mweslow.tar.gz. I timed the runs by calling ‘time ./run.sh’ in each of the directories once decompressed. I’d be very grateful for your advice.

Kind regards,
Matthew Okenyi

mwefast.tar.gz (127 KB)

mweslow.tar.gz (128 KB)

the problem is in what you are testing. you are not really measuring the cost of the actual computation, but mostly the overhead of setting up the simulations and of synchronization between partitions.
if you look at the “Loop” line in the output, you see that the actual calculation takes only a minuscule amount of time (a very small fraction of a second).

when changing the “run 0” command into a “run 100” command, the behavior is much closer to what you expected.

for 2 partitions the time output is:
real 0m43.617s
user 1m12.768s
sys 0m0.650s

for 1 partition the time output is:
real 1m9.777s
user 1m8.155s
sys 0m0.605s

as you can see, the wall time (“real”) decreases with 2 partitions while the consumed CPU time (“user”) stays almost the same. thus the difference is just waiting and synchronization.
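
To put numbers on the timings quoted above, here is a small Python sketch that computes the speedup and parallel efficiency from the two “run 100” measurements. The helper name `to_seconds` and the variable names are made up for illustration; only the timing values come from this thread.

```python
# Timings copied from the "run 100" test above (output of `time`).
def to_seconds(minutes, seconds):
    """Convert a `time`-style 'XmY.YYYs' reading to seconds (hypothetical helper)."""
    return 60.0 * minutes + seconds

real_1 = to_seconds(1, 9.777)    # wall time, 1 partition
real_2 = to_seconds(0, 43.617)   # wall time, 2 partitions
user_1 = to_seconds(1, 8.155)    # CPU time, 1 partition
user_2 = to_seconds(1, 12.768)   # CPU time, 2 partitions (summed over both processes)

speedup = real_1 / real_2            # how much faster the 2-partition run finishes
efficiency = speedup / 2             # fraction of the ideal 2x speedup
cpu_overhead = user_2 / user_1 - 1   # extra CPU time spent in the 2-partition run

print(f"speedup:      {speedup:.2f}")       # 1.60
print(f"efficiency:   {efficiency:.2f}")    # 0.80
print(f"CPU overhead: {cpu_overhead:.1%}")  # 6.8%
```

So with 100 steps per configuration the two partitions reach roughly 80% of the ideal speedup, while the total CPU time grows by only a few percent.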


Dear Axel,

Thanks very much for such a prompt reply. I’ve decided to use the set command to change the atomic positions instead of clearing all the settings in between calls to run. This saves on the overhead you mention.
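
For anyone finding this thread later, one way this could look in practice is sketched below: a small Python helper that writes the LAMMPS commands for one configuration, using “set atom ID x … y … z …” followed by “run 0”. This is only an assumption about how the approach might be set up, not the actual input used here; the function name `build_set_block` and the example coordinates are placeholders.

```python
# Sketch: emit LAMMPS "set atom" commands for one set of positions,
# so the system is defined once and only coordinates change between runs.
# build_set_block and the coordinates below are illustrative, not from the thread.

def build_set_block(positions):
    """Return LAMMPS input lines that move atoms 1..N to `positions`
    and then re-evaluate the forces with a zero-step run."""
    lines = []
    for atom_id, (x, y, z) in enumerate(positions, start=1):
        lines.append(f"set atom {atom_id} x {x:.6f} y {y:.6f} z {z:.6f}")
    lines.append("run 0")  # zero-step run: forces are computed for the new positions
    return lines

# Placeholder coordinates for a two-atom configuration.
example = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.1)]
for line in build_set_block(example):
    print(line)
```

The generated lines can be written to a file and pulled in with “include”, so the force field setup is done only once instead of after every “clear”.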

Kind regards,