LAMMPS with MPI runs slower than serial

Dear all

Could anyone tell me how to check the parallel computing algorithm of LAMMPS? When I used lmp_mpi with the USER-ATC package to simulate a system with about 100,000 atoms, 100,000 bonds, 300,000 angles, and about 90,000 dihedrals, the simulation ran slower than with the serial executable. In fact, the serial executable is 3-4 times faster than the MPI run (I used about 8 processors). For reference, the size of my system is about 100 nm. Could anyone tell me why this happens and how to fix it? I may have to simulate larger systems in the future that cannot be handled by the serial executable on my desktop. Thanks very much; any suggestion is welcome.

Best regards

Yihua Zhou

there are two things that you should check, and both require that you
set up an input that doesn't use anything from the AtC package.

a) you need to check whether you are using the LAMMPS parallelization
efficiently. by default LAMMPS divides your system into a regular grid
of subdomains based only on the volume of the simulation cell. if your
simulation cell contains a significant amount of vacuum, this can
produce a significant load imbalance. it can be addressed by either
restricting how processors are assigned via the processors command,
adjusting the dividing planes so that each subdomain holds a similar
number of atoms via the balance command, or using the tiled
decomposition instead of the regular brick. only the first of these
three options is safe for all features in LAMMPS, the second is likely
compatible, and the third may not work with fixes that were written
without taking this different decomposition scheme into consideration.
for as few as 8 CPUs, the processors keyword alone should already be
sufficient. please note that dump image can visualize the subdomains.
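a minimal sketch of those options (they are alternatives, not meant to
be combined in one input; the processor grid, thresholds, dump ID and
file names below are just placeholders for a run on 8 MPI ranks):

    # option 1: fix the processor grid explicitly, e.g. a 2x2x2 grid for 8 ranks
    processors 2 2 2

    # option 2: shift the dividing planes until each subdomain holds
    # a similar number of atoms
    balance 1.1 shift xyz 10 1.1

    # option 3: switch from the regular brick to the tiled decomposition
    # (check that all fixes in your input support it)
    comm_style tiled
    balance 1.1 rcb

    # draw the outlines of the processor subdomains into the snapshots
    dump viz all image 1000 subdomains.*.jpg type type subbox yes 0.01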
with this part of the test you can find out whether the atomistic part
of your run is parallelized correctly. you should see a significant
speedup going from 1 to 2 to 4 to 8 processors; ideally close to 2x at
every step.

b) the fix atc documentation cautions that the FE solver is replicated
and run in serial on all processors. so you should now run your input
with and without AtC in serial on the same machine and determine how
much time is spent on the FE part of the calculation. then compare the
corresponding parallel runs. if the added time stays the same, your
loss of performance was due to an inefficient domain decomposition. if
the loss of performance grows significantly with the number of
processors, you should consider using multi-threading via USER-OMP,
USER-INTEL, or KOKKOS (in multi-thread mode) instead of MPI, as that
would indicate that a significant amount of additional communication
is required. you also have to consider what kind of parallel
interconnect, if any, you are using for your parallel runs. if it is
TCP/IP, there is little hope.
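a rough sketch of how the multi-threaded route could look with the OMP
package, assuming a LAMMPS binary built with USER-OMP and 8 cores on a
single node (the thread count, binary name, and input file name are
placeholders):

    # near the top of the input script: 8 OpenMP threads on a single MPI rank
    package omp 8
    # use the /omp variants of pair, bond, etc. styles where they exist
    suffix omp

    # or, equivalently, from the command line:
    #   mpirun -np 1 lmp_omp -in in.test -pk omp 8 -sf omp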

you can also determine some of this indirectly by looking at the
performance summary at the end of a run.
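if you want that summary to be more detailed, the timer command can
help (a minimal sketch; whether you want the sync option depends on
your machine, since the extra barriers cost a little performance):

    # report a more detailed timing breakdown at the end of the run;
    # 'sync' adds MPI barriers around the timed sections, so load
    # imbalance shows up in the section that causes it instead of
    # being lumped into communication time
    timer full sync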

...of course, you want to make your test runs relatively short, or you
will be spending far too much time on testing and benchmarking.

axel