I have a system of around 4000 molecules of H2O and CO2. The HPC cluster I use for LAMMPS has 24 processors per node. If I use one node, the speed is 63.115 timesteps/s. To speed things up, I tried using three nodes, but the speed dropped to 26.637 timesteps/s. Some people say that this is because my system is not big enough to use multiple nodes. But if I just want to save time, how can I use multiple nodes appropriately?
as a rule of thumb, typical force kernels for classical MD
systems in the condensed phase will stay efficient down to a few hundred
atoms per MPI rank. you have about 12000 atoms (4000 triatomic
molecules), i.e. roughly 500 atoms per rank on 24 cores, so with only
one node, you may already be close to the scale-out point.
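
to make the rule of thumb concrete, here is a small sketch of the arithmetic in python. the node size and the two timestep rates are taken from the question; the 3 atoms per molecule is an assumption (H2O and CO2 are both triatomic), and the function names are made up for illustration.

```python
# numbers from the question; atoms_per_molecule is an assumption
molecules = 4000
atoms_per_molecule = 3      # H2O and CO2 are both triatomic
cores_per_node = 24

atoms = molecules * atoms_per_molecule


def atoms_per_rank(nodes, threads_per_rank=1):
    """Average number of atoms owned by each MPI rank."""
    ranks = nodes * cores_per_node // threads_per_rank
    return atoms / ranks


def parallel_efficiency(rate_ref, nodes_ref, rate_new, nodes_new):
    """Measured speedup divided by the ideal speedup."""
    speedup = rate_new / rate_ref
    return speedup / (nodes_new / nodes_ref)


for nodes in (1, 2, 3):
    print(f"{nodes} node(s): {atoms_per_rank(nodes):.0f} atoms per MPI rank")

eff = parallel_efficiency(63.115, 1, 26.637, 3)
print(f"3 nodes vs 1 node: speedup {26.637 / 63.115:.2f}x, efficiency {eff:.0%}")
```

under these assumptions, one node already sits at about 500 atoms per rank, and three nodes push that well below 200, which is consistent with the slowdown reported in the question.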
in addition to that, you are probably using fix rigid (or fix
rigid/small). this requires extra communication, and especially in the
case of fix rigid, this is not very efficient for a large number of
rigid objects and thus limits scaling.
in your case, trying to use more than 1 node is not likely to produce
much of a speedup. you might get lucky using USER-OMP or USER-INTEL
multi-threaded pair styles with 2 OpenMP threads per MPI rank and
spanning 2 nodes. but that requires careful adjustment of the mpirun
command line, so that the MPI ranks are properly placed on the various
CPU cores. processor and memory binding (to the corresponding socket)
is recommended for that as well, to achieve optimal performance.
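
as an illustration of what such a command line can look like, here is a python sketch that assembles one. it assumes Open MPI 4.x-style mapping and binding options, nodes with two 12-core sockets, a LAMMPS executable called lmp and an input file called in.system (both names are placeholders). other launchers (MPICH, Intel MPI, srun) use different options, so treat this as a starting point and not as a recipe.

```python
import shlex


def hybrid_launch(nodes, sockets_per_node, cores_per_socket, threads_per_rank,
                  lammps_exe="lmp", input_file="in.system"):
    """Build an Open MPI style command line for an MPI+OpenMP LAMMPS run.

    lammps_exe and input_file are placeholder names, not taken from the thread.
    """
    if cores_per_socket % threads_per_rank:
        raise ValueError("threads per rank must divide the cores per socket")
    ranks_per_socket = cores_per_socket // threads_per_rank
    total_ranks = nodes * sockets_per_node * ranks_per_socket
    return [
        "mpirun", "-np", str(total_ranks),
        # put ranks_per_socket ranks on each socket, each with threads_per_rank cores
        "--map-by", f"ppr:{ranks_per_socket}:socket:PE={threads_per_rank}",
        "--bind-to", "core",
        "--report-bindings",            # print the placement so it can be checked
        "-x", f"OMP_NUM_THREADS={threads_per_rank}",
        lammps_exe,
        "-sf", "omp",                   # use the /omp variants of styles where available
        "-pk", "omp", str(threads_per_rank),
        "-in", input_file,
    ]


cmd = hybrid_launch(nodes=2, sockets_per_node=2, cores_per_socket=12,
                    threads_per_rank=2)
print(shlex.join(cmd))
```

for 2 nodes this gives 24 MPI ranks with 2 threads each, i.e. the same number of ranks as the single-node run, so the number of atoms per rank does not shrink any further. the node list itself still has to come from the batch system or a hostfile, and the bindings printed at startup should be checked before trusting any timings.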
LAMMPS provides a detailed performance breakdown in the output, so you
can figure out where the time is spent. when aiming for absolute
maximal performance, you want to turn those timing measurements off
(e.g. with the timer command), though, since even those have some
small overhead.