Hi All,
Thank you for the advice. Using balance RCB has sped up the simulation quite a bit, but I still think it should be faster.
what makes you think this?
have you done a proper strong scaling test? if yes, please report its results. if not, please do one and learn from it.
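a strong scaling test means running the exact same system for a fixed number of steps on an increasing number of MPI ranks and comparing the timings. a minimal sketch (the executable name "lmp" and the input file name "in.system" are placeholders for your setup):

  mpirun -np 1  lmp -in in.system -log log.1
  mpirun -np 2  lmp -in in.system -log log.2
  mpirun -np 4  lmp -in in.system -log log.4
  mpirun -np 8  lmp -in in.system -log log.8
  mpirun -np 16 lmp -in in.system -log log.16

then compare the "Performance:" lines (timesteps/s) and the MPI task timing breakdown at the end of each log file. the parallel efficiency on N ranks is (timesteps/s on N ranks) / (N * timesteps/s on 1 rank); once that drops well below, say, 50%, adding more ranks is wasted.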
please also learn about Amdahl’s law and keep in mind that it does not account for “parallel overhead”.
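for reference, the textbook form of Amdahl’s law for the ideal speedup on N processors with a parallelizable fraction p is:

  S(N) = 1 / ( (1 - p) + p / N )

even for p = 0.95 the speedup can never exceed 1 / (1 - p) = 20, no matter how many processors you add, and that is before any communication overhead is counted.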
When I run the simulation of just the liquid, there are roughly 100 atoms per core after using balance RCB. When I add the vapor phase, I also increase the number of cores used, which amounts to roughly 45 atoms per core. The liquid-vapor simulation is slower even with fewer atoms per core. Why might this be? Looking at the MPI task time breakdown, most of the time is spent in Kspace and Comm.
no parallel software scales indefinitely. at 100 or fewer atoms per core, you are already way beyond the scale-out point for simple classical force fields even without kspace. kspace has its own parallel scaling limit on top of that, and it is dominated by the parallel FFTs. for typical classical MD calculations with kspace (via PPPM), parallel efficiency is usually best when the time ratio between Pair and Kspace is about 2:1; the Pair part generally scales much better than the Kspace part.
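one knob for shifting work from Kspace to Pair is the real-space coulomb cutoff: a longer cutoff means more pair work, and PPPM can get away with a coarser grid at the same accuracy. a minimal sketch, with placeholder cutoffs and accuracy that you would have to adapt to your force field:

  pair_style   lj/cut/coul/long 10.0 12.0   # LJ cutoff 10.0, coulomb cutoff 12.0
  kspace_style pppm 1.0e-4                  # relative accuracy of the kspace solver

whether a longer or shorter cutoff is faster for your system can only be determined by testing.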
as for why parallel FFTs become a problem when using too many MPI ranks, you need to look at what is done in PPPM.
all charges are smeared out and the resulting charge density grid is computed (the smearing corresponds to the damping added to the charge interactions in real space, which has to be compensated in reciprocal space). for a 3d FFT you need to do a series of 1d FFTs along each direction, and for those to work you need all grid points along that direction on the same MPI rank. that is not how the atom data is distributed (or the grid computed) in a code with domain decomposition, so the grid data has to be redistributed, which requires MPI all-to-all communication. after that, each MPI rank holds “sticks” or “pencils” of the total grid. since you can now only distribute the grid in 2d (the third direction has to be contiguous on a single MPI rank), you are limited in how much work you can distribute. add to that that you need a transpose between each set of 1d FFTs so they can be done in x-, y-, and z-direction, and that with regular PPPM you need to do one 3d FFT forward and one 3d FFT backward. each of the transposes and each remap back and forth between “sticks” and domain-decomposed data has communication overhead, and that overhead grows with the number of MPI ranks while, conversely, the “productive work” per rank shrinks.
there is no sense in going beyond the strong scaling limit, and even reaching it is already quite wasteful.
you can limit the impact of what i like to call “the curse of the kspace” by using either a mix of multi-threading and MPI or the verlet/split run style. while multi-threading doesn’t parallelize as efficiently as MPI in LAMMPS, it allows PPPM to run with fewer MPI ranks in the 3d FFTs and thus limits the parallel overhead from the all-to-all communication. with verlet/split, you split the kspace part off onto a separate partition of MPI ranks and can thus pick for it the number of ranks where PPPM scales out.
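as a sketch of both options (rank and thread counts are placeholders; the OpenMP route requires LAMMPS to be built with the OPENMP / USER-OMP package):

  # hybrid MPI + OpenMP: 8 MPI ranks with 4 OpenMP threads each,
  # so only 8 ranks take part in the 3d FFTs
  mpirun -np 8 lmp -in in.system -sf omp -pk omp 4

  # verlet/split: 16 ranks total, 12 for pair/bonded work and 4 for PPPM
  mpirun -np 16 lmp -in in.system -partition 12 4

for verlet/split you also need, in the input script, something like:

  processors * * * part 1 2 multiple
  run_style  verlet/split

and the two partition sizes have to be compatible (please check the run_style verlet/split documentation for the details).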
since you have a rather low and inhomogeneous particle density, it is not clear how much you can gain from either of these. it also has to go in tandem with choosing the optimal cutoff for the coulomb interactions. finally, you need to check whether your kspace forces are properly converged: the accuracy-based estimator that LAMMPS uses by default to infer the reciprocal-space parameters only works properly for a homogeneous density, so for an inhomogeneous particle density you may be lacking accuracy in the high-density parts.
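a quick way to check the kspace convergence is to rerun a few configurations with a tighter accuracy (or a manually chosen, finer grid) and compare energies and forces in the dense liquid region. a sketch with placeholder values:

  kspace_style  pppm 1.0e-5        # tighter accuracy than the production setting
  kspace_modify mesh 48 48 96      # or set the grid by hand, e.g. finer along
                                   # the direction normal to the interface

if the results change noticeably, the default estimate was not good enough for your inhomogeneous system.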
axel