Hello,
For consideration, attached is a small package/patch, COMM_NPROCS, that addresses some scalability issues I’ve observed when setting up large molecular systems on large numbers of MPI ranks (e.g., 100M+ particles on 10K+ MPI ranks). The suggested changes largely target logic in a few places that loops over all processors, which becomes prohibitively expensive when setting up large-scale runs. Most users will likely see negligible performance impact from these changes, but if you regularly run large systems on 10K+ MPI ranks and observe long setup times when replicating systems, building molecular topologies, or creating remap plans for 3D FFTs, they may be worth a look.
With these changes, I was able to successfully set up a modified rhodo replicated system (without a kspace method) with 36.86 billion particles on 786,432 MPI ranks in 42 seconds, compared to the originally projected time of 18 hours. Again, this was just setting up the simulation in LAMMPS. With these changes, I’m currently projecting the PPPM setup time to be ~6 minutes for the same system, compared to the originally projected time of 12 days. In this case, the PPPM setup time would improve further if an additional hint could be passed during creation of one of the plans (the last one).
My edits and suggestions certainly don’t cover 100% of use cases in LAMMPS (everything works with rhodo), but hopefully they can all be adapted and included in the LAMMPS source and, at the very least, activated in certain cases and/or by specific request from a user with additional keywords, as in my implementation. The modifications are based on a recent git pull.
Thanks for your consideration,
chris
COMM_NPROCS.tar.gz (20.2 KB)