MPI vs OpenMP

Hi,

Say I have a node with 40 cores, and I want to run the simplest LJ problem with 32000 atoms. There are two scenarios:

  1. I can run a LAMMPS job on this node using 40 MPI processes and no OpenMP threading.
  2. I can also run 4 MPI processes and 10 OpenMP threads per MPI process.

In both cases I have 40 workers. Other than the OMP-based code optimizations, which of the two listed above is the better choice, and why? Or, why should we bother with OpenMP threading when we cannot have more than 40 workers anyway?

Thanks,
PKN

Hi,

Say I have a node with 40 cores, and I want to run the simplest LJ problem with 32000 atoms. There are two scenarios:

  1. I can run a LAMMPS job on this node using 40 MPI processes and no OpenMP threading.
  2. I can also run 4 MPI processes and 10 OpenMP threads per MPI process.

You have many more options: you can use any combination of MPI and OpenMP where the product of MPI ranks and OpenMP threads is 40.
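
To make that concrete, the two scenarios (and any other factorization of 40) would be launched roughly like this. This is only a sketch that assumes an MPI-enabled LAMMPS executable named lmp, an input file in.lj, and a build that includes the USER-OMP package; adjust names, paths, and processor binding to your setup:

    # (a) 40 MPI ranks, no OpenMP threading
    mpirun -np 40 lmp -in in.lj

    # (b) 4 MPI ranks x 10 OpenMP threads each, using the USER-OMP styles
    export OMP_NUM_THREADS=10
    mpirun -np 4 lmp -sf omp -pk omp 10 -in in.lj

    # other factorizations work the same way, e.g. 20 x 2, 10 x 4, 8 x 5;
    # make sure your MPI binds ranks and threads sensibly, e.g. one rank
    # per group of 10 cores in case (b)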

In both cases I have 40 workers. Other than the OMP-based code optimizations, which of the two listed above is the better choice, and why?

There is no simple answer to this. It always depends on the specifics of your system, your hardware, and the features used in your simulation. There is a detailed discussion in the LAMMPS manual of how to get good performance and how to take advantage of threads.

A few general trends:

  • LAMMPS was designed with MPI in mind, so this parallelization is deeply integrated into the code and everything is built around domain decomposition as the parallelization strategy. Using MPI as the primary strategy is often beneficial on modern hardware: the working data set per rank gets smaller, which improves cache efficiency. That matters a lot on modern CPUs, since they are much faster than the rate at which they can be fed data from main memory.

  • MPI parallelization can run into load balancing issues for inhomogeneous or slab systems. In the latter case, this can often be improved by using the processors keyword in a smart fashion; beyond that, there are the load balancing commands and the option of changing the communication/decomposition strategy (see the input sketch after this list).

  • Both the USER-INTEL and USER-OMP packages provide optimized code that is written to be more cache efficient and to vectorize better (USER-INTEL much more aggressively so than USER-OMP, but with restrictions on advanced functionality), and both support threads. Since threads parallelize over particles rather than domains, they are much less affected by load balancing issues. Because thread support in LAMMPS is an add-on, it is less effective and its implementation is less efficient, particularly for larger numbers of threads. This is different for the KOKKOS package when compiled with OpenMP enabled, which supports threading strategies that trade less efficient computation for better parallel scaling and thus will be superior at very large numbers of threads, e.g. as needed on Xeon Phi or KNL type CPUs. See the launch-line sketch after this list for how these packages are selected.

  • In general, you have to keep in mind that there is always a limit to how far you can parallelize (the limit of strong scaling). For dense systems with simple potentials (e.g. lj/cut with reduced units and a 2.5 sigma cutoff, or granular styles), this is often reached at a few hundred to a couple of thousand particles per processor; your 32000-atom example on 40 MPI ranks is already down to 800 atoms per rank, i.e. close to that regime. More complex potentials spend more time on the force computation, so the communication overhead and the non-parallel parts of the LAMMPS code have less impact, and the MPI scale-out limit is reached at a lower number of particles per processor, except for styles that need significant amounts of communication during the force evaluation.

  • It is often a mistake to focus only on direct parallel scaling, as there are other options to improve performance (e.g. run_style respa, or run_style verlet/split when using kspace; see the run_style sketch after this list). Also, excessive and redundant use of computes can negatively impact performance and parallel scaling.
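
To illustrate the load balancing point from the second bullet, here is a hypothetical input snippet for a slab system that is thin along z; the specific values are made up, so consult the processors, balance, fix balance, and comm_style doc pages before using them:

    # restrict the domain decomposition to the xy plane
    # (must appear before the simulation box is defined)
    processors * * 1

    # one-time static rebalancing by shifting processor boundaries
    balance 1.0 shift xy 10 1.1

    # or: dynamic load balancing with recursive bisectioning
    comm_style tiled
    fix lb all balance 1000 1.1 rcb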
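
Regarding the third bullet, the accelerator packages are all selected via command-line switches in a similar way. Again only a sketch, assuming an executable named lmp that was built with the corresponding package:

    # USER-OMP: 4 MPI ranks x 10 OpenMP threads
    mpirun -np 4 lmp -sf omp -pk omp 10 -in in.lj

    # KOKKOS with the OpenMP backend: 4 MPI ranks x 10 threads
    mpirun -np 4 lmp -k on t 10 -sf kk -in in.lj

The USER-INTEL styles are selected analogously with -sf intel.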
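
And for the last bullet, two hypothetical examples of such options for a system with bonds and long-range electrostatics (not the plain LJ case); the level and partition choices are made up and need tuning:

    # r-RESPA multi-timestepping: cheap bonded forces on the inner level,
    # pair (and, by default, kspace) forces only on the outer level
    run_style respa 2 2 bond 1 pair 2

    # or: dedicate a separate partition of MPI ranks to kspace, e.g. launched as
    #   mpirun -np 40 lmp -partition 32 8 -in your_input
    run_style verlet/split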

Axel.

This is different for the KOKKOS package when compiled with OpenMP enabled, which supports threading strategies that trade less efficient computation for better parallel scaling…

I just wanted to note that for many pair styles, the default for the KOKKOS package was changed last year to be the same as in USER-OMP and use data duplication (faster but less scalable). Atomics (slower but more scalable) can still be used as in the past, but that requires setting a compile-time flag as described in the docs.

Stan