optimal mixing of MPI and OpenMP

Hello people,

I'm trying to optimize LAMMPS performance on some rather sparse systems (~10 vol%) by mixing MPI and OpenMP parallelization to improve load balancing (following advice Axel posted a while ago). For testing I first looked at a simple, fully filled, periodic system of 16M Fe atoms in a cubic box (200^3 bcc cells). Timings were as follows:

16-way MPI, no OpenMP: 26.8 s
4-way MPI, 4 OpenMP threads: 24.9 s
single MPI process, 16 OpenMP threads: 301.8 s
single MPI process, 8 OpenMP threads: 302.7 s
The increase in loop time comes almost entirely from increased Pair time.

So even for a completely filled system, trading some MPI processes for OpenMP threads may be slightly beneficial, but it rapidly gets worse as you go to fewer MPI processes and more OpenMP threads. The question is: what determines this optimum, and is there a general pattern for where it lies, or does it need to be checked for every different type and size of calculation?

Could the slow timings with many OpenMP threads be the result of much data being channeled through one CPU core first, before being distributed over all the cores, causing a data bandwidth bottleneck on that one core? Or is it inherent in the way the code is written? Or something else still?

Also, I've compared the slight differences in results between MPI-only and mixed MPI/OpenMP parallelisation to the differences between CPU and GPU runs.

For the MPI-only run the thermo output for a 10-step run is

Step Time Temp TotEng PotEng KinEng Press Pxx Pyy
       0 0 1000.0358 -62139481 -64207717 2068236.3 11861.448 11859.309 11864.033
       1 0.001 998.18969 -62139683 -64204101 2064418.2 11847.462 11845.352 11850.036
       2 0.002 992.71788 -62140187 -64193289 2053101.6 11806.394 11804.372 11808.932
       3 0.003 983.82674 -62140675 -64175389 2034713.3 11750.559 11748.681 11753.038
       4 0.004 971.61957 -62141144 -64150611 2009466.9 11709.351 11707.672 11711.746
       5 0.005 956.20467 -62141682 -64119268 1977586.4 11704.936 11703.51 11707.223
       6 0.006 937.78343 -62142258 -64081746 1939488.2 11751.22 11750.092 11753.374
       7 0.007 916.58574 -62142839 -64038487 1895648 11856.667 11855.879 11858.66
       8 0.008 892.85012 -62143415 -63989974 1846558.9 12024.083 12023.669 12025.89
       9 0.009 866.82939 -62143980 -63936724 1792743.8 12251.429 12251.417 12253.028
      10 0.01 838.78704 -62144529 -63879276 1734747.7 12532.484 12532.897 12533.855
Loop time of 26.8387 on 16 procs for 10 steps with 16000000 atoms

While for 4-way MPI with 4 OpenMP threads it is

Step Time Temp TotEng PotEng KinEng Press Pxx Pyy
       0 0 1000.0358 -62139481 -64207717 2068236.3 11861.448 11859.309 11864.033
       1 0.001 998.18987 -62139683 -64204102 2064418.6 11847.457 11845.344 11850.022
       2 0.002 992.71862 -62140187 -64193290 2053103.1 11806.374 11804.337 11808.876
       3 0.003 983.82841 -62140675 -64175392 2034716.7 11750.508 11748.597 11752.908
       4 0.004 971.62252 -62141144 -64150617 2009473 11709.258 11707.521 11711.514
       5 0.005 956.20922 -62141682 -64119277 1977595.8 11704.788 11703.272 11706.861
       6 0.006 937.78989 -62142258 -64081760 1939501.6 11751.007 11749.753 11752.862
       7 0.007 916.59439 -62142839 -64038505 1895665.9 11856.384 11855.424 11857.988
       8 0.008 892.86118 -62143415 -63989997 1846581.7 12023.725 12023.087 12025.053
       9 0.009 866.84307 -62143980 -63936752 1792772.1 12250.998 12250.703 12252.029
      10 0.01 838.80352 -62144529 -63879311 1734781.7 12531.974 12532.038 12532.692
Loop time of 24.9285 on 16 procs for 10 steps with 16000000 atoms

For a CPU run with the same potential and a different but comparable system size the thermo output is

Step Time Temp TotEng PotEng KinEng Press Pxx Pyy
       0 0 1000.0358 -25653489 -26507929 854440.03 10398.03 10275.564 10639.04
     100 0.1 510.81514 -25651424 -26087869 436445.27 13209.695 13097.171 13424.793
     200 0.2 456.43779 -25650357 -26040342 389984.75 12835.449 12751.272 13012.617
Loop time of 403.869 on 10 procs for 200 steps with 6610000 atoms

while on GPUs the same run produces

Step Time Temp TotEng PotEng KinEng Press Pxx Pyy
       0 0 1000.0358 -25653489 -26507929 854440.03 10398.03 10275.564 10639.04
     100 0.1 510.81514 -25651424 -26087869 436445.27 13209.695 13097.171 13424.793
     200 0.2 456.43779 -25650357 -26040342 389984.75 12835.449 12751.272 13012.617
Loop time of 81.9992 on 10 procs for 200 steps with 6610000 atoms

So with OpenMP threading the total energy is well preserved, but the split between kinetic and potential energy is already slightly different right after the first MD step, and with it any quantity derived from the kinetic energy, such as the temperature. By contrast, the CPU and GPU results are still identical after 200 steps. Is the larger difference with OpenMP parallelisation something to be concerned about?

greets,
Peter

hello peter,

So even for a completely filled system, trading some MPI processes for OpenMP threads may be slightly beneficial, but it rapidly gets worse as you go to fewer MPI processes and more OpenMP threads. The question is: what determines this optimum, and is there a general pattern for where it lies, or does it need to be checked for every different type and size of calculation?

there are multiple factors that play a role. the multi-thread
implementation in USER-OMP is specifically written to be effective for
a small number of threads. for that it uses per-thread copies of the
force arrays, which requires a reduction after all forces are
computed. the overhead of this reduction operation is what limits
parallel efficiency. the impact of this depends on the relative amount
of time spent in the pair computation, and that in turn depends on the
cutoff and the algorithmic intensity of the model. e.g. for
granular models with very short cutoffs, USER-OMP is not very
effective and other multi-threading approaches are much more
advantageous.
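
for reference, a hybrid run of the "4 MPI x 4 OpenMP" kind is typically launched along these lines. this is only a sketch: the executable name lmp_omp and the input file in.fe are placeholders, and depending on your LAMMPS version you may also need a package omp command in the input script in addition to OMP_NUM_THREADS.

  # 4 MPI ranks, each running 4 OpenMP threads in the omp-suffixed styles
  export OMP_NUM_THREADS=4
  mpirun -np 4 lmp_omp -sf omp -in in.fe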

Could the slow timings with many OpenMP threads be the result of much data being channeled through one CPU core first, before being distributed over all the cores, causing a data bandwidth bottleneck on that one core? Or is it inherent in the way the code is written? Or something else still?

on typical machines the memory layout is not very optimal for this and
Linux's first-touch approach to memory locality does not help. it is
usually beneficial to use per-socket processor affinity and not worth
trying to multi-thread across sockets.
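
in practice that means pinning each MPI rank (and its threads) to one socket. with a recent Open MPI and OpenMP runtime that can be done roughly as below; the exact binding options differ between MPI implementations and versions, so treat this as a sketch (lmp_omp and in.fe are again placeholders).

  # bind each MPI rank to one socket and keep its OpenMP threads on it
  export OMP_NUM_THREADS=4
  export OMP_PROC_BIND=true
  mpirun -np 4 --bind-to socket lmp_omp -sf omp -in in.fe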

other things that have a negative impact are per-atom energy or stress
computes, as they also have to be made thread safe using multiple
copies and thus require additional reductions.

Also, I've compared the slight differences in results between MPI-only and mixed MPI/OpenMP parallelisation to the differences between CPU and GPU runs.

So with OpenMP threading the total energy is well preserved, but the split between kinetic and potential energy is already slightly different right after the first MD step, and with it any quantity derived from the kinetic energy, such as the temperature. By contrast, the CPU and GPU results are still identical after 200 steps. Is the larger difference with OpenMP parallelisation something to be concerned about?

this sounds a lot like a bug. can you check this with a small input
deck and then compare:

1 mpi, no openmp (i.e. the regular styles)
1 mpi, 1 openmp (i.e. with -sf omp, but the default of 1 thread)
1 mpi, 2 openmp
1 mpi, 4 openmp
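
something along these lines should do for the comparison (a sketch; lmp_omp, in.small and the log file names are placeholders):

  # reference run: 1 MPI rank, regular (non-omp) styles
  mpirun -np 1 lmp_omp -in in.small -log log.ref
  # omp-suffixed runs with 1, 2 and 4 threads, each writing its own log
  for t in 1 2 4; do
    OMP_NUM_THREADS=$t mpirun -np 1 lmp_omp -sf omp -in in.small -log log.omp.$t
  done
  # then compare the thermo output in log.ref against the log.omp.* files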

thanks,
     axel.

Hi Axel,

Please ignore my bit about differences between pure MPI and mixed MPI/OpenMP runs. In truly amateurish fashion, I had overlooked one small part of my input file that indirectly relies on a random number generator call, and that behaves differently for different numbers of MPI processes. Apologies for the noise on the list.

But thanks for the info on mixing MPI and OpenMP. That is very useful for me to know.

greets,
Peter

You can also look at the balance and fix balance commands, which may give better parallel efficiencies for sparse (spatially imbalanced) systems with normal all-MPI runs.
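
As a rough sketch (the threshold and rebalancing interval below are only placeholders; check the balance and fix balance doc pages of your LAMMPS version for the exact syntax), that could look like:

  # one-time rebalancing of the domain decomposition before a run
  balance 1.1 shift xyz 10 1.1

  # or periodic rebalancing every 1000 steps during the run
  fix lb all balance 1000 1.1 shift xyz 10 1.1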

Steve