Hello people,
I'm trying to optimize LAMMPS performance on some rather sparse systems (~10 vol%) by mixing MPI and OpenMP parallelization to improve load balancing (following advice Axel posted a while ago). As a first test I looked at a simple, fully filled, periodic system of 16M Fe atoms in a cubic box (200^3 bcc cells). Timings were as follows:
16-way MPI, no OpenMP: 26.8 s
4-way MPI, 4 OpenMP threads: 24.9 s
single MPI process, 16 OpenMP threads: 301.8 s
single MPI process, 8 OpenMP threads: 302.7 s
The increase in loop time comes almost entirely from increased Pair time.
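To put those numbers side by side, here is a quick Python sketch that expresses each loop time relative to the 16-way MPI run. The dictionary layout and the helper name `relative_to_baseline` are just my own choices for illustration; the timings are the ones listed above.

```python
# Loop times in seconds, copied from the list above.
# Keys are (MPI processes, OpenMP threads per process).
loop_times = {
    (16, 1): 26.8,
    (4, 4): 24.9,
    (1, 16): 301.8,
    (1, 8): 302.7,
}

# Use the MPI-only run as the reference point.
baseline = loop_times[(16, 1)]


def relative_to_baseline(t, ref=baseline):
    """Return t / ref; values above 1 mean slower than the MPI-only run."""
    return t / ref


ratios = {cfg: relative_to_baseline(t) for cfg, t in loop_times.items()}

for (procs, threads), r in ratios.items():
    print(f"{procs:2d} MPI x {threads:2d} OpenMP: {r:5.2f}x the MPI-only loop time")
```

With these inputs the 4x4 run comes out about 7% faster than MPI-only, both single-process runs are roughly 11x slower, and the 8- and 16-thread single-process runs differ from each other by well under 1%.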
So even for a completely filled system, reducing the number of MPI processes may be slightly beneficial, but performance rapidly gets worse as you go to fewer MPI processes and more OpenMP threads. The question is: what determines this optimum? Is there a general pattern for where it lies, or does it need to be checked for every type and size of calculation?
Could the slow result for many OpenMP threads be caused by a lot of data being channeled through one CPU core first, before being distributed over all the cores, creating a bandwidth bottleneck on that one core? Or is it inherent in the way the code is written? Or is it something else entirely?
Also, I've compared the slight differences in results between MPI-only and mixed MPI/OpenMP parallelization with the differences between CPU and GPU runs.
For the MPI-only run, the thermo output for a 10-step run is
Step Time Temp TotEng PotEng KinEng Press Pxx Pyy
       0            0    1000.0358    -62139481    -64207717    2068236.3    11861.448    11859.309    11864.033
       1        0.001    998.18969    -62139683    -64204101    2064418.2    11847.462    11845.352    11850.036
       2        0.002    992.71788    -62140187    -64193289    2053101.6    11806.394    11804.372    11808.932
       3        0.003    983.82674    -62140675    -64175389    2034713.3    11750.559    11748.681    11753.038
       4        0.004    971.61957    -62141144    -64150611    2009466.9    11709.351    11707.672    11711.746
       5        0.005    956.20467    -62141682    -64119268    1977586.4    11704.936     11703.51    11707.223
       6        0.006    937.78343    -62142258    -64081746    1939488.2     11751.22    11750.092    11753.374
       7        0.007    916.58574    -62142839    -64038487      1895648    11856.667    11855.879     11858.66
       8        0.008    892.85012    -62143415    -63989974    1846558.9    12024.083    12023.669     12025.89
       9        0.009    866.82939    -62143980    -63936724    1792743.8    12251.429    12251.417    12253.028
      10         0.01    838.78704    -62144529    -63879276    1734747.7    12532.484    12532.897    12533.855
Loop time of 26.8387 on 16 procs for 10 steps with 16000000 atoms
while for 4-way MPI with 4 OpenMP threads it is
Step Time Temp TotEng PotEng KinEng Press Pxx Pyy
       0            0    1000.0358    -62139481    -64207717    2068236.3    11861.448    11859.309    11864.033
       1        0.001    998.18987    -62139683    -64204102    2064418.6    11847.457    11845.344    11850.022
       2        0.002    992.71862    -62140187    -64193290    2053103.1    11806.374    11804.337    11808.876
       3        0.003    983.82841    -62140675    -64175392    2034716.7    11750.508    11748.597    11752.908
       4        0.004    971.62252    -62141144    -64150617      2009473    11709.258    11707.521    11711.514
       5        0.005    956.20922    -62141682    -64119277    1977595.8    11704.788    11703.272    11706.861
       6        0.006    937.78989    -62142258    -64081760    1939501.6    11751.007    11749.753    11752.862
       7        0.007    916.59439    -62142839    -64038505    1895665.9    11856.384    11855.424    11857.988
       8        0.008    892.86118    -62143415    -63989997    1846581.7    12023.725    12023.087    12025.053
       9        0.009    866.84307    -62143980    -63936752    1792772.1    12250.998    12250.703    12252.029
      10         0.01    838.80352    -62144529    -63879311    1734781.7    12531.974    12532.038    12532.692
Loop time of 24.9285 on 16 procs for 10 steps with 16000000 atoms
For a CPU run with the same potential and a different but comparable system size, the thermo output is
Step Time Temp TotEng PotEng KinEng Press Pxx Pyy
       0            0    1000.0358    -25653489    -26507929    854440.03     10398.03    10275.564     10639.04
     100          0.1    510.81514    -25651424    -26087869    436445.27    13209.695    13097.171    13424.793
     200          0.2    456.43779    -25650357    -26040342    389984.75    12835.449    12751.272    13012.617
Loop time of 403.869 on 10 procs for 200 steps with 6610000 atoms
while on GPUs the same run produces
Step Time Temp TotEng PotEng KinEng Press Pxx Pyy
       0            0    1000.0358    -25653489    -26507929    854440.03     10398.03    10275.564     10639.04
     100          0.1    510.81514    -25651424    -26087869    436445.27    13209.695    13097.171    13424.793
     200          0.2    456.43779    -25650357    -26040342    389984.75    12835.449    12751.272    13012.617
Loop time of 81.9992 on 10 procs for 200 steps with 6610000 atoms
So with OpenMP threading the total energy is well preserved, but the split between kinetic and potential energy is already slightly different right after the first MD step, and so are quantities derived from the kinetic energy, such as temperature. By contrast, the CPU and GPU results still agree to every printed digit after 200 steps. Is the larger difference with OpenMP parallelization something to be concerned about?
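To quantify the size of that difference, here is a small Python sketch that computes relative differences between the MPI-only and the 4x4 MPI/OpenMP thermo output. Only steps 0, 1, 5 and 10 are copied in for brevity, and the helper name `rel_diff` is my own.

```python
# Thermo values copied from the two 10-step outputs above:
# step -> (Temp, TotEng, KinEng). Only a few steps are included for brevity.
mpi_only = {
    0: (1000.0358, -62139481, 2068236.3),
    1: (998.18969, -62139683, 2064418.2),
    5: (956.20467, -62141682, 1977586.4),
    10: (838.78704, -62144529, 1734747.7),
}
hybrid = {
    0: (1000.0358, -62139481, 2068236.3),
    1: (998.18987, -62139683, 2064418.6),
    5: (956.20922, -62141682, 1977595.8),
    10: (838.80352, -62144529, 1734781.7),
}


def rel_diff(a, b):
    """Relative difference |a - b| / |b|; b is assumed to be nonzero."""
    return abs(a - b) / abs(b)


# For each step: relative differences in (Temp, TotEng, KinEng).
diffs = {
    step: tuple(rel_diff(h, m) for h, m in zip(hybrid[step], mpi_only[step]))
    for step in mpi_only
}

for step, (d_temp, d_etot, d_ekin) in diffs.items():
    print(f"step {step:2d}: Temp {d_temp:.2e}  TotEng {d_etot:.2e}  KinEng {d_ekin:.2e}")
```

With the values above, the relative temperature difference is roughly 2e-7 after step 1 and roughly 2e-5 after step 10, while TotEng matches to the printed precision at every listed step.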
greets,
Peter