[lammps-users] Very Slow MPI-OMP Run

Dear all,

After doing some updates on my Linux machines, I noticed that MPI-OMP and plain MPI runs of LAMMPS have become very slow when the number of cores is increased.

I have attached an Excel file with some information, including the Linux version, CPU layout, Open MPI version, packages installed in LAMMPS, and results for the LAMMPS benchmark with the script “in.meam”.

I used the latest stable version (lammps-29Oct20) and the latest version of LAMMPS (lammps-24Dec20), and the speed was very low for both. I got the same low speed on a different Linux machine.

I would appreciate it if anyone could help me with this issue. By the way, I used the commands below to run the tests:

mpirun -np 40 --bind-to socket --map-by socket ./lmp -sf omp -pk omp 1 -in in.meam

mpirun -np 40 ./lmp -in in.meam

#define _OPENMP 201511
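(201511 is the compiler's _OPENMP version macro, i.e. OpenMP 4.5; with GCC a value like this can be printed with, e.g.:

echo | gcc -fopenmp -dM -E - | grep _OPENMP )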

Thanks,

Reza.

in.meam (347 Bytes)

Ni.meam (20 Bytes)

library.meam (9.38 KB)

LAMMPS MPI-OMP Test.xlsx (27.3 KB)

two comments:
a) it is meaningless to use the omp suffix since there is no USER-OMP variant of the “meam/c” pair style and the other multi-threaded code paths are - by construction - slightly slower with only 1 thread. …and the timing info show show that the vast majority of time is spent on the pair style anyway (so time saved on any other parts of the code will only have a minor impact).
b) the slowdown looks more like you have a usage conflict on the node you are running on, and that there are only two free CPU cores available. it would be more helpful to see the entire timing output to get a better understanding of where the time is spent and whether the processes are using the CPUs effectively or there are (unexpected) load imbalances.

please find below the output of:

for s in 1 2 4 8 12 16 24 32 ; do mpirun -np $s ./lmp -in in.meam -log log.$s ; grep -A14 Loop log.$s >> timing.txt ; echo '---------' >> timing.txt; done

on a 4-way octa-core machine with the latest LAMMPS version (but all recent LAMMPS versions should produce equivalent timings). For your system size of 32000 atoms, parallel scaling should be quite good, even though there is some non-parallelizable overhead in the MEAM pair style computation. At 32 MPI ranks you can see that the “Pair” section of the calculation is at about 70% parallel efficiency ( 28.32/(1.2597*32)*100.0 = 70.255 ). On top of that you have about 6% overhead from communication, leading to a total parallel efficiency of about 66%, with the number of Timesteps/s growing steadily from 3.454 to 73.060. Please take note that load imbalances are small (but growing) for Pair and that CPU utilization is at or close to 100% (as it should be).
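(For reference, the same efficiency number can be recomputed from the grepped timing output with a small awk one-liner; the times below are the Pair times of the 1-rank and 32-rank runs quoted above:

awk -v t1=28.32 -v tN=1.2597 -v N=32 'BEGIN { printf "Pair parallel efficiency: %.1f %%\n", t1/(tN*N)*100.0 }' )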

In short, there doesn’t seem to be an intrinsic slowdown of LAMMPS in the way you claim there is, which in turn means that there is a local problem.

Axel.

Loop time of 28.948 on 1 procs for 100 steps with 32000 atoms

Performance: 1.492 ns/day, 16.082 hours/ns, 3.454 timesteps/s
100.0% CPU use with 1 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total

Dear Axel,

Thanks for the recommendations.

Following your suggestion, I have attached more information to this email for two different computers: one with 40 cores (lammps-24Dec20) and one with 28 cores (lammps-29Oct20). I tried to provide as much of the required information as possible for the two systems in the related Excel files. Both systems show poor scaling as the number of cores is increased.

I agree with you that this should not be a LAMMPS issue, since I had been using many cores with the same LAMMPS versions on these machines without such a problem. Just to emphasize, as far as I can tell this weird issue appeared after updating Linux on both computers.

By the way, I asked another user of the 28-core machine to run the benchmark “in.eam” script, and I got the results below, which show good scaling with increasing number of cores, although with a different version of LAMMPS and Open MPI.

Thanks in advance for your time and consideration,

Reza.

LAMMPS: 3Mar20
MPI: Open MPI 2.1.1
Linux: Ubuntu 18.04

timing28Cores-lammps-29Oct20.txt (5.01 KB)

in.eam (511 Bytes)

timing40Cores-lammps-24Dec20.txt (5.71 KB)

LAMMPS 28 Cores Computer-lammps-29Oct20.xlsx (30.1 KB)

LAMMPS 40 Cores Computer lammps-29Oct20.xlsx (32.9 KB)

Cu_u3.eam (35.8 KB)

If you look at your timing output, you will see that for all runs not producing the expected performance, your CPU utilization is not in the 99-100% range. This is something outside of the control of LAMMPS and means that, for some reason, your calculations have to share CPU cores. The most likely explanation is that there are multiple calculations running at the same time and thus blocking each other. Another possible reason is that you are running inside a limited partition of a compute node on an HPC cluster due to an incorrect request to the batch system. The least likely explanation is some misconfiguration of the MPI library. Any of these is impossible to debug without access to the same hardware and without observing you while running the calculations, and in either case it is something that should be rather trivial to determine and correct on your side.
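As a quick sanity check (a sketch only, using standard Linux tools), one can look at the load and the busiest processes on the node just before and during a run, e.g.:

uptime                                                      # load average vs. number of physical cores
ps -eo pid,user,pcpu,pmem,comm --sort=-pcpu | head -n 20    # top CPU consumers on the node

If the load average is already close to the core count before your job starts, or other large processes show up at the top, your runs are sharing cores with something else.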

Be that as it may, neither explanation is related to the LAMMPS version (or the OS or any other software version) and thus this is off-topic for this mailing list. Keep in mind that correlation != causation. The fact that another user gets the expected scaling on the same machine confirms that this is a case of PEBCAK.

Axel.

p.s.: when posting to a mailing list, please never send Excel sheets (or files in any other application-specific format) when simple text files will serve the same purpose.

Dear Axel,

I am not sure if this was the main cause of the slow parallel runs, but I had been following the documentation of the KOKKOS package:

" For binding threads with KOKKOS OpenMP, use thread affinity environment variables to force binding. With OpenMP 3.1 (gcc 4.7 or later, intel 12 or later) setting the environment variable OMP_PROC_BIND=true should be sufficient. In general, for best performance with OpenMP 4.0 or later set OMP_PROC_BIND=spread and OMP_PLACES=threads"

I set the environment variables as above to use the OMP package:

export OMP_PROC_BIND=spread

export OMP_PLACES=threads

However, after removing these variables, as you can see from the attached text file, the scaling with increasing number of cores seems to be okay.

for s in 1 2 4 8 12 16 24 32 40; do mpirun -np $s "…/lammps-24Dec20/build/lmp" -in in.meam -log log.$s ; grep -A14 Loop log.$s >> timing.txt ; echo '---------' >> timing.txt; done

Bests and stay safe,

Reza.

timing.txt (6.35 KB)

This statement clearly says that it applies to using the KOKKOS package with OpenMP, which is not used in your case. As I already pointed out, there is no OpenMP support in any of the features your input is using, so there cannot be any benefit from using OpenMP-related processor affinity settings. Additionally, for as long as you are setting OpenMP processor affinity, it is only logical that you should also use the corresponding processor affinity settings of your MPI library (which typically will give you a small performance benefit). If you apply affinity settings inconsistently, it is not surprising to get inconsistent performance.
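For illustration only (not a recommendation for this meam/c input, where OpenMP brings no benefit): if one does want to combine OpenMP thread affinity with MPI binding under Open MPI, the two sets of settings have to match each other, e.g. for a hybrid run with 2 threads per rank on a 40-core machine something along these lines:

export OMP_NUM_THREADS=2
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
mpirun -np 20 --bind-to core --map-by socket:PE=2 ./lmp -sf omp -pk omp 2 -in in.meam

Here the MPI library reserves 2 cores per rank (PE=2), consistent with the 2 OpenMP threads requested from LAMMPS.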

So my assessment stands: whatever is going wrong is something that is incorrect on your side and not a LAMMPS issue.

Axel.