[lammps-users] issue with coupling MPI tasks with OpenMP

Hi all
I am running a job (LAMMPS 27Oct2021) with 9 MPI tasks and 2 OpenMP threads per MPI task.
Pair_style is
pair_style hybrid eam/alloy lj/cut 4.0 table linear 4000

All of these pair styles support the omp accelerator, as checked in the manual.
I also use the fix deposit command.
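For reference, this is roughly what the same hybrid would look like with the /omp-accelerated variants selected explicitly (a sketch only; the /omp style names assume LAMMPS was built with the OPENMP/USER-OMP package):

pair_style hybrid eam/alloy/omp lj/cut/omp 4.0 table/omp linear 4000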

When the job runs there are 9 MPI tasks, but the CPU usage per task is about 101% instead of the expected 200%, given the 2 OpenMP threads per task.

Another job uses meam and fix deposit; meam does not support the omp accelerator, and there I find 100% CPU usage, as expected.

When running the in.colloid script from the examples folder with 9 MPI tasks and 2 OpenMP threads per MPI task, I find 200% CPU usage, as expected.

Is the CPU usage issue coming from fix deposit? CPU usage is correct for a script with a supported pair_style and no fix deposit, but not correct with a supported pair_style plus the fix deposit command.

Or is there something I misunderstand?

Thanks for your help.
Pascal

it is impossible to comment on this without further information, e.g. seeing a log file.

Dear Axel,

I have performed the calculations on the test cases described below and attached the log files, as you suggested.
1/ I ran the in.deposit.atom example using 4 MPI tasks and 2 OMP threads per task and got 130.6% CPU use —> log file: log_deposit-lammps27Oct2021.lammps

I ran 2 jobs with pair_style hybrid eam/alloy lj/cut 4.0 table linear 4000 and fix deposit.
2/ with 4 MPI tasks and 1 OMP thread per task —> 99.9% CPU use —> log file: logMPI4OMP1.lammps
3/ with 4 MPI tasks and 2 OMP threads per task —> 100.3% CPU use —> log file: logMPI4OMP2.lammps

Cases 2 and 3 are shorter runs that use the atoms from my original simulation. The %CPU of that original simulation is identical to case 3 after 20 days of simulation time using 4 MPI tasks and 2 OMP threads per task (checked with top, since it is still running).

For cases 1/ and 3/ I expected 200% CPU.

I hope this is of help to you.
Thanks a lot again
Best regards
Pascal

log_deposit-lammps27Oct2021.lammps (10.2 KB)

logMPI4OMP1.lammps (39.1 KB)

logMPI4OMP2.lammps (39.1 KB)

Pascal,

neither of your runs actually uses the OpenMP accelerated pair styles (which all end in /omp).
you can easily see this from the neighbor list summaries, where the pair styles in use are listed.
so the only possible multi-thread speedup would come from the neighbor lists.

my guess is that you are using the package omp command to enable multi-threading, but not the suffix command to select the accelerated styles with the /omp suffix.
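to be explicit, this is roughly what the two options look like (the thread count of 2 and the file name in.file are placeholders; both assume LAMMPS was built with the OPENMP/USER-OMP package):

# in the input script, near the top and before the styles are defined:
package omp 2
suffix omp

# or equivalently via command-line switches:
lmp -sf omp -pk omp 2 -in in.file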

please also note that your deposit run has huge load imbalances due to the domain decomposition and no adjustment of the processor distribution.
it also is not worth parallelizing such a small system (at least not with MPI) with < 1000 atoms and a fast pair style (lj/cut) with a rather short cutoff (2.5 sigma).
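as an illustration only (the specific commands and numbers below are assumptions, not taken from your input), the decomposition for a deposition run that grows along z could be adjusted with something like:

# option A: keep the domain decomposition in the x-y plane
# (must appear before the simulation box is created)
processors * * 1

# option B: periodically rebalance the decomposition as atoms are added
fix bal all balance 1000 1.05 shift z 10 1.05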

This is explained in the manual.
In LAMMPS the code path for accelerated styles is different from the serial code path, even for OpenMP parallelization, although those styles can be compiled and run without OpenMP active.

Thanks a lot Axel
The real system that is actually running has a few hundred thousand atoms. I will also check the load balance.
I reduced the system size to get a log file quickly.
What I also understand is that I do not correctly invoke the omp package. I will correct this. I only use export OMP_NUM_THREADS=2 before mpirun -n 4 lmp -in in.file → do I understand correctly that I should add -sf omp there?
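That is, something like the following (adding -pk omp 2 to set the thread count explicitly is my assumption):

export OMP_NUM_THREADS=2
mpirun -n 4 lmp -sf omp -pk omp 2 -in in.file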
Thanks
Best regards
Pascal
