[lammps-users] CPU usage when running LAMMPS with KOKKOS

Dear developer,
I have installed the LAMMPS stable version from 29 Sep 2021 on a machine with 2 processors x 16 cores each (Intel Xeon Gold, Skylake architecture) and 192 GB of memory.
I am using LAMMPS with KOKKOS (built with …/cmake/presets/kokkos-openmp.cmake) and I am facing the following problem.
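For completeness, the build was configured via the documented CMake preset route, roughly as below (the directory names are just my own layout, so treat this as a sketch):

cd lammps-29Sep2021
mkdir build && cd build
cmake -C ../cmake/presets/kokkos-openmp.cmake ../cmake
cmake --build . -j 16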
When running a simulation with
mpirun -np 4 lmp -k on -sf kk -in in.file
I get the following output from top. I do not understand why %CPU is 50 instead of 100 as when not using KOKKOS.

230768 pascal 20 0 673132 42992 9288 R 50.0 0.0 1213:55 lmp
230769 pascal 20 0 672000 41104 9212 R 50.0 0.0 1213:55 lmp
230771 pascal 20 0 678612 48760 9172 R 50.0 0.0 1213:55 lmp
230770 pascal 20 0 671788 41724 9172 R 49.7 0.0 1213:55 lmp

When submitting a second 4-core run I got:

230768 pascal 20 0 673132 42992 9288 R 25.2 0.0 1207:37 lmp
263004 pascal 20 0 643352 19340 7572 R 25.2 0.0 0:04.43 lmp
263005 pascal 20 0 643740 17628 7528 R 25.2 0.0 0:03.92 lmp
230769 pascal 20 0 672000 41104 9212 R 24.9 0.0 1207:37 lmp
230770 pascal 20 0 671788 41724 9172 R 24.9 0.0 1207:37 lmp
230771 pascal 20 0 678612 48760 9172 R 24.9 0.0 1207:37 lmp
263003 pascal 20 0 643508 19520 7608 R 24.9 0.0 0:03.94 lmp
263006 pascal 20 0 643840 19772 7528 R 24.9 0.0 0:04.41 lmp

So the %CPU is divided by 2. I added --bind-to socket, but the %CPU remains the same,
and it looks like everything is running on the same core?

For the second job (simply the in.lj bench file of LAMMPS), the log.lammps file is:

LAMMPS (29 Sep 2021)
KOKKOS mode is enabled (src/KOKKOS/kokkos.cpp:97)
will use up to 0 GPU(s) per node
using 1 OpenMP thread(s) per MPI task
package kokkos

3d Lennard-Jones melt

variable x index 1
variable y index 1
variable z index 1

variable xx equal 20*$x
variable xx equal 20*1
variable yy equal 20*$y
variable yy equal 20*1
variable zz equal 20*$z
variable zz equal 20*1

units lj
atom_style atomic

lattice fcc 0.8442
Lattice spacing in x,y,z = 1.6795962 1.6795962 1.6795962
region box block 0 ${xx} 0 ${yy} 0 ${zz}
region box block 0 20 0 ${yy} 0 ${zz}
region box block 0 20 0 20 0 ${zz}
region box block 0 20 0 20 0 20
create_box 1 box
Created orthogonal box = (0.0000000 0.0000000 0.0000000) to (33.591924 33.591924 33.591924)
1 by 2 by 2 MPI processor grid
create_atoms 1 box
Created 32000 atoms
using lattice units in orthogonal box = (0.0000000 0.0000000 0.0000000) to (33.591924 33.591924 33.591924)
create_atoms CPU = 0.308 seconds
mass 1 1.0

velocity all create 1.44 87287 loop geom

pair_style lj/cut 2.5
pair_coeff 1 1 1.0 1.0 2.5

neighbor 0.3 bin
neigh_modify delay 0 every 20 check no

fix 1 all nve

run 100
Neighbor list info …
update every 20 steps, delay 0 steps, check no
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 2.8
ghost atom cutoff = 2.8
binsize = 1.4, bins = 24 24 24
1 neighbor lists, perpetual/occasional/extra = 1 0 0
(1) pair lj/cut/kk, perpetual
attributes: half, newton on, kokkos_device
pair build: half/bin/kk/device
stencil: half/bin/3d
bin: kk/device
Per MPI rank memory allocation (min/avg/max) = 2.968 | 2.968 | 2.968 Mbytes
Step Temp E_pair E_mol TotEng Press
0 1.44 -6.7733681 0 -4.6134356 -5.0197073
100 0.7574531 -5.7585055 0 -4.6223613 0.20726105
Loop time of 30.5473 on 4 procs for 100 steps with 32000 atoms

Performance: 1414.202 tau/day, 3.274 timesteps/s
24.9% CPU use with 4 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total
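(A note on threads: I did not pass an explicit thread count, so KOKKOS runs with 1 OpenMP thread per MPI task, as shown in the log above. If I read the documentation correctly, the explicit equivalent would be something like
mpirun -np 4 lmp -k on t 1 -sf kk -in in.lj
but I assume the single thread per task is not what causes the %CPU drop.)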

Dear all,
I have to add the following information.
When submitting:

lmp -in in.lj → 100% CPU for 1 MPI task x 1 OpenMP thread
mpirun -np 1 lmp -in in.lj → 100% CPU for 1 MPI task x 1 OpenMP thread
mpirun -np 2 lmp -in in.lj → 100% CPU for 2 MPI tasks x 1 OpenMP thread
mpirun -np 4 lmp -in in.lj → 50% CPU for 4 MPI tasks x 1 OpenMP thread
mpirun -np 4 --bind-to socket lmp -in in.lj → 50% CPU for 4 MPI tasks x 1 OpenMP thread

I do not understand where it comes from. I have never seen this reduction of %CPU before.
Thanks for your help
Best
Pascal

Is this on a workstation or on a cluster?
If the latter, what is your resource request?

It would also help if you could include the output of:
lstopo --output-format console

and/or:
numactl -H

This looks like you have access to only two physical CPU cores.
The reduction in %CPU that you see always happens when multiple MPI ranks have to share the same CPU core.
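
You could also check directly which cores your shell (and anything launched from it) is allowed to run on. On most Linux systems something like the following should work (taskset comes with util-linux):

taskset -cp $$
grep Cpus_allowed_list /proc/self/status

If these report only a couple of cores, the restriction is imposed outside of LAMMPS, e.g. by a cgroup/cpuset or some service configuration.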

Hi Axel
Thanks for your help. Yes, it looks like I have such limited access. With other software (the AMS suite) there is no such problem. And this is quite new: it came with the LAMMPS 29 Sep install.

This is a workstation and the requested outputs follow:

[pascal@gremi27 ~] numactl -H
bash: numactl: command not found...
[pascal@gremi27 ~]

[pascal@gremi27 ~] lstopo --output-format console
Machine (187GB total)
  NUMANode L#0 (P#0 93GB)
    Package L#0 + L3 L#0 (22MB)
      L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0    PU L#0 (P#0)    PU L#1 (P#32)
      L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1    PU L#2 (P#1)    PU L#3 (P#33)
      L2 L#2 (1024KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2    PU L#4 (P#2)    PU L#5 (P#34)
      L2 L#3 (1024KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3    PU L#6 (P#3)    PU L#7 (P#35)
      L2 L#4 (1024KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4    PU L#8 (P#4)    PU L#9 (P#36)
      L2 L#5 (1024KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5    PU L#10 (P#5)   PU L#11 (P#37)
      L2 L#6 (1024KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6    PU L#12 (P#6)   PU L#13 (P#38)
      L2 L#7 (1024KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7    PU L#14 (P#7)   PU L#15 (P#39)
      L2 L#8 (1024KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8    PU L#16 (P#8)   PU L#17 (P#40)
      L2 L#9 (1024KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9    PU L#18 (P#9)   PU L#19 (P#41)
      L2 L#10 (1024KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10    PU L#20 (P#10)   PU L#21 (P#42)
      L2 L#11 (1024KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11    PU L#22 (P#11)   PU L#23 (P#43)
      L2 L#12 (1024KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12    PU L#24 (P#12)   PU L#25 (P#44)
      L2 L#13 (1024KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13    PU L#26 (P#13)   PU L#27 (P#45)
      L2 L#14 (1024KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14    PU L#28 (P#14)   PU L#29 (P#46)
      L2 L#15 (1024KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15    PU L#30 (P#15)   PU L#31 (P#47)
    HostBridge L#0
      PCI 8086:a1bc
      PCI 8086:2826
        Block(Disk) L#0 "sda"
        Block(Disk) L#1 "sdc"
        Block(Disk) L#2 "sdd"
      PCI 8086:15b9
        Net L#3 "enp0s31f6"
    HostBridge L#1
      PCI 8086:201d
    HostBridge L#2
      PCIBridge
        PCI 10de:1bb1
          GPU L#4 "card0"
          GPU L#5 "renderD128"
          GPU L#6 "controlD64"
  NUMANode L#1 (P#1 94GB) + Package L#1 + L3 L#1 (22MB)
    L2 L#16 (1024KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16    PU L#32 (P#16)   PU L#33 (P#48)
    L2 L#17 (1024KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17    PU L#34 (P#17)   PU L#35 (P#49)
    L2 L#18 (1024KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18    PU L#36 (P#18)   PU L#37 (P#50)
    L2 L#19 (1024KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19    PU L#38 (P#19)   PU L#39 (P#51)
    L2 L#20 (1024KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20    PU L#40 (P#20)   PU L#41 (P#52)
    L2 L#21 (1024KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21    PU L#42 (P#21)   PU L#43 (P#53)
    L2 L#22 (1024KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22    PU L#44 (P#22)   PU L#45 (P#54)
    L2 L#23 (1024KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23    PU L#46 (P#23)   PU L#47 (P#55)
    L2 L#24 (1024KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24    PU L#48 (P#24)   PU L#49 (P#56)
    L2 L#25 (1024KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25    PU L#50 (P#25)   PU L#51 (P#57)
    L2 L#26 (1024KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26    PU L#52 (P#26)   PU L#53 (P#58)
    L2 L#27 (1024KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27    PU L#54 (P#27)   PU L#55 (P#59)
    L2 L#28 (1024KB) + L1d L#28 (32KB) + L1i L#28 (32KB) + Core L#28    PU L#56 (P#28)   PU L#57 (P#60)
    L2 L#29 (1024KB) + L1d L#29 (32KB) + L1i L#29 (32KB) + Core L#29    PU L#58 (P#29)   PU L#59 (P#61)
    L2 L#30 (1024KB) + L1d L#30 (32KB) + L1i L#30 (32KB) + Core L#30    PU L#60 (P#30)   PU L#61 (P#62)
    L2 L#31 (1024KB) + L1d L#31 (32KB) + L1i L#31 (32KB) + Core L#31    PU L#62 (P#31)   PU L#63 (P#63)
[pascal@gremi27 ~]

Best
Pascal

This is not something that LAMMPS manipulates, so it has to be due to your MPI library, e.g. it cannot properly detect your hardware topology,
or you have some configuration file somewhere that changes what it considers available.

If you are using OpenMPI, then try running with --display-allocation added. That may provide a hint, but at this point this is getting very difficult to debug remotely.
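
For example, something along these lines (Open MPI 4.x option names assumed):

mpirun -np 4 --display-allocation --report-bindings lmp -in in.lj

The --report-bindings output should show whether all four ranks end up bound to the same core or socket.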

Hi Axel
Thank you
I have checked the System Monitor and there are 64 CPUs detected (I guess these are the logical ones, i.e. 2 times the physical ones?).
It is mpirun (Open MPI) 4.1.1.
I have also rebooted and tried again with and without the flag you suggested.
It does not change anything.
I installed OpenMPI with the native GCC 4.8.5, but to use KOKKOS I installed GCC 11.1 and used it to build LAMMPS 29 Sep.
Could this be the issue? And if yes, will I need to re-install OpenMPI with GCC 11.1?
Best
Pascal

No

OK, thanks Axel.
I have just run an AMS suite program on 16 cores using OpenMPI and there was no problem: CPU usage was 100%. I searched for a log file from the last LAMMPS version (27 May 2021) used prior to the installation of LAMMPS 29 Sep 2021 and found a 9-MPI-task run which successfully ran with 100% CPU on each task. Since then, only the LAMMPS version change and a CentOS yum update have been done.
Also, I have installed exactly the same LAMMPS version on another machine and there each MPI task uses 100% CPU, both with (ARCH = SNB) and without KOKKOS.
I do not know if that helps.
Do you need more info, or should I search in a direction other than the LAMMPS install?
Best
Pascal

I cannot help any further. You need someone local.

I understand

Thanks a lot for all you did

Pascal Brault
DR CNRS
GREMI UMR 7344
CNRS Université d’Orléans