Dear developer,
I have installed the LAMMPS stable version from 29 Sept 2021 on a machine with two 16-core Intel Gold (Skylake) processors and 192 GB of memory.
I am using LAMMPS with KOKKOS (built with …/cmake/presets/kokkos-openmp.cmake) and I am facing the following problem.
When running a simulation with
mpirun -np 4 lmp -k on -sf kk -in in.file
I get the following top output. I do not understand why the CPU usage is 50% instead of 100%, as it is when not using KOKKOS:
230768 pascal 20 0 673132 42992 9288 R 50.0 0.0 1213:55 lmp
230769 pascal 20 0 672000 41104 9212 R 50.0 0.0 1213:55 lmp
230771 pascal 20 0 678612 48760 9172 R 50.0 0.0 1213:55 lmp
230770 pascal 20 0 671788 41724 9172 R 49.7 0.0 1213:55 lmp
When submitting a second 4-core run, I get:
230768 pascal 20 0 673132 42992 9288 R 25.2 0.0 1207:37 lmp
263004 pascal 20 0 643352 19340 7572 R 25.2 0.0 0:04.43 lmp
263005 pascal 20 0 643740 17628 7528 R 25.2 0.0 0:03.92 lmp
230769 pascal 20 0 672000 41104 9212 R 24.9 0.0 1207:37 lmp
230770 pascal 20 0 671788 41724 9172 R 24.9 0.0 1207:37 lmp
230771 pascal 20 0 678612 48760 9172 R 24.9 0.0 1207:37 lmp
263003 pascal 20 0 643508 19520 7608 R 24.9 0.0 0:03.94 lmp
263006 pascal 20 0 643840 19772 7528 R 24.9 0.0 0:04.41 lmp
So the %CPU is divided by 2. I added --bind-to socket, but the %CPU remains the same, and it looks like everything is running on the same core?
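To check whether the ranks really are pinned to the same core, I looked at the CPU affinity of the processes. A minimal sketch of what I ran (assuming Open MPI and a Linux system; the --report-bindings option is Open MPI specific):

```shell
# With Open MPI, the binding of each rank can be printed at launch, e.g.:
#   mpirun -np 4 --bind-to core --report-bindings lmp -k on -sf kk -in in.lj
#
# On Linux, the set of CPUs a running process is allowed to use can also be
# read from /proc; shown here for the current shell process (replace "self"
# with a PID from the top output above to inspect an lmp rank):
grep Cpus_allowed_list /proc/self/status
```

If all four lmp PIDs report the same single-core Cpus_allowed_list, the ranks are indeed stacked on one core.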
For the second job (simply the in.lj bench file of LAMMPS), the log.lammps file is:
LAMMPS (29 Sep 2021)
KOKKOS mode is enabled (src/KOKKOS/kokkos.cpp:97)
will use up to 0 GPU(s) per node
using 1 OpenMP thread(s) per MPI task
package kokkos
3d Lennard-Jones melt
variable x index 1
variable y index 1
variable z index 1
variable xx equal 20*$x
variable xx equal 20*1
variable yy equal 20*$y
variable yy equal 20*1
variable zz equal 20$z
variable zz equal 20*1
units lj
atom_style atomic
lattice fcc 0.8442
Lattice spacing in x,y,z = 1.6795962 1.6795962 1.6795962
region box block 0 ${xx} 0 ${yy} 0 ${zz}
region box block 0 20 0 ${yy} 0 ${zz}
region box block 0 20 0 20 0 ${zz}
region box block 0 20 0 20 0 20
create_box 1 box
Created orthogonal box = (0.0000000 0.0000000 0.0000000) to (33.591924 33.591924 33.591924)
1 by 2 by 2 MPI processor grid
create_atoms 1 box
Created 32000 atoms
using lattice units in orthogonal box = (0.0000000 0.0000000 0.0000000) to (33.591924 33.591924 33.591924)
create_atoms CPU = 0.308 seconds
mass 1 1.0
velocity all create 1.44 87287 loop geom
pair_style lj/cut 2.5
pair_coeff 1 1 1.0 1.0 2.5
neighbor 0.3 bin
neigh_modify delay 0 every 20 check no
fix 1 all nve
run 100
Neighbor list info …
update every 20 steps, delay 0 steps, check no
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 2.8
ghost atom cutoff = 2.8
binsize = 1.4, bins = 24 24 24
1 neighbor lists, perpetual/occasional/extra = 1 0 0
(1) pair lj/cut/kk, perpetual
attributes: half, newton on, kokkos_device
pair build: half/bin/kk/device
stencil: half/bin/3d
bin: kk/device
Per MPI rank memory allocation (min/avg/max) = 2.968 | 2.968 | 2.968 Mbytes
Step Temp E_pair E_mol TotEng Press
0 1.44 -6.7733681 0 -4.6134356 -5.0197073
100 0.7574531 -5.7585055 0 -4.6223613 0.20726105
Loop time of 30.5473 on 4 procs for 100 steps with 32000 atoms
Performance: 1414.202 tau/day, 3.274 timesteps/s
24.9% CPU use with 4 MPI tasks x 1 OpenMP threads
MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total