[lammps-users] cannot get a better performance on GPUs compared to CPUs

Hi,

I am running the script in.lj from: https://fzj-jsc.github.io/tuning_lammps/05-accelerating-lammps/index.html#learn-to-call-the-gpu-package-from-the-command-line

For the CPU case I am running the script in this way:
srun lmp -in in.lj

while for the GPU case (kokkos) I use:
srun lmp -in in_3.lj -k on g 2 -sf kk -pk kokkos cuda/aware off

performance on CPUs (28 cores) is 20348.676 tau/day, while on GPUs it is only 4622.496 tau/day. I have tried several switches, but I have failed to reach even the same performance with the GPUs as with the CPUs.
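To give an idea of what I mean by switches, the variations were along these lines (the keyword combinations below are only illustrative, not an exhaustive record of what I ran):

srun lmp -in in_3.lj -k on g 2 -sf kk -pk kokkos neigh half newton on cuda/aware off
srun lmp -in in_3.lj -k on g 2 -sf kk -pk kokkos neigh full newton off comm device cuda/aware off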

Here are the log files; I would appreciate any comments.

########### CPU ########################

using 1 OpenMP thread(s) per MPI task
Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
Created orthogonal box = (0 0 0) to (100.776 100.776 100.776)
2 by 2 by 7 MPI processor grid
Created 864000 atoms
create_atoms CPU = 0.00476717 secs
Neighbor list info …
update every 20 steps, delay 0 steps, check no
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 2.8
ghost atom cutoff = 2.8
binsize = 1.4, bins = 72 72 72
1 neighbor lists, perpetual/occasional/extra = 1 0 0
(1) pair lj/cut, perpetual
attributes: half, newton on
pair build: half/bin/atomonly/newton
stencil: half/bin/3d/newton
bin: standard
Setting up Verlet run …
Unit style : lj
Current step : 0
Time step : 0.005
Per MPI rank memory allocation (min/avg/max) = 14.62 | 14.78 | 14.96 Mbytes
Step Time Temp Press PotEng KinEng TotEng Density
0 0 1.44 -5.0196707 -6.7733681 2.1599975 -4.6133706 0.8442
500 2.5 0.73128446 0.46486329 -5.7188157 1.0969254 -4.6218903 0.8442
1000 5 0.70454237 0.69862404 -5.6772826 1.0568123 -4.6204703 0.8442
Loop time of 21.2299 on 28 procs for 1000 steps with 864000 atoms

Performance: 20348.676 tau/day, 47.103 timesteps/s
99.9% CPU use with 28 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total

You must have an old GPU card. This is my timing on just 1 Titan V:

LAMMPS (29 Oct 2020)
KOKKOS mode is enabled (…/kokkos.cpp:90)
will use up to 1 GPU(s) per node
Lattice spacing in x,y,z = 1.6795962 1.6795962 1.6795962
Created orthogonal box = (0.0000000 0.0000000 0.0000000) to (100.77577 100.77577 100.77577)
1 by 1 by 1 MPI processor grid
Created 864000 atoms
create_atoms CPU = 0.112 seconds
Neighbor list info …
update every 20 steps, delay 0 steps, check no
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 2.8
ghost atom cutoff = 2.8
binsize = 2.8, bins = 36 36 36
1 neighbor lists, perpetual/occasional/extra = 1 0 0
(1) pair lj/cut/kk, perpetual
attributes: full, newton off, kokkos_device
pair build: full/bin/kk/device
stencil: full/bin/3d
bin: kk/device
Setting up Verlet run …
Unit style : lj
Current step : 0
Time step : 0.005
Per MPI rank memory allocation (min/avg/max) = 135.3 | 135.3 | 135.3 Mbytes
Step Temp E_pair E_mol TotEng Press
0 1.44 -6.7733681 0 -4.6133706 -5.0196707
500 0.73128446 -5.7188157 0 -4.6218903 0.46486329
1000 0.70454238 -5.6772827 0 -4.6204704 0.69862378
Loop time of 10.1138 on 1 procs for 1000 steps with 864000 atoms

Ray

Hi Ray,

thanks for your reply. Following up on it, I changed the number of MPI tasks to 1 and got better performance than with the 28 cores of one node. So my question is: is LAMMPS Kokkos expected to scale with the number of MPI tasks/GPUs, or is it meant to be used with a single MPI task?
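For reference, the single-task run was launched roughly like this (whether one or both GPUs are requested with the g flag is a detail I am glossing over):

srun -n 1 lmp -in in_3.lj -k on g 1 -sf kk -pk kokkos cuda/aware off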

It is supposed to scale for large enough systems (the LJ benchmark with default settings is very small).
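For example, assuming the tutorial's in.lj follows the standard LAMMPS bench script and exposes the x/y/z size variables (that is an assumption about your input; the replicate command would do the same job inside the script), a much larger system on 2 GPUs could be launched as:

srun -n 2 lmp -in in.lj -var x 6 -var y 6 -var z 6 -k on g 2 -sf kk -pk kokkos cuda/aware off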

At the LAMMPS workshop that is currently happening, there were reports of using KOKKOS with a SNAP potential and a large system, scaling to the full size of a top-10 supercomputer with NVIDIA GPUs.

KOKKOS assumes one MPI task per GPU. Since it keeps all data on the GPU there is no benefit from using more MPI ranks per GPU (unlike for the GPU package, where only part of the calculation is on the GPU).
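Concretely, with 2 GPUs on a node that means launching 2 MPI tasks, one per GPU; with Slurm that could look something like this (the exact resource flags depend on your cluster and Slurm version):

srun --ntasks-per-node=2 --gpus-per-node=2 lmp -in in.lj -k on g 2 -sf kk -pk kokkos cuda/aware off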