Hi,
I am running the in.lj script from: https://fzj-jsc.github.io/tuning_lammps/05-accelerating-lammps/index.html#learn-to-call-the-gpu-package-from-the-command-line
For the CPU case I run the script like this:
srun lmp -in in.lj
while for the GPU case (KOKKOS) I use:
srun lmp -in in_3.lj -k on g 2 -sf kk -pk kokkos cuda/aware off
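For context, both jobs run inside SLURM allocations roughly like the following; the resource requests below are a sketch of my job scripts rather than verbatim copies (in particular, the one-rank-per-GPU choice for the GPU job is an assumption):

# CPU job: 28 MPI ranks, 1 OpenMP thread each (matches the log below)
#SBATCH --ntasks=28
#SBATCH --cpus-per-task=1

# GPU job: 2 GPUs, assumed one MPI rank per GPU
#SBATCH --ntasks=2
#SBATCH --gres=gpu:2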
The performance on CPUs (28 cores) is 20348.676 tau/day, while on GPUs it is only 4622.496 tau/day. I have tried several switches, but I have not managed to reach even the CPU performance when using the GPUs.
Here are the log files; I would appreciate any comments.
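To give an idea of what I mean by "several switches", these are the kinds of KOKKOS package variations I experimented with (the list below is illustrative, not an exact record of every combination I ran):

# CUDA-aware MPI enabled instead of disabled
srun lmp -in in_3.lj -k on g 2 -sf kk -pk kokkos cuda/aware on
# half neighbor lists with Newton's third law on
srun lmp -in in_3.lj -k on g 2 -sf kk -pk kokkos neigh half newton on
# packing/unpacking of communication buffers on the device
srun lmp -in in_3.lj -k on g 2 -sf kk -pk kokkos comm device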
########### CPU ########################
using 1 OpenMP thread(s) per MPI task
Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
Created orthogonal box = (0 0 0) to (100.776 100.776 100.776)
2 by 2 by 7 MPI processor grid
Created 864000 atoms
create_atoms CPU = 0.00476717 secs
Neighbor list info ...
update every 20 steps, delay 0 steps, check no
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 2.8
ghost atom cutoff = 2.8
binsize = 1.4, bins = 72 72 72
1 neighbor lists, perpetual/occasional/extra = 1 0 0
(1) pair lj/cut, perpetual
attributes: half, newton on
pair build: half/bin/atomonly/newton
stencil: half/bin/3d/newton
bin: standard
Setting up Verlet run ...
Unit style : lj
Current step : 0
Time step : 0.005
Per MPI rank memory allocation (min/avg/max) = 14.62 | 14.78 | 14.96 Mbytes
Step Time Temp Press PotEng KinEng TotEng Density
0 0 1.44 -5.0196707 -6.7733681 2.1599975 -4.6133706 0.8442
500 2.5 0.73128446 0.46486329 -5.7188157 1.0969254 -4.6218903 0.8442
1000 5 0.70454237 0.69862404 -5.6772826 1.0568123 -4.6204703 0.8442
Loop time of 21.2299 on 28 procs for 1000 steps with 864000 atoms
Performance: 20348.676 tau/day, 47.103 timesteps/s
99.9% CPU use with 28 MPI tasks x 1 OpenMP threads
MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total