Hi,
I am new to this mailing list. The simulation I am trying to run is a protein in water with Na and Cl
ions, using the PPPM algorithm for the long-range electrostatics. The LAMMPS version I am running
supports Kokkos for GPUs.
I already tried a simpler model, a pure LJ simulation (without long-range
electrostatics), and saw a drastic speedup when using GPUs compared to pure-CPU
LAMMPS.
For the more realistic protein case described above, however, the simulation
time increased with respect to the pure-CPU case. I am requesting the following resources through SLURM:
#SBATCH -n 28
#SBATCH -c 1
#SBATCH --gres=gpu:k80
module load lammps-kokkos
srun lmp -in input.inp -k on g 2 -sf kk -pk kokkos newton off neigh full comm device
The log file tells me that Kspace is the major bottleneck of the simulation:
MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total
There is not enough information here to make an informed suggestion.
Please provide the full log file, or at least everything up to and including the first thermo output.
thanks,
axel.
Hi,
Here is the output I have:
KOKKOS mode is enabled (src/KOKKOS/kokkos.cpp:85)
will use up to 2 GPU(s) per node
using 1 OpenMP thread(s) per MPI task
package kokkos
package kokkos newton off neigh full comm device
#package gpu 1 split -1 # input script command
echo screen
Switching to CHARMM coulomb energy conversion constant (src/KSPACE/pair_lj_charmmfsw_coul_long.cpp:68)
orthogonal box = (-59.5 -59.5 -59.5) to (59.5 59.5 59.5)
2 by 2 by 7 MPI processor grid
reading atoms …
158944 atoms
scanning bonds …
4 = max bonds/atom
scanning angles …
15 = max angles/atom
scanning dihedrals …
44 = max dihedrals/atom
scanning impropers …
4 = max impropers/atom
reading bonds …
106440 bonds
reading angles …
55775 angles
reading dihedrals …
6081 dihedrals
reading impropers …
373 impropers
4 = max # of 1-2 neighbors
9 = max # of 1-3 neighbors
19 = max # of 1-4 neighbors
21 = max # of special neighbors
special bonds CPU = 0.0128215 secs
read_data CPU = 4.6905 secs
orthogonal box = (-59.5 -59.5 -59.5) to (59.5 59.5 59.5)
158944 atoms before read
158944 atoms in snapshot
0 atoms purged
158944 atoms replaced
0 atoms trimmed
0 atoms added
158944 atoms after read
PPPM initialization …
using 12-bit tables for long-range coulomb (src/kspace.cpp:332)
G vector (1/distance) = 0.2772
grid = 128 128 128
stencil order = 5
estimated absolute RMS force accuracy = 0.000451205
estimated relative force accuracy = 1.35876e-06
using double precision cuFFT
3d grid and FFT values/proc = 131066 81920
Neighbor list info …
update every 1 steps, delay 5 steps, check yes
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 14
ghost atom cutoff = 14
binsize = 14, bins = 9 9 9
1 neighbor lists, perpetual/occasional/extra = 1 0 0
(1) pair lj/charmmfsw/coul/long, perpetual
attributes: half, newton off
pair build: half/bin/newtoff
stencil: half/bin/3d/newtoff
bin: standard
WARNING: No fixes defined, atoms won’t move (src/verlet.cpp:52)
PPPM initialization …
using 12-bit tables for long-range coulomb (src/kspace.cpp:332)
G vector (1/distance) = 0.2772
grid = 128 128 128
stencil order = 5
estimated absolute RMS force accuracy = 0.000451205
estimated relative force accuracy = 1.35876e-06
using double precision cuFFT
3d grid and FFT values/proc = 131066 81920
WARNING: Inconsistent image flags (src/domain.cpp:785)
Per MPI rank memory allocation (min/avg/max) = 84.86 | 87.13 | 88.78 Mbytes
Step Time Xlo Xhi Ylo Yhi Zlo Zhi TotEng PotEng KinEng Temp Press E_bond E_angle E_dihed E_impro E_vdwl E_coul E_long Temp Volume
0 0 -59.5 59.5 -59.5 59.5 -59.5 59.5 -558217.2 -701843.27 143626.07 303.15 -807.22378 25620.186 14861.293 1089.2972 12.799179 105845.42 2005820.3 -2855092.6 303.15 1685159
10 10 -59.5 59.5 -59.5 59.5 -59.5 59.5 -558217.2 -701843.27 143626.07 303.15 -807.22378 25620.186 14861.293 1089.2972 12.799179 105845.42 2005820.3 -2855092.6 303.15 1685159
Loop time of 12.0415 on 28 procs for 10 steps with 158944 atoms
Performance: 0.072 ns/day, 334.486 hours/ns, 0.830 timesteps/s
99.8% CPU use with 28 MPI tasks x 1 OpenMP threads
MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total
thanks.
I see two problems here:
- You are using the pair style lj/charmmfsw/coul/long, which is not supported by KOKKOS; this means that atom data has to move back and forth between host memory and GPU memory.
- You are massively oversubscribing the GPUs. When using KOKKOS it is advisable to run only one MPI task per GPU; oversubscribing only helps to speed up the non-accelerated parts of the code, and in that case you should use the CUDA multi-process service (MPS) to reduce the overhead of oversubscribing the GPUs (see the sketch below). There is more information about this in the documentation: https://docs.lammps.org/Speed_kokkos.html
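For example, something along these lines (an untested sketch: I am assuming your node has one K80 board, i.e. two GPU devices, and that your SLURM installation accepts a device count in --gres; adjust names and counts for your cluster). The idea is one MPI task per GPU device, so 2 tasks and 2 GPUs:
#SBATCH -n 2
#SBATCH -c 1
#SBATCH --gres=gpu:k80:2
module load lammps-kokkos
srun lmp -in input.inp -k on g 2 -sf kk -pk kokkos newton off neigh full comm device
If you do want to run more MPI tasks than GPUs to speed up the non-accelerated parts of the code, start the CUDA MPS control daemon on each node first (nvidia-cuda-mps-control -d), as described on the documentation page above.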
In summary, for this input you are likely better off either not using KOKKOS (and GPUs) at all, or changing it to use a supported pair style.
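If your force field setup allows it, a KOKKOS-supported alternative would be the regular CHARMM pair style, for example (just a sketch: the cutoffs are an assumption based on your 14 Angstrom neighbor cutoff, and note that lj/charmm/coul/long uses energy switching rather than the force switching of lj/charmmfsw/coul/long, so you need to check whether that is acceptable for your model):
pair_style lj/charmm/coul/long 10.0 12.0
Keep your existing kspace_style pppm settings; with -sf kk on the command line this would automatically become lj/charmm/coul/long/kk and pppm/kk.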
axel.