Running LAMMPS with KOKKOS


I am trying to run my simulation with KOKKOS-enabled LAMMPS. I compiled LAMMPS with KOKKOS and CUDA support using the command below:
cmake -D Kokkos_ARCH_HOSTARCH=yes -D Kokkos_ARCH_GPUARCH=yes -D Kokkos_ENABLE_CUDA=yes -D Kokkos_ENABLE_OPENMP=yes -D CMAKE_CXX_COMPILER=/mylammps/lib/kokkos/bin/nvcc_wrapper -D PKG_MOLECULE=on -D PKG_MANYBODY=on -D PKG_RIGID=on -D PKG_KSPACE=on -D PKG_KOKKOS=on …/cmake

and run lammps using the command below:
mpirun -np 2 mylammps/build_kokkos/lmp -k on g 2 -sf kk -pk kokkos neigh full gpu/aware off newton on -in

However, I get an unstable pressure (of very large magnitude) when I run my NPT simulations. Basically, the minimization step ends up with -inf energy.
Nonetheless, when I run the same input file using LAMMPS compiled with the GPU package, it runs normally with no issues.

Can someone please help me with this?

Please always report:

  • your LAMMPS version and which packages are included in the compiled binary
  • your host OS and version
  • your GPU hardware and toolkit version

Please also try running all the benchmark inputs and selected example inputs, and report which ones work and give consistent results and which ones fail or diverge significantly.
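One way to make "consistent" concrete is to compare the thermo columns of two log files from the same input, e.g. a plain CPU run and a KOKKOS run. The following is only a minimal sketch in Python, not a LAMMPS tool: the function names `read_thermo` and `diverging_columns` and the tolerance value are assumptions made for illustration. It assumes the thermo block starts at a header line whose first word is `Step` and ends at the `Loop time of` line, as in the default one-line thermo output.

```python
import math


def read_thermo(log_text):
    """Return (column_names, rows) for the first thermo block in a LAMMPS log."""
    columns, rows = None, []
    for line in log_text.splitlines():
        fields = line.split()
        if columns is None:
            # The thermo header is taken to be the first line starting with "Step".
            if fields and fields[0] == "Step":
                columns = fields
            continue
        if line.startswith("Loop time of"):
            break
        # Skip warnings, errors and anything else that is not a full numeric row.
        if len(fields) != len(columns):
            continue
        try:
            rows.append([float(f) for f in fields])
        except ValueError:
            continue
    return columns or [], rows


def diverging_columns(ref_log, test_log, rel_tol=1e-3):
    """Names of thermo columns that differ between a reference and a test log.

    Rows are compared pairwise in order; if one run stopped early,
    only the rows present in both logs are compared.
    """
    ref_cols, ref_rows = read_thermo(ref_log)
    test_cols, test_rows = read_thermo(test_log)
    bad = []
    for name in ref_cols:
        if name not in test_cols:
            continue
        i, j = ref_cols.index(name), test_cols.index(name)
        for ref_row, test_row in zip(ref_rows, test_rows):
            value = test_row[j]
            # Non-finite values (inf, nan) always count as divergence.
            if not math.isfinite(value) or not math.isclose(
                ref_row[i], value, rel_tol=rel_tol, abs_tol=1e-8
            ):
                bad.append(name)
                break
    return bad
```

For example, `diverging_columns(open("log.cpu").read(), open("log.kokkos").read())` (both file names are hypothetical) returns the names of the columns that disagree. Accelerated and non-accelerated runs are not expected to match bit for bit, so the tolerance is a judgment call, and trajectories drift apart over long runs, so comparing the first few thermo outputs is the most informative.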

I am running LAMMPS on a supercomputer.
My LAMMPS version is: LAMMPS (29 Sep 2021 - Update 2)
Host OS: Red Hat Enterprise Linux 7
GPU: NVIDIA Volta V100 GPUs
Toolkit version: Driver Version: 450.51.05 CUDA Version: 11.0

This is missing the most important information: do you have problems only with your input deck or also with others (and then which ones)?

I tried some examples from the examples directory, such as indent and deposit, and both run fine with kokkos. My only problem is with my own input file which generates the following error:
Running Base_Case for Methane Hydrate
LAMMPS (29 Sep 2021 - Update 2)
KOKKOS mode is enabled (src/KOKKOS/kokkos.cpp:97)
will use up to 1 GPU(s) per node
WARNING: When using a single thread, the Kokkos Serial backend (i.e. Makefile.kokkos_mpi_only) gives better performance than the OpenMP backend (src/KOKKOS/kokkos.cpp:204)
using 1 OpenMP thread(s) per MPI task
Reading data file …
orthogonal box = (0.0000000 0.0000000 0.0000000) to (50.000000 50.000000 50.000000)
1 by 1 by 2 MPI processor grid
reading atoms …
9344 atoms
scanning bonds …
2 = max bonds/atom
scanning angles …
1 = max angles/atom
reading bonds …
5888 bonds
reading angles …
2944 angles
Finding 1-2 1-3 1-4 neighbors …
special bond factors lj: 0 0 0
special bond factors coul: 0 0 0
2 = max # of 1-2 neighbors
1 = max # of 1-3 neighbors
1 = max # of 1-4 neighbors
2 = max # of special neighbors
special bonds CPU = 0.002 seconds
read_data CPU = 0.243 seconds
8832 atoms in group tip4p
Finding SHAKE clusters …
0 = # of size 2 clusters
0 = # of size 3 clusters
0 = # of size 4 clusters
2944 = # of frozen angles
find clusters CPU = 0.026 seconds
8832 atoms in group tip4p
Finding SHAKE clusters …
0 = # of size 2 clusters
0 = # of size 3 clusters
0 = # of size 4 clusters
649 = # of frozen angles
find clusters CPU = 0.001 seconds
New timer settings: style=full mode=nosync timeout=off
PPPM initialization …
extracting TIP4P info from pair style
using 12-bit tables for long-range coulomb (src/kspace.cpp:340)
G vector (1/distance) = 0.30753356
grid = 45 45 45
stencil order = 5
estimated absolute RMS force accuracy = 0.0041685599
estimated relative force accuracy = 1.2553494e-05
using double precision FFTW3
3d grid and FFT values/proc = 81120 46575
Neighbor list info …
update every 1 steps, delay 0 steps, check yes
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 12.3154
ghost atom cutoff = 12.3154
binsize = 12.3154, bins = 5 5 5
2 neighbor lists, perpetual/occasional/extra = 1 1 0
(1) pair lj/cut/tip4p/long, perpetual
attributes: half, newton on
pair build: half/bin/newton
stencil: half/bin/3d
bin: standard
(2) compute rdf, occasional, copy from (1)
attributes: half, newton on
pair build: copy
stencil: none
bin: none
Setting up Verlet run …
Unit style : real
Current step : 0
Time step : 1
Per MPI rank memory allocation (min/avg/max) = 23.28 | 26.88 | 30.47 Mbytes
Step v_timeMS Temp Press v_D Volume v_Pavg v_nAtoms
0 0 250 -1.5143608e+252 8.3333333e+10 125000 0 9344
ERROR: Out of range atoms - cannot compute PPPM (src/KSPACE/pppm_tip4p.cpp:107)
Last command: run ${NtStepsNVT}
Sun Jan 16 20:06:01 CST 2022

But that is weird because my input file runs okay with gpu-enabled lammps.

Also, I see a bunch of core.* files being generated in the folder. What are these? Do they have something to do with the MPI tasks?

I forgot to post the minimization output:

Setting up cg/kk style minimization …
Unit style : real
Current step : 0
Per MPI rank memory allocation (min/avg/max) = 20.65 | 24.23 | 27.81 Mbytes
Step Temp E_pair E_mol TotEng Press
0 0 2.8496635e+09 2118.1423 2.8496656e+09 6.2534764e+09
52 0 -2.8891557e+16 1425.8045 -2.8891557e+16 -6.5878333e+15
Loop time of 31.0493 on 2 procs for 52 steps with 9344 atoms

Several comments on those:

  • With fewer than 10000 atoms, there is only limited benefit in using GPU acceleration, especially when using multiple GPUs. To use GPUs efficiently you need lots of work units, i.e. atoms per GPU.
  • There is no KOKKOS version of the lj/cut/tip4p/long pair style or of any other style for TIP4P, so what you are trying to do is pointless.
  • Even with the GPU package, the speedup from using the GPU is going to be limited, specifically considering the kind of GPUs you have. You probably want to oversubscribe the GPUs in that case (i.e. run several MPI ranks per GPU) to get better parallelization in the non-GPU-accelerated parts. I would be curious how much GPU acceleration is possible, or whether it would be more effective to run on the CPU.
  • Because you are using compute rdf, you will see additional overhead (= slowdown) because a redundant neighbor list build is required on the CPU. When using the GPU package, you may want to benchmark whether using CPU neighbor lists (and thus computing them only once) is faster.
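The benchmarking suggested in the last two points can be organized as a small sweep over the number of MPI ranks and the neighbor list location. The sketch below only builds the command lines and does not run anything. The function name `gpu_benchmark_commands`, the binary name `lmp`, the rank counts and the input file name are assumptions for illustration. The `-sf gpu` and `-pk gpu Ngpu neigh yes/no` options are the GPU package's command-line switches.

```python
from itertools import product


def gpu_benchmark_commands(input_file, ranks=(2, 4, 8), gpus_per_node=2, lmp="lmp"):
    """Build mpirun command lines for a GPU package benchmark sweep.

    Covers every rank count with neighbor lists built on the GPU
    (neigh yes) and on the CPU (neigh no), plus one CPU-only baseline.
    """
    commands = []
    for nranks, neigh in product(ranks, ("yes", "no")):
        commands.append(
            f"mpirun -np {nranks} {lmp} -sf gpu "
            f"-pk gpu {gpus_per_node} neigh {neigh} -in {input_file}"
        )
    # Non-accelerated baseline with the largest rank count, for comparison.
    commands.append(f"mpirun -np {max(ranks)} {lmp} -in {input_file}")
    return commands


# "in.hydrate" is a hypothetical input file name.
for command in gpu_benchmark_commands("in.hydrate"):
    print(command)
```

Using more ranks than GPUs oversubscribes the GPUs. Comparing the loop times of these runs against the CPU-only baseline shows whether the acceleration pays off for a system of this size.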

Overall, it would be helpful to also see the corresponding outputs from the non-accelerated and GPU package versions.

Bottom line: for such a tiny system, there is not much to be gained, and simulations are going to be very fast anyway.

This has nothing to do with LAMMPS directly. Please use Google or an equivalent to learn where these so-called “core dumps” originate from.

Thanks very much for your clear explanations.
I have one more concern, this time regarding LAMMPS compiled with the GPU package. When I run this input file with the GPU package, I am not able to run on more than two MPI processes. When I run on 4 MPI processes, for example, I get the following error:
Cuda driver error 700 in call at file ‘/project/folorode/madibi/lammps-3Mar20/lib/gpu/geryon/nvd_memory.h’ in line 237.
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 3

No idea.