Kokkos on GPU seems incompatible with atom_style full

Dear @stamoor,

Whenever atom_style full is used with the KOKKOS accelerator on a GPU,
I get a core-dump error with LAMMPS executables that otherwise run ReaxFF simulations and other standard simulations on GPU with Kokkos smoothly (and very efficiently!).

A simple way to reproduce the error is to run the in.rhodo bench input file.
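
For example, I launch it as:

lmp -in in.rhodo -k on g 1 t 1 -sf kk -pk kokkos neigh half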

I get the same problem with the 21Nov2023 and 17Apr2024 patch releases.
The executables are compiled with both the KOKKOS and MOLECULE packages for NVIDIA V100 GPUs, with the CUDA 12.2.91 toolkit and GNU compilers 12.2 (see the attachment for more information).

Ultimately, I want to run NPT simulations with the OPLS force field (only the preamble is shown):

Summary

units real
dimension 3
boundary p p p

atom_style full # core dump

bond_style harmonic
angle_style harmonic
dihedral_style opls
improper_style harmonic

pair_style lj/cut/coul/long 13.0 13.0

pair_modify mix geometric
special_bonds lj/coul 0.0 0.0 0.5
kspace_style pppm 1e-6

which should not be a problem, as all of these fixes and styles are documented to work with the KOKKOS package.

As a side note: I noticed that between the 21Nov2023 and 17Apr2024 patch releases it became necessary to set the FFT_KOKKOS CMake variable to “CUFFT” to avoid falling back to the default KISS FFT library; this is not mentioned in the documentation on building the KOKKOS package.
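
Concretely, this is the line in my CMake preset (the full file is attached further down in this thread):

set(FFT_KOKKOS "CUFFT" CACHE STRING "" FORCE)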

Best,
Amaël

Summary

-- <<< Build configuration >>>
LAMMPS Version: 20231121
Operating System: Linux Red 8.6
CMake Version: 3.25.2
Build type: RELEASE
Generator: Unix Makefiles using /usr/bin/gmake
-- Enabled packages: BROWNIAN;CLASS2;CORESHELL;DIELECTRIC;DIPOLE;EXTRA-COMPUTE;EXTRA-DUMP;EXTRA-FIX;EXTRA-MOLECULE;EXTRA-PAIR;FEP;GPU;INTEL;KOKKOS;KSPACE;MANYBODY;MC;MEAM;MISC;MOLECULE;OPENMP;OPT;QTB;REAXFF;REPLICA;RIGID;TALLY
-- <<< Compilers and Flags: >>>
-- C++ Compiler: /gpfslocalsup/spack_soft/gcc/12.2.0/gcc-8.5.0-ptka3d5gf3nvhwkl6g5bgw7uzksjoywv/bin/g++
Type: GNU
Version: 12.2.0
C++ Flags: -O3 -DNDEBUG -march=cascadelake -mtune=cascadelake
Defines: LAMMPS_SMALLBIG;LAMMPS_MEMALIGN=64;LAMMPS_OMP_COMPAT=4;LAMMPS_JPEG;LAMMPS_PNG;LAMMPS_GZIP;FFT_SINGLE;FFT_MKL;FFT_MKL_THREADS;LMP_OPENMP;$<BUILD_INTERFACE:LMP_KOKKOS>;FFT_CUFFT;LMP_INTEL;LMP_INTEL_USELRT;LMP_USE_MKL_RNG;LMP_GPU
Options: -Xcudafe;--diag_suppress=unrecognized_pragma
-- <<< Linker flags: >>>
-- Executable name: lmp_impis-V100
-- Static library flags:
-- <<< MPI flags >>>
-- MPI_defines: MPICH_SKIP_MPICXX;OMPI_SKIP_MPICXX;_MPICC_H
-- MPI includes: /gpfs7kro/gpfslocalsup/spack_soft/openmpi/4.1.5/gcc-12.2.0-k5b6xwux5o26nktzmvewnoya5metw33z/include;/gpfslocalsup/spack_soft/openmpi/4.1.5/gcc-12.2.0-k5b6xwux5o26nktzmvewnoya5metw33z/include
-- MPI libraries: /gpfslocalsup/spack_soft/openmpi/4.1.5/gcc-12.2.0-k5b6xwux5o26nktzmvewnoya5metw33z/lib/libmpi.so;
-- <<< GPU package settings >>>
-- GPU API: CUDA
-- CUDA Compiler: /gpfslocalsys/cuda/12.2.0/bin/nvcc
-- GPU default architecture: sm_70
-- GPU binning with CUDPP: OFF
-- CUDA MPS support: yes
-- GPU precision: MIXED
-- Kokkos Devices: CUDA;CUDA_LAMBDA;OPENMP;SERIAL
-- Kokkos Architecture: ICX;VOLTA70
-- <<< FFT settings >>>
-- Primary FFT lib: MKL
-- Using single precision FFTs
-- Using threaded FFTs
-- Kokkos FFT: cuFFT

Where is the KSPACE package?
Before running with Kokkos, have you tried running without?
Do you get an error?

It is compiled with the KSPACE package as well, and it runs fine without Kokkos.

@Amael thanks for reporting, I will take a look.

@Amael I tried running the Rhodo benchmark here: lammps/bench/in.rhodo at 628531dadbdcb63b16591f599d61b004c43a16d5 · lammps/lammps · GitHub, which uses the “full” atom style, and it ran fine for me on a V100 GPU.

./lmp -in in.rhodo -k on g 1 -sf kk -pk kokkos neigh half

LAMMPS (7 Feb 2024 - Development - patch_7Feb2024_update1-312-gedcbd2e-modified)
KOKKOS mode with Kokkos version 4.2.1 is enabled (../kokkos.cpp:72)
  will use up to 1 GPU(s) per node
Reading data file ...
  orthogonal box = (-27.5 -38.5 -36.3646) to (27.5 38.5 36.3615)
  1 by 1 by 1 MPI processor grid
  reading atoms ...
  32000 atoms
  reading velocities ...
  32000 velocities
  scanning bonds ...
  4 = max bonds/atom
  scanning angles ...
  18 = max angles/atom
  scanning dihedrals ...
  40 = max dihedrals/atom
  scanning impropers ...
  4 = max impropers/atom
  reading bonds ...
  27723 bonds
  reading angles ...
  40467 angles
  reading dihedrals ...
  56829 dihedrals
  reading impropers ...
  1034 impropers
Finding 1-2 1-3 1-4 neighbors ...
  special bond factors lj:    0        0        0       
  special bond factors coul:  0        0        0       
     4 = max # of 1-2 neighbors
    12 = max # of 1-3 neighbors
    24 = max # of 1-4 neighbors
    26 = max # of special neighbors
  special bonds CPU = 0.026 seconds
  read_data CPU = 1.958 seconds
Finding SHAKE clusters ...
    1617 = # of size 2 clusters
    3633 = # of size 3 clusters
     747 = # of size 4 clusters
    4233 = # of frozen angles
  find clusters CPU = 0.009 seconds
PPPM Kokkos initialization ...
  using 12-bit tables for long-range coulomb (../kspace.cpp:342)
  G vector (1/distance) = 0.24883488
  grid = 25 32 32
  stencil order = 5
  estimated absolute RMS force accuracy = 0.035547797
  estimated relative force accuracy = 0.00010705113
  using double precision cuFFT
  3d grid and FFT values/proc = 41070 25600
Generated 2278 of 2278 mixed pair_coeff terms from arithmetic mixing rule
Neighbor list info ...
  update: every = 1 steps, delay = 5 steps, check = yes
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 12
  ghost atom cutoff = 12
  binsize = 12, bins = 5 7 7
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair lj/charmm/coul/long/kk, perpetual
      attributes: half, newton off, kokkos_device
      pair build: half/bin/newtoff/kk/device
      stencil: full/bin/3d
      bin: kk/device
Setting up Verlet run ...
  Unit style    : real
  Current step  : 0
  Time step     : 2
Per MPI rank memory allocation (min/avg/max) = 137.7 | 137.7 | 137.7 Mbytes
------------ Step              0 ----- CPU =            0 (sec) -------------
TotEng   =    -25356.2064 KinEng   =     21444.8313 Temp     =       299.0397 
PotEng   =    -46801.0377 E_bond   =      2537.9940 E_angle  =     10921.3742 
E_dihed  =      5211.7865 E_impro  =       213.5116 E_vdwl   =     -2307.8634 
E_coul   =    207025.8927 E_long   =   -270403.7333 Press    =      -149.3301 
Volume   =    307995.0335
------------ Step             50 ----- CPU =    0.3574469 (sec) -------------
TotEng   =    -25330.0317 KinEng   =     21501.0005 Temp     =       299.8229 
PotEng   =    -46831.0321 E_bond   =      2471.7035 E_angle  =     10836.5102 
E_dihed  =      5239.6320 E_impro  =       227.1218 E_vdwl   =     -1993.2873 
E_coul   =    206797.6802 E_long   =   -270410.3926 Press    =       237.6571 
Volume   =    308031.6762
------------ Step            100 ----- CPU =    0.7259853 (sec) -------------
TotEng   =    -25290.7303 KinEng   =     21591.9084 Temp     =       301.0906 
PotEng   =    -46882.6387 E_bond   =      2567.9807 E_angle  =     10781.9571 
E_dihed  =      5198.7492 E_impro  =       216.7864 E_vdwl   =     -1902.6618 
E_coul   =    206659.5227 E_long   =   -270404.9730 Press    =         6.7407 
Volume   =    308134.2285
Loop time of 0.726049 on 1 procs for 100 steps with 32000 atoms

Performance: 23.800 ns/day, 1.008 hours/ns, 137.732 timesteps/s, 4.407 Matom-step/s
74.0% CPU use with 1 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 0.25081    | 0.25081    | 0.25081    |   0.0 | 34.54
Bond    | 0.038983   | 0.038983   | 0.038983   |   0.0 |  5.37
Kspace  | 0.17672    | 0.17672    | 0.17672    |   0.0 | 24.34
Neigh   | 0.10286    | 0.10286    | 0.10286    |   0.0 | 14.17
Comm    | 0.02486    | 0.02486    | 0.02486    |   0.0 |  3.42
Output  | 0.00025486 | 0.00025486 | 0.00025486 |   0.0 |  0.04
Modify  | 0.10756    | 0.10756    | 0.10756    |   0.0 | 14.82
Other   |            | 0.024      |            |       |  3.31

Nlocal:          32000 ave       32000 max       32000 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Nghost:          47958 ave       47958 max       47958 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Neighs:    1.43049e+07 ave 1.43049e+07 max 1.43049e+07 min
Histogram: 1 0 0 0 0 0 0 0 0 0

Total # of neighbors = 14304913
Ave neighs/atom = 447.02853
Ave special neighs/atom = 7.431875
Neighbor list builds = 11
Dangerous builds = 0
Total wall time: 0:00:06

Can you post the output from the core dump and your exact input script? How many GPUs are you using?

Thank you for having a look at this. I’m using one GPU and, like you, I have used the neigh half option.
When I said it runs fine without Kokkos, I meant with an executable that was compiled without the KOKKOS package.
In fact, the executable compiled with the KOKKOS package enabled also core-dumps when the GPU package is used, while the GPU package works fine with the other executable…

I spotted the faulty option: Kokkos_ARCH_ICX.
My CMake options are attached in case this is due to a bad interaction with other settings.
It runs perfectly fine without that option (I only tested the 17Apr2024 version).
Enabling it causes problems with the KOKKOS or GPU accelerator only when using atom_style full, though I did not investigate the GPU accelerator in much depth.
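
A minimal sketch of the corrected architecture lines in the preset: Kokkos_ARCH_ICX targets Ice Lake Xeon CPUs, not Cascade Lake, and I assume Kokkos_ARCH_SKX (Skylake-X, AVX-512) is the closest matching flag for a Cascade Lake host (alternatively, the host-arch flag can be omitted and -march=cascadelake in CMAKE_CXX_FLAGS_RELEASE left to do the work):

# Kokkos_ARCH_ICX targets Ice Lake Xeon; my host is a Cascade Lake 6248
#set(Kokkos_ARCH_ICX ON CACHE BOOL "" FORCE)     # the faulty option
set(Kokkos_ARCH_SKX ON CACHE BOOL "" FORCE)      # assumed closest match (AVX-512)
set(Kokkos_ARCH_VOLTA70 ON CACHE BOOL "" FORCE)  # V100 GPU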

Launching the following on a V100 GPU with a Cascade Lake 6248 CPU:

lmp -in in.rhodo      
lmp -in in.rhodo -sf gpu
lmp -in in.rhodo -k on g 1 t 1 -sf kk  -pk kokkos neigh half 

run fine and give me:

Performance: 0.548 ns/day, 43.796 hours/ns, 3.171 timesteps/s, 101.480 katom-step/s 
Performance: 8.502 ns/day, 2.823 hours/ns, 49.200 timesteps/s, 1.574 Matom-step/s
Performance: 20.208 ns/day, 1.188 hours/ns, 116.942 timesteps/s, 3.742 Matom-step/s

respectively.

Thank you again for your help,
Amaël

Summary
set(CMAKE_BUILD_TYPE "RELEASE" CACHE STRING "" FORCE)

set(FFT_SINGLE ON CACHE BOOL "" FORCE) # Could be OFF -> double precision
set(FFT "MKL" CACHE STRING "" FORCE)   #### set(FFT "FFTW3" CACHE STRING "" FORCE)
set(FFT_KOKKOS "CUFFT" CACHE STRING "" FORCE)   
set(FFT_MKL_THREADS "on" CACHE STRING "" FORCE)

set(GPU_PREC      "mixed"  CACHE STRING "" FORCE) 
set(GPU_API       "cuda"    CACHE STRING "" FORCE)
set(GPU_ARCH      "sm_70" CACHE STRING "" FORCE) 
set(CUDA_ENABLE_MULTIARCH "no" CACHE STRING "" FORCE) 
set(CUDA_MPS_SUPPORT "yes" CACHE STRING "" FORCE) 

set(Kokkos_ENABLE_AGGRESSIVE_VECTORIZATION ON  CACHE BOOL "" FORCE) 
set(Kokkos_ENABLE_DEPRECATION_WARNINGS OFF CACHE BOOL "" FORCE) 
set(Kokkos_ENABLE_SERIAL ON  CACHE BOOL "" FORCE)  
set(Kokkos_ENABLE_OPENMP ON  CACHE BOOL "" FORCE)
#TROUBLE#set(Kokkos_ARCH_ICX  on CACHE BOOL "" FORCE) 
set(Kokkos_ENABLE_CUDA   ON CACHE BOOL "" FORCE)
set(Kokkos_ENABLE_HIP    OFF CACHE BOOL "" FORCE) 
set(Kokkos_ARCH_VOLTA70 ON CACHE BOOL "" FORCE)

#17Apr2024#set(Kokkos_ENABLE_CUDA_UVM OFF CACHE BOOL "" FORCE)
#set(Kokkos_ENABLE_HIP_MULTIPLE_KERNEL_INSTANTIATIONS ON CACHE BOOL "" FORCE)


set(BUILD_OMP ON CACHE BOOL "" FORCE)
# use the GNU compilers, with support for MPI and OpenMP (on Linux boxes)
set(CMAKE_CXX_COMPILER     "g++" CACHE STRING "" FORCE)
set(CMAKE_C_COMPILER       "gcc"  CACHE STRING "" FORCE)
set(CMAKE_Fortran_COMPILER "gfortran"  CACHE STRING "" FORCE)



set(CMAKE_INSTALL_PREFIX "$FINAL_INSTALL" CACHE PATH "" FORCE)
set(CMAKE_CXX_FLAGS_RELEASE     "-O3 -DNDEBUG -march=cascadelake -mtune=cascadelake" CACHE STRING "" FORCE)
set(CMAKE_C_FLAGS_RELEASE       "-O3 -DNDEBUG -march=cascadelake -mtune=cascadelake" CACHE STRING "" FORCE)
set(CMAKE_Fortran_FLAGS_RELEASE "-O3 -DNDEBUG -march=cascadelake -mtune=cascadelake" CACHE STRING "" FORCE)

set(CMAKE_CXX_FLAGS_DEBUG     "-Wall -Wextra -g" CACHE STRING "" FORCE)
set(CMAKE_Fortran_FLAGS_DEBUG "-Wall -Wextra -g" CACHE STRING "" FORCE)
set(CMAKE_C_FLAGS_DEBUG       "-Wall -Wextra -g" CACHE STRING "" FORCE)

set(MPI_CXX          "g++"   CACHE STRING "" FORCE)
set(MPI_CXX_COMPILER "mpicxx" CACHE STRING "" FORCE)


unset(HAVE_OMP_H_INCLUDE CACHE)
set(OpenMP_C   "gcc" CACHE STRING "" FORCE)
set(OpenMP_CXX "g++" CACHE STRING "" FORCE)
set(OpenMP_C_FLAGS       "-fopenmp" CACHE STRING "" FORCE)
set(OpenMP_CXX_FLAGS     "-fopenmp" CACHE STRING "" FORCE)
set(OpenMP_Fortran_FLAGS "-fopenmp" CACHE STRING "" FORCE)

set(OpenMP_C_LIB_NAMES   "omp"         CACHE STRING "" FORCE)
set(OpenMP_CXX_LIB_NAMES "omp"         CACHE STRING "" FORCE)
set(OpenMP_omp_LIBRARY   "libiomp5.so" CACHE PATH "" FORCE)
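
For completeness: these options live in an initial-cache file that is passed to CMake with -C (the file name below is just an example):

cmake -C lmp-v100-preset.cmake -D PKG_KOKKOS=on ../cmake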

Hmmm, if your CPU doesn’t support the vector instructions you are compiling for, it can crash. For example, I’ve seen a crash on Haswell when accidentally compiling for KNL AVX-512; it gave an “illegal instruction” error or something like that. Glad you got it working.

It is surprising that it worked for all other simulations so far. I mixed up the optimization flags between the different GPU partitions I’m using…