Error running KOKKOS on Ubuntu 22.04

Hi everyone,

There are currently some issues with the CUDA toolkit and the default g++ version in Ubuntu 22.04, which can affect the compilation of Kokkos (see here). My workaround was to set the compiler to g++-10 with the following cmake command and presets:

cmake -C ../cmake/presets/mine.cmake -C ../cmake/presets/kokkos-cuda.cmake -D WITH_JPEG=yes -D WITH_PNG=yes -D WITH_FFMPEG=yes -DCMAKE_CXX_STANDARD=17 -DCMAKE_CXX_COMPILER=g++-10 ../cmake/
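
The compiler choice in the command above can be made explicit before configuring. This is only a minimal sketch for a POSIX shell: the helper name pick_host_cxx is hypothetical, and g++-9 is listed purely as an illustrative fallback; g++-10 is the compiler actually used in this post.

```shell
# Print the first candidate found on PATH; return 1 if none is found.
# (pick_host_cxx is a hypothetical helper name, not part of LAMMPS or Kokkos.)
pick_host_cxx() {
    for candidate in "$@"; do
        if command -v "$candidate" >/dev/null 2>&1; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# g++-10 comes from the post; g++-9 is an assumed fallback for illustration.
cxx=$(pick_host_cxx g++-10 g++-9) || cxx=""
if [ -n "$cxx" ]; then
    echo "configure with: -DCMAKE_CXX_COMPILER=$cxx"
else
    echo "neither g++-10 nor g++-9 found on PATH" >&2
fi
```

The printed option can then be pasted into the cmake line in place of the hard-coded -DCMAKE_CXX_COMPILER=g++-10.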

Here are the cmake preset files:

mine.cmake (which is just all_on.cmake with fewer packages)

# Preset that turns on all existing packages. Using the combination
# of this preset followed by the nolib.cmake preset should configure
# a LAMMPS binary, with as many packages included, that can be compiled
# with just a working C++ compiler and an MPI library.

set(ALL_PACKAGES
AMOEBA
ASPHERE
ATC
AWPMD
BOCS
BODY
BPM
BROWNIAN
CG-DNA
CG-SPICA
CLASS2
COLLOID
COLVARS
COMPRESS
CORESHELL
DIELECTRIC
DIFFRACTION
DIPOLE
DPD-BASIC
DPD-MESO
DPD-REACT
DPD-SMOOTH
DRUDE
ELECTRODE
EFF
EXTRA-COMPUTE
EXTRA-DUMP
EXTRA-FIX
EXTRA-MOLECULE
EXTRA-PAIR
FEP
GPU
GRANULAR
INTEL
INTERLAYER
KIM
KOKKOS
KSPACE
LATBOLTZ
LEPTON
MANIFOLD
MANYBODY
MC
MEAM
MESONT
MGPT
MISC
ML-HDNNP
ML-IAP
ML-POD
ML-RANN
ML-SNAP
MOFFF
MOLECULE
MOLFILE
MPIIO
OPENMP
OPT
ORIENT
PERI
PHONON
PLUGIN
POEMS
PTM
PYTHON
QEQ
QMMM
QTB
REACTION
REAXFF
REPLICA
RIGID
SHOCK
SMTBQ
SPH
SPIN
SRD
TALLY
UEF
VORONOI
VTK
YAFF)

foreach(PKG ${ALL_PACKAGES})
  set(PKG_${PKG} ON CACHE BOOL "" FORCE)
endforeach()

kokkos-cuda.cmake
# preset that enables KOKKOS and selects CUDA compilation with OpenMP
# enabled as well. This selects Volta (CC 7.0) as the GPU arch via
# Kokkos_ARCH_VOLTA70 below.
set(PKG_KOKKOS ON CACHE BOOL "" FORCE)
set(Kokkos_ENABLE_SERIAL ON CACHE BOOL "" FORCE)
set(Kokkos_ENABLE_CUDA   ON CACHE BOOL "" FORCE)
set(Kokkos_ARCH_VOLTA70 ON CACHE BOOL "" FORCE)
set(BUILD_OMP ON CACHE BOOL "" FORCE)

# hide deprecation warnings temporarily for stable release
set(Kokkos_ENABLE_DEPRECATION_WARNINGS OFF CACHE BOOL "" FORCE)

However, when running the following script, I get a segfault:

input script
units metal
boundary p p p
pair_style snap/kk

read_data mo.lmp

pair_coeff * * Mo_Zuo_JPCA2020.snapcoeff Mo_Zuo_JPCA2020.snapparam Mo

velocity all create 300 9284
velocity all zero linear

thermo 1
min_style cg/kk
minimize 1e-6 1e-8 1000 10000

reset_timestep 0

dump 1 all atom 100 dump.lammpstrj
fix NVT all nvt/kk temp 300 300 0.1
run 10000

write_data data.out.lmp

I traced the error to the core of Kokkos by running LAMMPS under gdb with the following command:
gdb --args lmp -k on g 1 -pk kokkos newton on neigh half comm no -i in.lmp

GDB error
Thread 1 "lmp" received signal SIGSEGV, Segmentation fault.
Kokkos::DualView<double**, Kokkos::LayoutRight, Kokkos::Cuda, void>::sync_impl<Kokkos::Cuda> (this=0x555585fdef98) at /home/germain/Documents/Codes/lammps/lib/kokkos/core/src/impl/Kokkos_ViewMapping.hpp:4060
4060      KOKKOS_FUNCTION RuntimeCheckViewMemoryAccessViolation(char const* const,

The LAMMPS version is 3 Aug 2023 - Development - patch_7Jan2022-8294-gf0801338f3-modified, compiled from source. I don’t know whether this stems from a mismatch between my graphics card and the Kokkos compilation configuration, or whether it is a true bug in the code (which would require opening an issue on the Kokkos side). My GPU is an NVIDIA RTX 3500 with the Ada architecture, with Driver Version 535.113.01 and CUDA Version 12.2 installed. The Ubuntu repository does not provide nvcc 12, which supports the Ada architecture (the sm_89 option). Installing it would require adding the Debian unstable repo to this Ubuntu computer, something I am reluctant to do for now since I am not sure that the error comes from this.
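
One way to check the suspected mismatch is to compare the card's compute capability with the Kokkos_ARCH_* option set in the preset. The sketch below maps a few compute capabilities to Kokkos option names. The helper name is hypothetical, the table is deliberately incomplete, and whether Kokkos_ARCH_ADA89 is accepted depends on the Kokkos version bundled with the LAMMPS source, so that entry is an assumption to verify against the Kokkos documentation.

```shell
# Map an NVIDIA compute capability to a Kokkos architecture option name.
# (kokkos_arch_for_cc is a hypothetical helper; the table is incomplete.)
kokkos_arch_for_cc() {
    case "$1" in
        7.0) echo "Kokkos_ARCH_VOLTA70" ;;
        7.5) echo "Kokkos_ARCH_TURING75" ;;
        8.0) echo "Kokkos_ARCH_AMPERE80" ;;
        8.6) echo "Kokkos_ARCH_AMPERE86" ;;
        8.9) echo "Kokkos_ARCH_ADA89" ;;   # assumes a Kokkos version that knows Ada
        *)   return 1 ;;
    esac
}

# The preset in this post sets Kokkos_ARCH_VOLTA70, i.e. CC 7.0;
# sm_89 (mentioned above) corresponds to CC 8.9.
kokkos_arch_for_cc 7.0
kokkos_arch_for_cc 8.9

# On recent drivers the compute capability can be queried with (assumption:
# the installed nvidia-smi supports the compute_cap field):
#   nvidia-smi --query-gpu=compute_cap --format=csv,noheader
```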

Any idea on a solution?

If you provide the rest of the files needed to run your example, I can try to reproduce this.

example_kk.zip (3.6 KB)

For sure, please find attached a zip archive with the data file used and the SNAP potential I tried to use.

This runs fine with the standard version of the SNAP package.

I can reproduce this; however, the error goes away if you add -sf kk to the command line, i.e.:
lmp -k on g 1 -pk kokkos newton on neigh half comm no -i in.lmp -sf kk

I have no idea why this resolves the issue though.

Using -sf kk is much preferred. If you instead add /kk manually to every style, as in the input above, you must also remember the atom_style and the verlet run style, and maybe other styles that are defined implicitly. That approach is fraught with peril, although LAMMPS should error out if you make a mistake.
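
The idea behind the suffix option can be illustrated with a small shell sketch: each style name gets the suffix appended if such a variant exists, and is left alone otherwise. This is only a conceptual illustration, not how LAMMPS implements it; the helper name apply_suffix and the short list of /kk styles are hypothetical.

```shell
# Conceptual sketch: return "style/suffix" if that variant is in the list of
# available styles, otherwise return the plain style name unchanged.
# (apply_suffix is a hypothetical helper, not a LAMMPS function.)
apply_suffix() {
    style=$1
    suffix=$2
    shift 2
    case "$style" in
        */"$suffix") echo "$style"; return ;;   # already carries the suffix
    esac
    for available in "$@"; do
        if [ "$available" = "$style/$suffix" ]; then
            echo "$style/$suffix"
            return
        fi
    done
    echo "$style"
}

# Hypothetical, incomplete list of accelerated variants, for illustration only.
kk_styles="snap/kk nvt/kk cg/kk atomic/kk"
for s in snap nvt cg atomic zero; do
    apply_suffix "$s" kk $kk_styles
done
```

The point of the sketch is that the implicitly defined styles are covered as well, which is what manual /kk editing of an input file tends to miss.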

I can confirm that adding -sf kk to the command line solves the crash (thanks @mkanski!). However, I get very different results with and without Kokkos+GPU.

Using the standard CPU parallelization, everything runs OK. Using Kokkos, particles just do not see one another: the potential energy is way off and particles basically fly around without colliding. So I think there is a problem with the neighbor lists. I tried setting a cutoff twice as large, but this does not change anything. I think opening an issue for this on GitHub would be more appropriate, so I’ll do that. Many thanks for the comments.
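
A quick way to quantify such a discrepancy is to compare the first reported pair energy of the two runs. The sketch below assumes the default thermo output, where E_pair is the third column under a header line starting with "Step"; the helper names and the log file names in the usage comment are hypothetical.

```shell
# Print the 3rd column (E_pair in the default thermo layout, an assumption)
# of the first thermo line following the first "Step ..." header in a log file.
first_epair() {
    awk '/^ *Step / { if ((getline) > 0) print $3; exit }' "$1"
}

# Print |a - b| / |a| with six decimals; fail if a is zero.
rel_diff() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        d = a - b; if (d < 0) d = -d
        r = (a < 0) ? -a : a
        if (r == 0) exit 1
        printf "%.6f\n", d / r
    }'
}

# Usage (log.cpu and log.kokkos are hypothetical file names):
#   rel_diff "$(first_epair log.cpu)" "$(first_epair log.kokkos)"
```

A relative difference far above floating-point noise at step 0, before any dynamics, points at the force or neighbor-list computation rather than at diverging trajectories.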

When I run the input without -sf kk, I get a segmentation fault on a V100. I will make it error out more gracefully.

I will also debug the numerical issues.

It should now error out gracefully without -sf kk, with the commit in this pull request: "Collected small fixes and updates" by akohlmey, Pull Request #3943, lammps/lammps on GitHub.

I fixed the bug in "Collected small fixes and updates" by akohlmey, Pull Request #3943, lammps/lammps on GitHub. The number of atom types wasn’t set yet in the constructor, leading to an out-of-bounds access.
