LAMMPS Kokkos GPU: cudaErrorIllegalAddress during neighbor build (NBinKokkos::bin_atoms)

Qixuan · August 11, 2025, 12:29pm

I’m hitting a GPU crash when running LAMMPS with Kokkos. The error happens during neighbor construction / atom binning and aborts with cudaErrorIllegalAddress.

cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access was encountered /home/manav/Softwares/ml/lammps/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:154
Backtrace:
[0x73e31284c289] Kokkos::Impl::save_stacktrace()
[0x73e312828800] Kokkos::Impl::host_abort(char const*)
[0x73e31285278b] Kokkos::Impl::cuda_internal_error_abort(cudaError, char const*, char const*, int)
[0x73e3128529ca] Kokkos::Impl::cuda_device_synchronize(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
[0x73e31282a11d] Kokkos::Impl::ExecSpaceManager::static_fence(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
[0x73e311af5bb5] LAMMPS_NS::NBinKokkos<Kokkos::Cuda>::bin_atoms()
[0x73e311a7ff9c] void LAMMPS_NS::NeighborKokkos::build_kokkos<Kokkos::Cuda>(int)
[0x73e312103926] LAMMPS_NS::VerletKokkos::run(int)
[0x73e31182774c] LAMMPS_NS::Run::command(int, char**)
[0x73e311615afd] LAMMPS_NS::Input::execute_command()
[0x73e311615ef6] LAMMPS_NS::Input::file()
[0x58944bf775bc]
[0x73e306629d90]
[0x73e306629e40] __libc_start_main
[0x58944bf77665]
Aborted (core dumped)

Thanks a lot for any pointers!

akohlmey · August 11, 2025, 1:49pm

Without context and a way to reproduce it, we have no chance to point to anything.

We need to know:

1. What is your LAMMPS version? Is this “vanilla LAMMPS” or did you add any external packages not distributed as part of LAMMPS?
2. How exactly did you compile it, and with which settings?
3. What are your compiler and CUDA toolkit versions?
4. What is the output of lmp -h?
5. What kind of platform (OS, CPU architecture and version) are you running?
6. What kind of GPU do you have and how many per node?
7. What is your exact command line?
8. What is the specific input that triggers this error?
9. Do you get the same crash when you run any of the LAMMPS benchmark or example inputs?
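
Several of the items above (lmp -h output, toolkit version, GPU list, platform) can be gathered in one go. The following is only a sketch in Python, not an official LAMMPS tool: collect_report and run_or_note are hypothetical helper names, and it assumes the executable is called lmp and that nvcc and nvidia-smi are on the PATH. Anything that is missing is reported as such instead of raising an error.

```python
import platform
import shutil
import subprocess


def run_or_note(cmd):
    """Run cmd (a list of strings) and return its combined output,
    or a short note if the program is missing or fails to start."""
    if shutil.which(cmd[0]) is None:
        return f"{cmd[0]}: not found on PATH"
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, errors="replace", timeout=60
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"{cmd[0]}: failed to run ({exc})"
    return (result.stdout + result.stderr).strip()


def collect_report(lmp_exe="lmp"):
    """Gather version and platform details commonly asked for in a bug report.
    The executable name `lmp` is an assumption; adjust it to your install."""
    return {
        "platform": platform.platform(),        # OS and kernel
        "machine": platform.machine(),          # CPU architecture
        "lammps_help": run_or_note([lmp_exe, "-h"]),
        "nvcc": run_or_note(["nvcc", "--version"]),
        "gpus": run_or_note(["nvidia-smi", "-L"]),
    }


if __name__ == "__main__":
    for key, value in collect_report().items():
        print(f"=== {key} ===\n{value}\n")
```

The build settings, the exact command line, and the input file still have to be pasted by hand.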

Qixuan · August 11, 2025, 2:51pm

Thank you for your reply.

1. LAMMPS version: LAMMPS (22 Jul 2025 - Development), branch develop. It is vanilla LAMMPS.
2. Build configuration:
cmake -C kokkos-cuda.cmake \
  -D CMAKE_C_COMPILER=$MPICC \
  -D CMAKE_CXX_COMPILER=$MPICXX \
  -D CMAKE_BUILD_TYPE=Release \
  -D CMAKE_INSTALL_PREFIX=$(pwd) \
  -D BUILD_MPI=ON \
  -D PKG_ML-IAP=ON \
  -D PKG_ML-SNAP=ON \
  -D BUILD_SHARED_LIBS=ON \
  -D MLIAP_ENABLE_PYTHON=ON \
  -D PKG_PYTHON=ON \
  ../cmake
3. Compilers / MPI:
MPI C: Intel oneAPI MPI 2021.14
MPI C++: Intel MPI wrappers → GCC 11.4.0
C/C++ compiler used: GCC 11.4.0
CUDA:
CUDA toolkit (nvcc): 11.8 (V11.8.89)
nvidia-smi shows CUDA 12.2
GPU: NVIDIA L40 (46 GB)
4. Output of lmp -h: lammps_help.txt (6.8 KB)
5. Ubuntu 22.04.5 LTS (Jammy), x86_64, Linux kernel 6.8.0-60-generic.
6. GPUs per node: 1 × NVIDIA L40.
7. command line: lmp -k on g 1 -sf kk -in myinput.in
8. my input
units metal
atom_style atomic
atom_modify map yes
newton on
read_data myfile.data
pair_style mliap unified mymodel.pt 0
pair_coeff * * C Na
timestep 0.001
thermo 100
dump myDump all custom 100 dump.mace id type x y z
dump_modify myDump sort id
fix 1 all nvt temp 300 300 100
run 1000
9. I don’t see this error on benchmarks.

Thanks again for your help!

stamoor · August 11, 2025, 2:52pm

Often this occurs when your system blows up, similar to the lost atoms or non-numeric pressure errors.
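
One rough way to test the "system blows up" idea is to look at the atom coordinates in the last dump frame written before the crash and flag anything that is non-finite or far outside the simulation box. The following Python sketch makes several assumptions: find_suspect_atoms is a hypothetical helper and not part of LAMMPS, the coordinates and box bounds have already been parsed from a dump file, and the coordinates are wrapped ones (as written by the x y z dump columns), so an atom far outside the box is suspicious. With dumps written only every 100 steps the last frame may predate the problem, so dumping more often for a short test run is likely needed.

```python
import math


def find_suspect_atoms(positions, box_lo, box_hi, margin=0.0):
    """Return indices of atoms whose coordinates are non-finite (NaN/inf)
    or lie outside [box_lo, box_hi] by more than `margin` in any dimension."""
    suspects = []
    for i, pos in enumerate(positions):
        for x, lo, hi in zip(pos, box_lo, box_hi):
            if not math.isfinite(x) or x < lo - margin or x > hi + margin:
                suspects.append(i)
                break  # one bad coordinate is enough to flag this atom
    return suspects


if __name__ == "__main__":
    # Made-up coordinates for illustration only, in a 10 x 10 x 10 box.
    positions = [
        (1.0, 2.0, 3.0),           # fine
        (float("nan"), 0.0, 0.0),  # non-finite coordinate
        (500.0, 1.0, 1.0),         # far outside the box
    ]
    print(find_suspect_atoms(positions, (0.0, 0.0, 0.0), (10.0, 10.0, 10.0)))
    # prints [1, 2]
```

The margin argument is there because atoms can sit slightly outside the box between reneighboring steps without anything being wrong. An empty result does not rule out a blow-up, it only means the checked frame looks sane.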