KOKKOS error (cudaErrorIllegalAddress)

FTLMD · September 29, 2025, 5:24am

Hello to LAMMPS users,

I am currently facing the error shown below while running a friction simulation using the KOKKOS package on an NVIDIA RTX 4090 GPU.

"cudaStreamSynchronize(stream) error( cudaErrorIllegalAddress): an illegal memory access was encountered /home/name/lammps-22Jul2025/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:165
Backtrace:
[0x64ad65f2d389]
[0x64ad65f09bb0]
[0x64ad65f33216]
[0x64ad65f33bb9]
[0x64ad65cf3d91]
[0x64ad65cf4138]
[0x64ad65d065f2]
[0x64ad65d08865]
[0x64ad652d62dd]
[0x64ad644b0e14]
[0x64ad63e9e727]
[0x64ad63d6737b]
[0x64ad63d67d7f]
[0x64ad63ccecb1]
[0x76a65ea2a1ca]
[0x76a65ea2a28b] __libc_start_main
[0x64ad63d5a915]
**[DESKTOP-20TF71N:192096] *** Process received signal *****
[DESKTOP-20TF71N:192096] Signal: Aborted (6)
[DESKTOP-20TF71N:192096] Signal code: (-6)
[DESKTOP-20TF71N:192096] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x45330)[0x76a65ea45330]
[DESKTOP-20TF71N:192096] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x11c)[0x76a65ea9eb2c]
[DESKTOP-20TF71N:192096] [ 2] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x1e)[0x76a65ea4527e]
[DESKTOP-20TF71N:192096] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xdf)[0x76a65ea288ff]
[DESKTOP-20TF71N:192096] [ 4] lmp(+0x2469bbd)[0x64ad65f09bbd]
[DESKTOP-20TF71N:192096] [ 5] lmp(+0x2493216)[0x64ad65f33216]
[DESKTOP-20TF71N:192096] [ 6] lmp(+0x2493bb9)[0x64ad65f33bb9]
[DESKTOP-20TF71N:192096] [ 7] lmp(+0x2253d91)[0x64ad65cf3d91]
[DESKTOP-20TF71N:192096] [ 8] lmp(+0x2254138)[0x64ad65cf4138]
[DESKTOP-20TF71N:192096] [ 9] lmp(+0x22665f2)[0x64ad65d065f2]
[DESKTOP-20TF71N:192096] [10] lmp(+0x2268865)[0x64ad65d08865]
[DESKTOP-20TF71N:192096] [11] lmp(+0x18362dd)[0x64ad652d62dd]
[DESKTOP-20TF71N:192096] [12] lmp(+0xa10e14)[0x64ad644b0e14]
[DESKTOP-20TF71N:192096] [13] lmp(+0x3fe727)[0x64ad63e9e727]
[DESKTOP-20TF71N:192096] [14] lmp(+0x2c737b)[0x64ad63d6737b]
[DESKTOP-20TF71N:192096] [15] lmp(+0x2c7d7f)[0x64ad63d67d7f]
[DESKTOP-20TF71N:192096] [16] lmp(+0x22ecb1)[0x64ad63ccecb1]
[DESKTOP-20TF71N:192096] [17] /lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x76a65ea2a1ca]
[DESKTOP-20TF71N:192096] [18] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x76a65ea2a28b]
[DESKTOP-20TF71N:192096] [19] lmp(+0x2ba915)[0x64ad63d5a915]
**[DESKTOP-20TF71N:192096] *** End of error message *****
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 192096 on node DESKTOP-20TF71N exited on
signal 6 (Aborted).
--------------------------------------------------------------------------"

I am using the 22Jul2025 version of LAMMPS on Ubuntu 24.04.
The CUDA version is 12.6.
These are the commands I used to configure the build and select the packages:
“cmake -C …/cmake/presets/basic.cmake -C …/cmake/presets/kokkos-cuda.cmake …/cmake
cmake -D Kokkos_ENABLE_CUDA=yes -D Kokkos_ENABLE_OPENMP=yes -D PKG_KOKKOS=yes -D PKG_MEAM=on -D PKG_MOLECULE=on -D PKG_OPENMP=yes -D GPU_API=cuda -D GPU_ARCH=sm_89 …/cmake”
The command I use to start the simulation is (just in case):
“mpirun -np 1 lmp -k on g 1 -sf kk -pk kokkos neigh half newton on -in Simulation.lmp”

I searched for this error on this forum and, following an earlier discussion, tried keeping my GPU cool during runs (at about 37 to 45 degrees Celsius), but the error still appears every so often.

I still can’t work out why this happens, so I would be grateful if anyone has suggestions or has seen this before.

akohlmey · September 29, 2025, 8:10am

As far as I read the discussion, the conclusion was not the heat, but one defective GPU (out of 4).

The error is a very generic error from a low level library, so it is very difficult to give any suggestions without the ability to reproduce the error or knowing any details about your simulation.
Some questions:

1. You say you are using the 22 July 2025 version. Is that the original release or the update?
2. Does the same error happen with other input decks, e.g. the LAMMPS bench inputs or some of the examples, or only with this one input?
3. Does your simulation run to completion without errors when you are not using KOKKOS?

FTLMD · September 29, 2025, 8:24am

1. It was the original release. I did not check whether there had been an update. Should I try the updated version?
2. It happens only with this kind of simulation (indentation/friction). I have modeled DLC using the liquid-quenching method with the same potential files, parameters, and KOKKOS, and it ran without error.
3. Yes. Although the number of atoms and the size of the simulation were different (past: 4,000 atoms, current: 20,000 atoms), it ran fine on the CPU.

akohlmey · September 29, 2025, 8:36am

You have to check which bugs are fixed. If there is no mention of bugfixes in the KOKKOS package, then the chance is small that it will address your problem.

This only counts if you run the exact same simulation. The issue could be triggered by your starting configuration.

FTLMD · September 29, 2025, 8:40am

I will try to simulate without using KOKKOS.

Thank you for your time and kind suggestion.

stamoor · September 29, 2025, 3:56pm

This error is basically the same as a segmentation fault on the CPU, and is typically due to either an out of bounds memory access or trying to access host memory inside a device kernel. I will try to reproduce on H100 when I get a chance.
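
Since the backtrace in the first post lists only raw addresses, one way to get a little more out of it is to translate the lmp(+0x...) offsets into function names with addr2line. The following is a minimal Python sketch, not something that was posted in this thread: the helper names frame_offsets and addr2line_command are made up for illustration, "lmp" stands in for the full path to the executable, and the result is only useful if LAMMPS was built with debug information (otherwise addr2line just prints "??"). It would also only identify the host-side call that was waiting on the GPU, not the faulting line inside the device kernel.

```python
import re

# Matches frames such as "lmp(+0x2469bbd)[0x64ad65f09bbd]" and captures the
# offset inside the executable (the hex number after the "+").
FRAME_RE = re.compile(r"\blmp\(\+(0x[0-9a-fA-F]+)\)")


def frame_offsets(log_text):
    """Return the lmp-relative offsets from a backtrace, in frame order."""
    return FRAME_RE.findall(log_text)


def addr2line_command(offsets, executable="lmp"):
    """Build an addr2line call (-f: function names, -C: demangle C++ names)."""
    return ["addr2line", "-f", "-C", "-e", executable, *offsets]


# Three lines copied from the error message; the libc frame is skipped
# because it does not belong to the lmp executable.
sample = """\
[DESKTOP-20TF71N:192096] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xdf)[0x76a65ea288ff]
[DESKTOP-20TF71N:192096] [ 4] lmp(+0x2469bbd)[0x64ad65f09bbd]
[DESKTOP-20TF71N:192096] [ 5] lmp(+0x2493216)[0x64ad65f33216]
"""

offsets = frame_offsets(sample)
print(" ".join(addr2line_command(offsets)))
```

The printed command can be pasted into a shell on the machine where lmp was built; passing the whole saved error message to frame_offsets would cover all frames at once.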

FTLMD · September 30, 2025, 1:26am

@stamoor Thank you for your reply!
Just one question: could the neighbor list size be one of the causes of this memory error?

FTLMD · October 21, 2025, 2:07pm

Dr. Akohlmey

I have checked the update, but there seems to be no mention of bug fixes in the KOKKOS package.
I have run the exact same simulation without the KOKKOS package and it runs well.

Would you have any further suggestions for troubleshooting steps I could check next?

Thank you for your continued assistance.

akohlmey · October 21, 2025, 2:43pm

There is not enough information here for a more detailed diagnosis and resulting suggestions.

stamoor · October 22, 2025, 10:17pm

@FTLMD Can you please post a minimal working example of the issue so we can debug? Thank you.

FTLMD · October 23, 2025, 3:06am

Thank you for your assistance.

What I am currently trying to do is a friction (sliding) simulation:
a Si tip sliding on the surface of a Zr-doped carbon substrate.
The atoms used are: C, Zr, Si
I have used the hybrid pair style as follows:
C-C, C-Zr, Zr-Zr : MEAM potential
Si-Si : Tersoff potential
C-Si, Zr-Si : LJ potential

The simulation process is as follows:

1. Relaxation
2. Indentation of the tip
3. Relaxation
4. Sliding
5. Relaxation
6. Unloading of the tip

Simulation condition :
[ Modelling :
Zr doped substrate
Fixed layer- fixed with move linear keyword
Thermostat layer- NVT 300K
Newtonian layer - NVE

Si tip :
A hemisphere fixed or moving using the move linear keyword.

The normal load/indentation force I am trying to give is 150 nN (approximately 93.59 eV/A)
Indentation and sliding speed : 0.1 A/ps
timestep : 0.25 fs
]
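
For anyone checking the numbers in the conditions above, here is a small Python sketch of the arithmetic. It is only an illustration: the function names are invented, and it assumes the force is expressed in eV/A as stated in the post. With the exact SI value of the electron volt, 150 nN comes out at about 93.6 eV/A, close to the figure quoted, and a speed of 0.1 A/ps with a 0.25 fs timestep moves the tip by 2.5e-5 A per step.

```python
# 1 eV = 1.602176634e-19 J (exact in the SI since 2019).
EV_IN_JOULES = 1.602176634e-19
ANGSTROM_IN_METRES = 1.0e-10


def nanonewton_to_ev_per_angstrom(force_nn):
    """Convert a force from nN to eV/A."""
    force_newton = force_nn * 1.0e-9
    # N = J/m, so multiply by metres per angstrom and divide by joules per eV.
    return force_newton * ANGSTROM_IN_METRES / EV_IN_JOULES


def displacement_per_step(speed_angstrom_per_ps, timestep_fs):
    """Distance in A covered in one timestep (1 fs = 1e-3 ps)."""
    return speed_angstrom_per_ps * timestep_fs * 1.0e-3


print(f"150 nN = {nanonewton_to_ev_per_angstrom(150.0):.2f} eV/A")
print(f"per-step displacement = {displacement_per_step(0.1, 0.25):.2e} A")
```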

I am trying to speed up the simulation using the KOKKOS package (compiled for a GeForce RTX 4090 GPU).
The problem is that the CUDA illegal memory access error shown above appears while the simulation is in the indentation or sliding step (at random times, but mostly during indentation).

I have run the same simulation on the CPU and it runs fine without errors.

In addition, I have tried:

increasing the neighbor list,
a slower indentation speed (0.05 A/ps),
a reduced timestep (0.1 fs).
All of these still produce the same error at the indentation/sliding step when using the KOKKOS package.
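
One further check that is not among the attempts listed above would be to rerun with the CUDA environment variable CUDA_LAUNCH_BLOCKING=1. CUDA normally reports kernel errors asynchronously, so the cudaStreamSynchronize call in the message is where the error surfaced, not necessarily where it happened; with blocking launches the failure should be reported right after the offending kernel. The Python sketch below only assembles such a run from the command given in the first post. The function name debug_launch is hypothetical, and it assumes mpirun passes the environment on to a local rank, which I have not verified on this setup. Expect the run to be noticeably slower.

```python
import os


def debug_launch(input_file, base_env=None):
    """Return (command, environment) for a run with synchronous kernel launches."""
    env = dict(os.environ if base_env is None else base_env)
    # Documented CUDA setting: every kernel launch blocks until it finishes.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    # Same command line as in the original post, split into arguments.
    cmd = ["mpirun", "-np", "1", "lmp", "-k", "on", "g", "1", "-sf", "kk",
           "-pk", "kokkos", "neigh", "half", "newton", "on", "-in", input_file]
    return cmd, env


cmd, env = debug_launch("Simulation.lmp")
print("CUDA_LAUNCH_BLOCKING=" + env["CUDA_LAUNCH_BLOCKING"], " ".join(cmd))
# To actually start the run: subprocess.run(cmd, env=env, check=False)
```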

If there are any other steps I should take or any information you require, please let me know and I’ll respond as soon as possible.
Thank you for your consideration.

stamoor · October 23, 2025, 3:17am

We need a LAMMPS input file and data file–everything to run LAMMPS, not just a text description.

FTLMD · October 23, 2025, 3:36am

I’m terribly sorry, I misunderstood what you meant.

Here are the files,

Data file
Assembly.data (2.2 MB)

Input script
Simulation.lmp (6.0 KB)

Potential files
2005_SiC.tersoff (1.8 KB)
ZrC.meam (703 Bytes)
ZrC_library.meam (519 Bytes)

Thank you, Dr. Stamoor

FTLMD · November 17, 2025, 2:31am

Dear Dr. Stan Moore,

I am the user who previously uploaded this issue. I understand you must be very busy, but I am writing to inquire if you have found any clues that might help resolve this problem.

Currently, this specific issue occurs only in simulations that use the MEAM potential with Kokkos GPU acceleration on my computer.

To assist in solving this problem, I would like to detail the steps I have taken so far:

My simulation sequence is: Modeling → Data File Assembly → Main Simulation.
The issue does not occur when I use Kokkos GPU acceleration during the Modeling step.
However, this error appears when I run the Main Simulation after assembling the modeled data into a single data file (specifically, when using the MEAM potential).
I have consistently ensured the correct order and units when combining the data files.
The simulation runs successfully without error when executed on the CPU.
Although it is an entirely different simulation, I have not encountered this error when using a Tersoff potential.

Do you have any insights or suspicions regarding what might be causing this error?

Thank you for your time and assistance with this matter.

Respectfully,

FTLMD