Hello LAMMPS developers,
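I am encountering a CUDA illegal memory access when running a DEM simulation with the KOKKOS package (CUDA backend, no UVM), using pair style gran/hooke/history/kk and the NEIGH_HISTORY/KK/DEVICE fix. The run continues from a restart file with 7223825 atoms on 4 MPI ranks (one GPU each) and aborts after the thermo output at step 340000.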
Command line and full output, including the error:
k0006390@nal-005:/mnt/gs21/scratch/groups/wmich/lammpsrun/run$ mpirun -np 4 --bind-to core --map-by ppr:4:node:pe=1 -x UCX_IB_GPU_DIRECT_RDMA=yes -x UCX_WARN_UNUSED_ENV_VARS=n ~/lammps/builda100cudaawarenouvm/lmp -k on g 4 -sf kk -pk kokkos neigh half comm device -in in.settle_restart
LAMMPS (10 Dec 2025 - Development - patch_10Dec2025-120-ga51f9ba0e7)
KOKKOS mode with Kokkos version 5.0.0 is enabled
using double precision
using view layout = legacy
will use up to 4 GPU(s) per node
using 1 OpenMP thread(s) per MPI task
Reading restart file …
restart file = 10 Dec 2025, LAMMPS = 10 Dec 2025
restoring atom style sphere/kk from restart
orthogonal box = (-0.12 -0.4 -0.4) to (0.16 0.4 0.4)
1 by 2 by 2 MPI processor grid
restoring pair style gran/hooke/history/kk from restart
7223825 atoms
read_restart CPU = 5.820 seconds
Resetting peratom fix info from restart file:
fix style: wall/gran/kk, fix ID: wall_floor
CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE
Your simulation uses code contributions which should be cited:
- KOKKOS package: https://doi.org/10.1145/3731599.3767498
The log file lists these citations in BibTeX format.
CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE
Resetting global fix info from restart file:
fix style: NEIGH_HISTORY/KK/DEVICE, fix ID: NEIGH_HISTORY_HH0
Resetting peratom fix info from restart file:
fix style: NEIGH_HISTORY/KK/DEVICE, fix ID: NEIGH_HISTORY_HH0
Generated 0 of 0 mixed pair_coeff terms from geometric mixing rule
All restart file global fix info was re-assigned
All restart file peratom fix info was re-assigned
Neighbor list info …
update: every = 1 steps, delay = 0 steps, check = yes
max neighbors/atom: 10000, page size: 100000
master list distance cutoff = 0.0025
ghost atom cutoff = 0.0025
binsize = 0.0025, bins = 113 320 320
1 neighbor lists, perpetual/occasional/extra = 1 0 0
(1) pair gran/hooke/history/kk, perpetual
attributes: half, newton off, size, history, kokkos_device
pair build: half/bin/newtoff/size/kk/device
stencil: full/bin/3d
bin: kk/device
Setting up Verlet run …
Unit style : si
Current step : 0
Time step : 1e-08
Per MPI rank memory allocation (min/avg/max) = 3817 | 3884 | 3977 Mbytes
Step Atoms Dt Time c_fmax[1] c_fmax[2] c_fmax[3] CPU
0 7223825 1e-08 0 1.7014669e-06 0 0 0
20000 7223825 1e-08 0.0002 2.9694914e-06 1.8284306e-05 1.9164159e-05 113.87738
40000 7223825 1e-08 0.0004 2.9693954e-06 1.8282673e-05 1.9162414e-05 227.46573
60000 7223825 1e-08 0.0006 2.9692994e-06 1.828104e-05 1.916067e-05 360.13538
80000 7223825 1e-08 0.0008 2.9692034e-06 1.8279408e-05 1.9158926e-05 473.66435
100000 7223825 1e-08 0.001 2.9691074e-06 1.8277776e-05 1.9157182e-05 587.15008
120000 7223825 1e-08 0.0012 2.9690115e-06 1.8276144e-05 1.9155438e-05 718.71947
140000 7223825 1e-08 0.0014 2.9689155e-06 1.8274513e-05 1.9153695e-05 831.7781
160000 7223825 1e-08 0.0016 2.9688195e-06 1.8272881e-05 1.9151952e-05 963.32055
180000 7223825 1e-08 0.0018 2.9687235e-06 1.827125e-05 1.9150209e-05 1076.073
200000 7223825 1e-08 0.002 2.9686276e-06 1.8269619e-05 1.9148466e-05 1188.9594
220000 7223825 1e-08 0.0022 2.9685316e-06 1.8267989e-05 1.9146724e-05 1324.0336
240000 7223825 1e-08 0.0024 2.9684356e-06 1.8266358e-05 1.9144981e-05 1436.8825
260000 7223825 1e-08 0.0026 0.0070449508 0.01971426 0.0078929965 1573.9749
280000 7223825 1e-08 0.0028 2.9682437e-06 1.8274019e-05 1.914909e-05 1686.9452
300000 7223825 1e-08 0.003 0.0053569413 0.0067519438 0.010838032 1799.7158
320000 7223825 1e-08 0.0032 0.0020133844 0.0087335999 0.020193034 1932.5763
340000 7223825 1e-08 0.0034 0.0032335168 0.0027114475 0.01556825 2045.8299
cudaStreamSynchronize(stream) error( cudaErrorIllegalAddress): an illegal memory access was encountered /mnt/home/k0006390/lammps/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:157
Backtrace:
[0x178ae65]
[0x176491c]
[0x1791963]
[0x1791b37]
[0x876a15]
[0xc7756d]
[0x63c16c]
[0x4ef845]
[0x4eff4e]
[0x48252d]
[0x147eb6ad9d90]
[0x147eb6ad9e40] __libc_start_main
[0x4e40a5]
[nal-005:1782865] *** Process received signal ***
[nal-005:1782865] Signal: Aborted (6)
[nal-005:1782865] Signal code: (-6)
[nal-005:1782865] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x147eb6af2520]
[nal-005:1782865] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x147eb6b469fc]
[nal-005:1782865] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x147eb6af2476]
[nal-005:1782865] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x147eb6ad87f3]
[nal-005:1782865] [ 4] /mnt/home/k0006390/lammps/builda100cudaawarenouvm/lmp[0x1764929]
[nal-005:1782865] [ 5] /mnt/home/k0006390/lammps/builda100cudaawarenouvm/lmp[0x1791963]
[nal-005:1782865] [ 6] /mnt/home/k0006390/lammps/builda100cudaawarenouvm/lmp[0x1791b37]
[nal-005:1782865] [ 7] /mnt/home/k0006390/lammps/builda100cudaawarenouvm/lmp[0x876a15]
[nal-005:1782865] [ 8] /mnt/home/k0006390/lammps/builda100cudaawarenouvm/lmp[0xc7756d]
[nal-005:1782865] [ 9] /mnt/home/k0006390/lammps/builda100cudaawarenouvm/lmp[0x63c16c]
[nal-005:1782865] [10] /mnt/home/k0006390/lammps/builda100cudaawarenouvm/lmp[0x4ef845]
[nal-005:1782865] [11] /mnt/home/k0006390/lammps/builda100cudaawarenouvm/lmp[0x4eff4e]
[nal-005:1782865] [12] /mnt/home/k0006390/lammps/builda100cudaawarenouvm/lmp[0x48252d]
[nal-005:1782865] [13] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x147eb6ad9d90]
[nal-005:1782865] [14] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x147eb6ad9e40]
[nal-005:1782865] [15] /mnt/home/k0006390/lammps/builda100cudaawarenouvm/lmp[0x4e40a5]
[nal-005:1782865] *** End of error message ***