Cuda driver error 1 in call at file 'geryon/nvd_kernel.h' in line 338

Hi,
I ran into an error after updating my NVIDIA CUDA version to cuda-12.0. My LAMMPS version is 22Dec2022 (23Jun2022 has also been tried), and lmp_mpi was re-compiled.
The error information is:
Initializing Device and compiling on process 0…Done.
Initializing Device 0 on core 0…Done.
Initializing Device 0 on core 1…Done.

Generated 0 of 1 mixed pair_coeff terms from geometric mixing rule
Setting up sd style minimization …
Unit style : metal
Current step : 0
Cuda driver error 1 in call at file 'geryon/nvd_kernel.h' in line 338.
Cuda driver error 1 in call at file 'geryon/nvd_kernel.h' in line 338.
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1

Because the same input file ran successfully with cuda-11.7 (LAMMPS version 23Jun2022), it can be inferred that the problem lies in the CUDA version. I don't know how to solve this problem, and I think it may be a new one (it's only been a few days since cuda-12.0 was released).
Thank you very much!
Sincerely

What GPU do you have? What is the output of nvc_get_devices?

The obvious choice would be to downgrade CUDA to a version that works. There really is not much of a reason to upgrade CUDA to a newer version unless there is something specific you need that only the new version offers, but those are rare these days.

Thank you very much for your reply.
The output of nvc_get_devices is:
Found 1 platform(s).
CUDA Driver Version: 12.0

Device 0: “NVIDIA GeForce GTX 960”
Type of device: GPU
Compute capability: 5.2
Double precision support: Yes
Total amount of global memory: 3.99976 GB
Number of compute units/multiprocessors: 8
Number of cores: 1536
Total amount of constant memory: 65536 bytes
Total amount of local/shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per block: 1024
Maximum group size (# of threads per block) 1024 x 1024 x 64
Maximum item sizes (# threads for each dim) 2147483647 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.266 GHz
Run time limit on kernels: Yes
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default
Concurrent kernel execution: Yes
Device has ECC support enabled: No

Actually, something was wrong with my PC, so I just re-compiled everything. The only CUDA version I could download from the official NVIDIA website was the latest one, cuda-12.0, and then the error occurred. Thank you for your suggestion.

That is not true. Older releases are still available: CUDA Toolkit 11.7 Downloads | NVIDIA Developer

When recompiling, did you adjust the GPU architecture in the makefile before compiling the gpu library? The default is currently set up for a Pascal generation GPU (i.e. sm_60) while you have a Maxwell generation GPU (i.e. sm_52).
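For the traditional make build, that is a one-line change (a sketch; the exact variable name may differ between Makefile variants and LAMMPS versions):

# in lib/gpu/Makefile.linux
CUDA_ARCH = -arch=sm_52   # Maxwell, e.g. GTX 960; the default is -arch=sm_60

then rebuild the gpu library with make -f Makefile.linux and re-link lmp_mpi.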

You may consider using CMake for building, or use Makefile.linux_multi, since those compile all CUDA kernels for multiple architectures (i.e. create so-called “fat binaries” of the CUDA code).
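With CMake, a build along these lines should work (GPU_ARCH is optional here, since as noted above the CMake build targets multiple architectures by default):

$ mkdir build && cd build
$ cmake -C ../cmake/presets/basic.cmake -D PKG_GPU=on -D GPU_API=cuda -D GPU_ARCH=sm_52 ../cmake
$ cmake --build . --parallel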

Yes, I adjusted it. Actually, it was a fresh install.

Thank you very much. I think I can solve the problem by recompiling lmp_mpi with cuda-11.7.

Chiming in. My freshly updated LAMMPS build (LAMMPS 22 Dec 2022) and freshly installed CUDA 12 toolkit result in the very same error :frowning:
I am quite hesitant to try downgrading my CUDA, since NVIDIA goes to extreme lengths to keep one on their most recent version. In my experience, even the most diligent purging of an install has left remnants behind, leading to a Frankenstein system.

$ nvc_get_devices
Found 1 platform(s).
CUDA Driver Version: 12.0

Device 0: “Quadro P5000”
Type of device: GPU
Compute capability: 6.1
Double precision support: Yes
Total amount of global memory: 15.8588 GB
Number of compute units/multiprocessors: 20
Number of cores: 3840
Total amount of constant memory: 65536 bytes
Total amount of local/shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per block: 1024
Maximum group size (# of threads per block) 1024 x 1024 x 64
Maximum item sizes (# threads for each dim) 2147483647 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.7335 GHz
Run time limit on kernels: Yes
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default
Concurrent kernel execution: Yes
Device has ECC support enabled: No

Please file a bug-report issue about this on GitHub

You could try compiling for OpenCL instead of CUDA. That may still work. If you are using a recent version of LAMMPS there should be no significant performance difference.
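If you want to try, a minimal sketch of such a build (OpenCL compiles its kernels at runtime for the GPU it finds, so no architecture flag is needed):

$ cmake -C ../cmake/presets/basic.cmake -D PKG_GPU=on -D GPU_API=opencl ../cmake
$ cmake --build . --parallel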

When you install from the “runfile” and only install the CUDA toolkit (not the driver, samples, docs, etc.), you can install multiple versions at the same time. I use environment modules via “lmod” to load, unload, or switch between them. You want the latest driver and kernel module, but using an older toolkit is fine; an example install command follows the listing. I have quite a collection (and a bunch of older gcc compilers to match):

$ ll -ld !$/cuda*
ll -ld /usr/local//cuda*
lrwxrwxrwx  1 akohlmey klein   21 Feb  1 10:07 /usr/local//cuda -> /usr/local/cuda-12.0/
drwxr-xr-x 15 root     root  4096 Jan  6  2020 /usr/local//cuda-10.2
drwxr-xr-x 15 akohlmey klein 4096 Mar 21  2022 /usr/local//cuda-11.6
drwxr-xr-x 17 akohlmey klein 4096 Nov  1 12:01 /usr/local//cuda-11.8
drwxr-xr-x 17 akohlmey klein 4096 Feb  1 10:08 /usr/local//cuda-12.0
drwxr-xr-x 15 akohlmey klein 4096 Dec 12  2017 /usr/local//cuda-9.1
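For example, adding the 11.8 toolkit next to 12.0 would be something like this (filename taken from NVIDIA's download page; adjust to the release you pick, and see the installer's --help for the exact options):

$ sudo sh cuda_11.8.0_520.61.05_linux.run --silent --toolkit --installpath=/usr/local/cuda-11.8

and a minimal lmod modulefile sketch (the path /etc/lmod/modules/cuda/11.8.lua is just an example) only needs:

-- make this toolkit's compiler and libraries visible
prepend_path("PATH", "/usr/local/cuda-11.8/bin")
prepend_path("LD_LIBRARY_PATH", "/usr/local/cuda-11.8/lib64")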

I can confirm that version 12.0.1 causes the kernels to fail launching, while version 11.8.0 compiles the GPU package fine and the resulting executables run. The same is true when using OpenCL instead of CUDA.

In case anyone’s curious, here are some benchmarks after recompiling with OpenCL:
$ mpirun -np 1 lmp -sf gpu -var x 2 -var y 2 -var z 4 -in in.rhodo.scaled
Total wall time: 0:00:21

$ mpirun -np 2 lmp -sf gpu -var x 2 -var y 2 -var z 4 -in in.rhodo.scaled
Total wall time: 0:00:17

$ mpirun -np 1 lmp -var x 2 -var y 2 -var z 4 -in in.rhodo.scaled
Total wall time: 0:03:25

$ mpirun -np 2 lmp -var x 2 -var y 2 -var z 4 -in in.rhodo.scaled
Total wall time: 0:01:43

$ mpirun -np 4 lmp -var x 2 -var y 2 -var z 4 -in in.rhodo.scaled
Total wall time: 0:00:54

$ mpirun -np 8 lmp -var x 2 -var y 2 -var z 4 -in in.rhodo.scaled
Total wall time: 0:00:28

Nice speedup :smiley:

Full output for the GPU run in the attached file.
Benchmarks_OpenCLvsMPI (8.3 KB)

A little belated, but perhaps this helps. I encountered a similar error:
ISSUE:
Cuda driver error 999 in …line 333… call at file nvd_device.h … Unknown error
while running lmp -sf gpu -in in.rhodo (or any other suitable bench).

The problem was solved by running sudo apt-get upgrade; running sudo apt-get update alone had no effect.
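In other words (on this Debian-based system):

$ sudo apt-get update    # refreshes the package lists; this alone did not help
$ sudo apt-get upgrade   # installing the pending updates made the error go away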

SYSTEM:
Linux Mint, LAMMPS 28Mar23, 2x RTX 3090, wall time 0:00:01
CUDA driver 12.0, runtime 11.7. LAMMPS compiled with:
cmake -C ../cmake/presets/basic.cmake -D PKG_GPU=on -D GPU_API=CUDA -D BIN2C=/usr/local/cuda-11.7/bin/bin2c ../cmake
(the BIN2C setting points to the toolkit's bin2c so it can be found; it could not be found in earlier trials)
