Dear LAMMPS users,
I installed LAMMPS (2 Aug 2023 version) using CMake on a local cluster. The installation was successful, and ./nvc_get_devices reports:
Found 1 platform(s).
CUDA Driver Version: 12.0
Device 0: "NVIDIA GeForce RTX 4090"
Type of device: GPU
Compute capability: 8.9
Double precision support: Yes
Total amount of global memory: 23.6496 GB
Number of compute units/multiprocessors: 128
Number of cores: 24576
Total amount of constant memory: 65536 bytes
Total amount of local/shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per block: 1024
Maximum group size (# of threads per block) 1024 x 1024 x 64
Maximum item sizes (# threads for each dim) 2147483647 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 2.535 GHz
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Exclusive
Concurrent kernel execution: Yes
Device has ECC support enabled: No
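For reference, I configured the build along these lines (a representative sketch, not my exact command; the paths and the extra packages may differ on your system):

# run from the LAMMPS source tree; GPU_ARCH=sm_89 matches the card's
# compute capability 8.9, and the GPU precision defaults to mixed
mkdir build && cd build
cmake ../cmake \
  -D BUILD_MPI=on \
  -D PKG_GPU=on \
  -D GPU_API=cuda \
  -D GPU_ARCH=sm_89 \
  -D PKG_KSPACE=on \
  -D PKG_MOLECULE=on
cmake --build . -j 8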
I submitted some jobs from the bench folder with the following terminal command:
mpirun --np 1 .../build/lmp -sf gpu -pk gpu 0 -in in.lj
with output:
LAMMPS (2 Aug 2023 - Update 2)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
using 1 OpenMP thread(s) per MPI task
Lattice spacing in x,y,z = 1.6795962 1.6795962 1.6795962
Created orthogonal box = (0 0 0) to (33.591924 33.591924 33.591924)
1 by 1 by 1 MPI processor grid
Created 32000 atoms
using lattice units in orthogonal box = (0 0 0) to (33.591924 33.591924 33.591924)
create_atoms CPU = 0.002 seconds
CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE
Your simulation uses code contributions which should be cited:
* GPU package (short-range, long-range and three-body potentials): doi:10.1016/j.cpc.2010.12.021, doi:10.1016/j.cpc.2011.10.012, doi:10.1016/j.cpc.2013.08.002, doi:10.1016/j.commatsci.2014.10.068, doi:10.1016/j.cpc.2016.10.020, doi:10.3233/APC200086
The log file lists these citations in BibTeX format.
CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE
---
* Using acceleration for lj/cut:
* with 1 proc(s) per device.
* with 1 thread(s) per proc.
* Horizontal vector operations: ENABLED
* Shared memory system: No
---
Device 0: NVIDIA GeForce RTX 4090, 128 CUs, 23/24 GB, 2.5 GHZ (Mixed Precision)
Initializing Device and compiling on process 0...Done.
Initializing Device 0 on core 0...Done.
Generated 0 of 0 mixed pair_coeff terms from geometric mixing rule
Setting up Verlet run …
Unit style : lj
Current step : 0
Time step : 0.005
Per MPI rank memory allocation (min/avg/max) = 9.491 | 9.491 | 9.491 Mbytes
Step Temp E_pair E_mol TotEng Press
0 1.44 -6.7733683 0 -4.6134358 -5.019707
100 0.75745333 -5.7585059 0 -4.6223614 0.20726081
Loop time of 0.0440848 on 1 procs for 100 steps with 32000 atoms
Performance: 979929.146 tau/day, 2268.355 timesteps/s, 72.587 Matom-step/s
100.0% CPU use with 1 MPI tasks x 1 OpenMP threads
MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total
Pair | 0.024244 | 0.024244 | 0.024244 | 0.0 | 54.99
Neigh | 2.6e-07 | 2.6e-07 | 2.6e-07 | 0.0 | 0.00
Comm | 0.0089505 | 0.0089505 | 0.0089505 | 0.0 | 20.30
Output | 9.178e-05 | 9.178e-05 | 9.178e-05 | 0.0 | 0.21
Modify | 0.0053737 | 0.0053737 | 0.0053737 | 0.0 | 12.19
Other | | 0.005424 | | | 12.30
Nlocal: 32000 ave 32000 max 32000 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Nghost: 19657 ave 19657 max 19657 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Neighs: 0 ave 0 max 0 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Total # of neighbors = 0
Ave neighs/atom = 0
Neighbor list builds = 5
Dangerous builds not checked
---
Device Time Info (average):
---
Data Transfer: 0.0133 s.
Neighbor copy: 0.0001 s.
Neighbor build: 0.0005 s.
Force calc: 0.0043 s.
Device Overhead: 0.0031 s.
Average split: 1.0000.
Lanes / atom: 4.
Vector width: 32.
Prefetch mode: None.
Max Mem / Proc: 26.88 MB.
CPU Neighbor: 0.0021 s.
CPU Cast/Pack: 0.0093 s.
CPU Driver_Time: 0.0014 s.
CPU Idle_Time: 0.0122 s.
Total wall time: 0:00:00
But when I submit the job with 2 or more MPI tasks, it fails with the following error:
mpirun --np 2 .../build/lmp -sf gpu -pk gpu 0 -in in.lj
LAMMPS (2 Aug 2023 - Update 2)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
using 1 OpenMP thread(s) per MPI task
ERROR: Unable to initialize accelerator for use (src/GPU/gpu_extra.h:65)
Last command: package gpu 0
Cuda driver error 4 in call at file '/home/masharma/lammps-2Aug2023/lib/gpu/geryon/nvd_device.h' in line 429.
Cuda driver error 4 in call at file '/home/masharma/lammps-2Aug2023/lib/gpu/geryon/nvd_device.h' in line 430.
Cuda driver error 4 in call at file '/home/masharma/lammps-2Aug2023/lib/gpu/geryon/nvd_device.h' in line 429.
Cuda driver error 4 in call at file '/home/masharma/lammps-2Aug2023/lib/gpu/geryon/nvd_device.h' in line 430.
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[48720,1],1]
Exit code: 1
The output of nvcc --version is:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
My aim in using multiple MPI tasks is to speed up the k-space calculation for a TIP4P water model. The log file shows that with 1 MPI task plus the GPU, the k-space calculation takes most of the time (since pppm/tip4p has no GPU-accelerated variant); a sketch of the intended setup follows the timing breakdown below:
Performance: 29.764 ns/day, 0.806 hours/ns, 172.246 timesteps/s, 1.413 Matom-step/s
100.0% CPU use with 1 MPI tasks x 1 OpenMP threads
MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total
Pair | 0.81155 | 0.81155 | 0.81155 | 0.0 | 2.80
Bond | 0.00054313 | 0.00054313 | 0.00054313 | 0.0 | 0.00
Kspace | 25.872 | 25.872 | 25.872 | 0.0 | 89.13
Neigh | 0.0012502 | 0.0012502 | 0.0012502 | 0.0 | 0.00
Comm | 0.33599 | 0.33599 | 0.33599 | 0.0 | 1.16
Output | 0.040797 | 0.040797 | 0.040797 | 0.0 | 0.14
Modify | 1.8704 | 1.8704 | 1.8704 | 0.0 | 6.44
Other | | 0.09564 | | | 0.33
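For context, the intended production run looks roughly like this (a hypothetical input fragment, not my actual script; the atom/bond/angle type numbers and the qdist value are placeholders for a TIP4P/2005-style model):

# in.tip4p (hypothetical): the pair style runs on the GPU via -sf gpu,
# while kspace stays on the CPU because pppm/tip4p has no /gpu variant
units real
atom_style full
pair_style lj/cut/tip4p/long 1 2 1 1 0.1546 12.0
kspace_style pppm/tip4p 1.0e-4

# intended launch: several MPI ranks sharing the one GPU, e.g.
# mpirun --np 8 .../build/lmp -sf gpu -pk gpu 0 -in in.tip4p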
Thanks in advance.