Error running LAMMPS with CUDA

Dear LAMMPS users and developers,

I ran into a CUDA driver error when trying to run LAMMPS on a GPU compute node with four NVIDIA A100 GPUs. I tested with the command mpirun -np 32 /home/sijiachen/software/lammps-29Sep2021/build/lmp_beagle3_a100 -sf gpu -pk gpu 1 -in in.chain (in.chain is the input from the LAMMPS bench folder), and got the following error message:

IPL WARN> IPL_init_numa_nodes: can not define numa node num
LAMMPS (29 Sep 2021 - Update 3)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
using 1 OpenMP thread(s) per MPI task
Cuda driver error 100 in call at file '/home/sijiachen/software/lammps-29Sep2021/lib/gpu/geryon/nvd_device.h' in line 323.
Abort(-1) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0

My LAMMPS version is 29Sep2021. I built LAMMPS with cmake/3.19, intelmpi/2021.5, gcc/10.2.0, and cuda/11.5, using the following command:

cmake -D LAMMPS_MACHINE=beagle3_a100 -D FFT=FFTW3 -D FFT_SINGLE=no -D FFT_PACK=array -D FFTW3_INCLUDE_DIR=/software/fftw3-3.3.9-el8-x86_64/include -D FFTW3_LIBRARY=/software/fftw3-3.3.9-el8-x86_64/lib/libfftw3.so -D LAMMPS_SIZES=smallbig -D LAMMPS_MEMALIGN=64 -D PKG_GPU=yes -D GPU_API=cuda -D GPU_PREC=mixed -D GPU_ARCH=sm_80 -D PKG_OPENMP=yes -D PKG_PLUMED=yes -D PLUMED_MODE=shared -D PKG_DRUDE=yes -D PKG_MOLECULE=yes -D PKG_KSPACE=yes -D PKG_FEP=yes -D PKG_CLASS2=yes -D PKG_RIGID=yes -D PKG_CORESHELL=yes …/cmake

I ran nvc_get_devices on the compute node and got the following results (only Device 0 is shown here, but it found all four identical cards):

Found 1 platform(s).
CUDA Driver Version: 11.50

Device 0: "NVIDIA A100-PCIE-40GB"
Type of device: GPU
Compute capability: 8
Double precision support: Yes
Total amount of global memory: 39.5861 GB
Number of compute units/multiprocessors: 108
Number of cores: 20736
Total amount of constant memory: 65536 bytes
Total amount of local/shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per block: 1024
Maximum group size (# of threads per block) 1024 x 1024 x 64
Maximum item sizes (# threads for each dim) 2147483647 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.41 GHz
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default
Concurrent kernel execution: Yes
Device has ECC support enabled: Yes

I am not sure whether this information is sufficient. Please let me know if any other information is needed to track down the problem. Any help would be appreciated!

Thanks,
Sijia

Please try running with just one MPI rank instead of 32.

Thank you for your prompt reply!

With mpirun -n 1, I get the same error message:

IPL WARN> IPL_init_numa_nodes: can not define numa node num
LAMMPS (29 Sep 2021 - Update 3)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
using 1 OpenMP thread(s) per MPI task
Cuda driver error 100 in call at file '/home/sijiachen/software/lammps-29Sep2021/lib/gpu/geryon/nvd_device.h' in line 323.
Abort(-1) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0

I have checked the code that triggers the error, and it seems that LAMMPS fails to initialize the GPU device(s) at the very first step, before any GPU work is done. CUDA driver error code 100 (CUDA_ERROR_NO_DEVICE) means that the driver detected no CUDA-capable device. So this may be due to a permission mismatch or some other local setup issue. If you are running on a machine with a batch system, you may not have requested GPU access.
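A minimal sketch of a quick check to run inside the same batch job (or interactive session) that launches LAMMPS, to see whether that environment can see any GPUs at all. It assumes a Linux node where the NVIDIA driver exposes per-GPU device nodes as /dev/nvidia0, /dev/nvidia1, ..., and a batch system that advertises assigned GPUs through CUDA_VISIBLE_DEVICES; both are assumptions about the local setup, not something stated above. The function names count_listed_gpus and report are hypothetical helpers. The script only inspects the environment; it does not call the CUDA driver itself.

```shell
#!/bin/sh
# Sketch only: check GPU visibility from inside a job.
# Assumptions (hypothetical, site-dependent): per-GPU device nodes are
# /dev/nvidia0, /dev/nvidia1, ...; the batch system sets CUDA_VISIBLE_DEVICES;
# nvidia-smi is on PATH when the driver is installed.

# Hypothetical helper: count the entries in a CUDA_VISIBLE_DEVICES-style list.
# This is a naive count; it does not validate the indices themselves.
count_listed_gpus() {
    list=$1
    if [ -z "$list" ]; then
        echo 0
        return
    fi
    # Split on commas and count the non-empty fields.
    printf '%s\n' "$list" | tr ',' '\n' | grep -c .
}

# Hypothetical helper: print what this job can see.
report() {
    echo "CUDA_VISIBLE_DEVICES: ${CUDA_VISIBLE_DEVICES-<unset>}"
    if [ -n "${CUDA_VISIBLE_DEVICES+set}" ]; then
        echo "GPUs listed: $(count_listed_gpus "$CUDA_VISIBLE_DEVICES")"
    fi

    # Each GPU device node must exist and be readable/writable by this user.
    found=0
    for dev in /dev/nvidia[0-9]*; do
        [ -e "$dev" ] || continue   # glob did not match anything
        found=$((found + 1))
        if [ -r "$dev" ] && [ -w "$dev" ]; then
            echo "$dev: accessible"
        else
            echo "$dev: NOT accessible (permission problem?)"
        fi
    done
    echo "GPU device nodes found: $found"

    # nvidia-smi -L lists the GPUs the driver can see from this environment.
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi -L
    else
        echo "nvidia-smi not found in PATH"
    fi
}

report
```

If this reports no accessible device nodes, or an empty CUDA_VISIBLE_DEVICES, inside the job while nvc_get_devices works elsewhere, GPUs most likely have to be requested explicitly at submission time. On Slurm this is typically done with --gres=gpu:4 or --gpus-per-node=4, but the exact syntax depends on the site configuration.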

Thank you so much! This information is really helpful. I will contact our cluster manager to check whether anything is wrong with the permissions or the local setup.
I hope you have a nice week.