Dear LAMMPS users and developers,
I ran into a CUDA driver error when trying to run LAMMPS on a GPU compute node with four NVIDIA A100 cards. I tested with the following command, where in.chain is the input file from the LAMMPS bench folder:
mpirun -np 32 /home/sijiachen/software/lammps-29Sep2021/build/lmp_beagle3_a100 -sf gpu -pk gpu 1 -in in.chain
The output, including the error message, is:
IPL WARN> IPL_init_numa_nodes: can not define numa node num
LAMMPS (29 Sep 2021 - Update 3)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
using 1 OpenMP thread(s) per MPI task
Cuda driver error 100 in call at file '/home/sijiachen/software/lammps-29Sep2021/lib/gpu/geryon/nvd_device.h' in line 323.
Abort(-1) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
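For reference, the numeric code in a message like the one above can be pulled out of a log and matched against the CUresult values in cuda.h. This is only a minimal sketch: the two function names are hypothetical, the parsing assumes the log line has the same wording as mine, and the table covers just a few codes I am sure of.

```shell
# Extract the numeric code from a "Cuda driver error N in call at file ..." line.
cuda_error_code() {
    printf '%s\n' "$1" | sed -n 's/.*Cuda driver error \([0-9][0-9]*\) in call.*/\1/p'
}

# Map a few CUresult values (as listed in cuda.h) to their names.
# Any other code is reported as unknown rather than guessed.
cuda_error_name() {
    case "$1" in
        1)   echo "CUDA_ERROR_INVALID_VALUE" ;;
        2)   echo "CUDA_ERROR_OUT_OF_MEMORY" ;;
        3)   echo "CUDA_ERROR_NOT_INITIALIZED" ;;
        100) echo "CUDA_ERROR_NO_DEVICE" ;;
        101) echo "CUDA_ERROR_INVALID_DEVICE" ;;
        *)   echo "unknown (see cuda.h)" ;;
    esac
}

line="Cuda driver error 100 in call at file 'nvd_device.h' in line 323."
code=$(cuda_error_code "$line")
echo "code $code: $(cuda_error_name "$code")"
```

If I read it correctly, code 100 would mean that the failing process did not see any CUDA-capable device at the time of the call.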
My LAMMPS version is 29Sep2021, and I built it using cmake/3.19, intelmpi/2021.5, gcc/10.2.0 and cuda/11.5, with the following command:
cmake -D LAMMPS_MACHINE=beagle3_a100 -D FFT=FFTW3 -D FFT_SINGLE=no -D FFT_PACK=array -D FFTW3_INCLUDE_DIR=/software/fftw3-3.3.9-el8-x86_64/include -D FFTW3_LIBRARY=/software/fftw3-3.3.9-el8-x86_64/lib/libfftw3.so -D LAMMPS_SIZES=smallbig -D LAMMPS_MEMALIGN=64 -D PKG_GPU=yes -D GPU_API=cuda -D GPU_PREC=mixed -D GPU_ARCH=sm_80 -D PKG_OPENMP=yes -D PKG_PLUMED=yes -D PLUMED_MODE=shared -D PKG_DRUDE=yes -D PKG_MOLECULE=yes -D PKG_KSPACE=yes -D PKG_FEP=yes -D PKG_CLASS2=yes -D PKG_RIGID=yes -D PKG_CORESHELL=yes …/cmake
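To rule out interactions with the other packages, a reduced configure line that keeps only the GPU-related options from the command above (plus MOLECULE, which in.chain needs for its bonds) could be assembled as follows. This is a sketch under assumptions: LAMMPS_CMAKE_DIR and the machine name beagle3_a100_min are hypothetical placeholders, and the function only prints the command instead of running it.

```shell
# Hypothetical placeholder for the path to the LAMMPS cmake directory.
LAMMPS_CMAKE_DIR=${LAMMPS_CMAKE_DIR:-/path/to/lammps/cmake}

# Print (do not run) a reduced configure command with only the GPU options
# from my full build plus the MOLECULE package.
minimal_gpu_cmake_cmd() {
    echo "cmake -D LAMMPS_MACHINE=beagle3_a100_min" \
         "-D PKG_GPU=yes -D GPU_API=cuda -D GPU_PREC=mixed -D GPU_ARCH=sm_80" \
         "-D PKG_MOLECULE=yes" \
         "$LAMMPS_CMAKE_DIR"
}

minimal_gpu_cmake_cmd
```

The printed line can be inspected first and then pasted into a fresh build directory.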
I ran nvc_get_devices on the compute node and got the following results (I am showing only Device 0, but it found all four identical cards):
Found 1 platform(s).
CUDA Driver Version: 11.50
Device 0: "NVIDIA A100-PCIE-40GB"
Type of device: GPU
Compute capability: 8
Double precision support: Yes
Total amount of global memory: 39.5861 GB
Number of compute units/multiprocessors: 108
Number of cores: 20736
Total amount of constant memory: 65536 bytes
Total amount of local/shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per block: 1024
Maximum group size (# of threads per block) 1024 x 1024 x 64
Maximum item sizes (# threads for each dim) 2147483647 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.41 GHz
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default
Concurrent kernel execution: Yes
Device has ECC support enabled: Yes
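To check whether the cards are also visible from inside the launch environment of the failing run, something like the following could be used. It is a minimal sketch, assuming nvidia-smi is installed on the compute node and that nvidia-smi -L prints one "GPU N: ..." line per device; count_gpus is a hypothetical helper name.

```shell
# Count devices in "nvidia-smi -L" style output read from stdin
# (one line per device, each starting with "GPU <index>:").
count_gpus() {
    grep -c '^GPU [0-9][0-9]*:'
}

# Run the check only if nvidia-smi is available on this machine.
if command -v nvidia-smi >/dev/null 2>&1; then
    echo "visible GPUs: $(nvidia-smi -L | count_gpus)"
else
    echo "nvidia-smi not found on this machine"
fi
```

Inside the job, the same helper can be fed from an MPI launch, for example mpirun -np 1 nvidia-smi -L | count_gpus, and the count compared with the four cards reported above.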
I am not sure whether this information is enough. Please let me know if any other information is needed to figure out the problem. Any help would be appreciated!
Thanks,
Sijia