Errors running the GPU package on an Nvidia A40

There is no error reported while building, but when I run it the code fails with:
“Cuda driver error 100 in call at file ‘/data/sourcecode/lammps/lammps-stable_29Sep2021/lib/gpu/geryon/nvd_device.h’ in line 323”
I tried both the latest version and the previous one; the code always fails at "CU_SAFE_CALL_NS(cuInit(0))".

The build environment: CentOS 7, CUDA 11.1.0 (installer CUDA_11.1.0_455.23.05), devtoolset-9, and Intel 2018.

cmake3 -C ../cmake/presets/basic.cmake -DCMAKE_INSTALL_PREFIX=/data/apps/lammps -D PKG_GPU=on -D GPU_API=cuda -D GPU_ARCH=sm_80/86 -DBUILD_MPI=yes -DBUILD_OMP=yes ../cmake

The problem should be related to the A40 GPU card. The same build was fine with my previous T40 GPU card.

Please provide the output of nvc_get_devices and the output of nvidia-smi.
What happens if you do not set GPU_ARCH?
Have you tried compiling with -D GPU_API=opencl? Do you get the same kind of error?

Yes, I set GPU_ARCH; I tried sm_75, sm_80, and sm_86 for the A40, but all failed.
I have not tried -D GPU_API=opencl. I haven’t installed an OpenCL library, and CUDA’s performance should be better on an Nvidia GPU card.

nvidia-smi

Thu Sep 30 12:42:25 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A40                 Off  | 00000000:AF:00.0 Off |                    0 |
|  0%   27C    P8    11W / 300W |      0MiB / 45634MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

nvc_get_devices

Found 1 platform(s).
CUDA Driver Version: 11.20
Device 0: “A40”
Type of device: GPU
Compute capability: 8.6
Double precision support: Yes
Total amount of global memory: 44.5645 GB
Number of compute units/multiprocessors: 84
Number of cores: 16128
Total amount of constant memory: 65536 bytes
Total amount of local/shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per block: 1024
Maximum group size (# of threads per block) 1024 x 1024 x 64
Maximum item sizes (# threads for each dim) 2147483647 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.74 GHz
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default
Concurrent kernel execution: Yes
Device has ECC support enabled: Yes

LAMMPS has its own OpenCL loader, so no additional software is needed. The CUDA driver (not the toolkit) comes with the required OpenCL runtime and ICD configuration file.

With the updates to the GPU package included in the 29Sep2021 stable release, the performance of OpenCL should be much better than in previous versions of the GPU package and in my tests it was comparable with the CUDA version. Besides, a CUDA version that crashes would not be faster than an OpenCL version that doesn’t. :wink:

I forgot to ask: which input are you running? Does this crash happen with any input?
Also with those in the bench or examples folders of LAMMPS?

I just tried -D GPU_API=opencl. The job still failed.
In the output:
LAMMPS (29 Sep 2021)
using 1 OpenMP thread(s) per MPI task
ERROR: Invalid OpenCL platform ID. (src/GPU/gpu_extra.h:77)
Last command: package gpu 0

It is just a test job:

3d Lennard-Jones melt

variable x index 1
variable y index 1
variable z index 1

variable xx equal 120*$x
variable yy equal 120*$y
variable zz equal 120*$z

units lj
atom_style atomic

lattice fcc 0.8442
region box block 0 ${xx} 0 ${yy} 0 ${zz}
create_box 1 box
create_atoms 1 box
mass 1 1.0

velocity all create 1.44 87287 loop geom

pair_style lj/cut 2.5
pair_coeff 1 1 1.0 1.0 2.5

neighbor 0.3 bin
neigh_modify delay 0 every 20 check no

fix 1 all nve

dump 1 all custom 100 lj.dump id type x y z vx vy vz

thermo 1000
run 10000

This looks very unusual, almost as if you cannot properly access the GPU for computing.

Have you been able to run any other GPU accelerated software?

You may also want to try out the KOKKOS package in LAMMPS which has a completely different code path than the GPU package.
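In case it helps, a KOKKOS/CUDA build uses different configure options than the GPU package. As a rough sketch, reusing the paths from the cmake command earlier in this thread (Kokkos_ARCH_AMPERE86 corresponds to the A40's compute capability 8.6; adjust preset, prefix, and architecture to your setup):

```shell
# Sketch of a KOKKOS/CUDA configure line for an A40 (compute capability 8.6).
# Paths are copied from the GPU-package cmake command shown above.
cmake3 -C ../cmake/presets/basic.cmake \
       -DCMAKE_INSTALL_PREFIX=/data/apps/lammps \
       -D PKG_KOKKOS=on \
       -D Kokkos_ENABLE_CUDA=yes \
       -D Kokkos_ARCH_AMPERE86=yes \
       -DBUILD_MPI=yes ../cmake
```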

It still failed using the KOKKOS package.

The error message:
terminate called after throwing an instance of ‘std::runtime_error’
what(): cudaGetDeviceCount(&m_cudaDevCount) error( cudaErrorNoDevice): no CUDA-capable device is detected /data/sourcecode/lammps/lammps-20Sep2021/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:224
Traceback functionality not available

It seems the Nvidia A40 GPU card cannot be recognized by LAMMPS.

I also tried the GPU version of VASP and a PyTorch-based code from our group. Those codes run fine.

LAMMPS is completely agnostic to the details of how to access the hardware.
This is all delegated to the CUDA toolkit and the corresponding driver. The fact that it always fails when opening the device is a strong hint in that direction. This suggests that there is something inconsistent with your machine setup or that you are not using a CUDA toolkit version compatible with your specific GPU.
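One way to confirm that the failure happens below LAMMPS is to call the same driver entry point directly. Here is a minimal sketch using Python's ctypes; probe_cuda_driver is just an illustrative name, it assumes the driver ships libcuda.so.1, and error code 100 is CUDA_ERROR_NO_DEVICE:

```python
# Probe the same CUDA driver API call (cuInit) that fails inside LAMMPS,
# without involving LAMMPS or the CUDA toolkit at all.
import ctypes


def probe_cuda_driver(libname="libcuda.so.1"):
    """Return the status code of cuInit(0), or None if the driver
    library cannot be loaded at all."""
    try:
        libcuda = ctypes.CDLL(libname)
    except OSError:
        return None  # driver library not installed or not on the loader path
    libcuda.cuInit.restype = ctypes.c_int
    libcuda.cuInit.argtypes = [ctypes.c_uint]
    return libcuda.cuInit(0)


status = probe_cuda_driver()
if status is None:
    print("libcuda.so.1 not found -- the NVIDIA driver is not visible")
elif status == 0:
    print("cuInit(0) succeeded -- driver and device are reachable")
else:
    print(f"cuInit(0) failed with error {status} (100 = no device detected)")
```

If this tiny probe also reports error 100 outside of LAMMPS, the problem is in the driver/machine setup rather than in any LAMMPS code path.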

To follow this up in a more consistent way, I suggest you provide a suitable summary of this discussion (please also include the output of nvcc --version and gcc --version) and then submit it as a “Bug report” issue at Issues · lammps/lammps · GitHub, so we can involve the people maintaining the relevant code as well as experts from Nvidia who know LAMMPS.

Many thanks. I will report it.