CUDA driver error

Dear LAMMPS GPU users/developers,

I ran into a CUDA driver runtime error when trying to run LAMMPS on a cluster with four Tesla S2050 GPUs attached to a CPU node. Specifically, the error message is:
Cuda driver error 101 in call at file 'geryon/nvd_device.h' in line 266.

Looking at nvd_device.h, the error occurs in a method that sets the CUDA device to the specified device number. In my fix gpu command I am currently just asking it to run on device 0, but changing the device number has no effect.
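For anyone hitting the same thing: error code 101 in the CUDA driver API is CUDA_ERROR_INVALID_DEVICE, i.e. the device ordinal handed to the set-device call did not map to a usable device. A minimal standalone sketch (not the geryon code itself) that shows the same code coming back when a bad ordinal is probed:

  // Sketch only: probe a device ordinal with the CUDA driver API.
  // CUDA_ERROR_INVALID_DEVICE == 101, the error code reported above.
  // Build with something like: g++ probe.cpp -lcuda
  #include <cuda.h>
  #include <cstdio>
  #include <cstdlib>

  int main(int argc, char **argv) {
    int ordinal = (argc > 1) ? std::atoi(argv[1]) : 0;  // device number to test
    cuInit(0);
    int count = 0;
    cuDeviceGetCount(&count);
    CUdevice dev;
    CUresult err = cuDeviceGet(&dev, ordinal);  // returns 101 for an invalid ordinal
    std::printf("devices found: %d, cuDeviceGet(%d) -> %d\n",
                count, ordinal, static_cast<int>(err));
    return 0;
  }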

I ran nvc_get_devices on the node, and the specifications match those I used when building the GPU library and compiling lmp_glory (I am showing only Device 0, but it found all four identical cards).
Found 1 platform(s).
Using platform: NVIDIA Corporation NVIDIA CUDA
CUDA Driver Version: 3.20
CUDA Runtime Version: 3.20

Device 0: "Tesla S2050"
   Type of device: GPU
   Compute capability: 2
   Double precision support: Yes
   Total amount of global memory: 2.99969 GB
   Number of compute units/multiprocessors: 14
   Number of cores: 448
   Total amount of constant memory: 65536 bytes
   Total amount of local/shared memory per block: 49152 bytes
   Total number of registers available per block: 32768
   Warp size: 32
   Maximum number of threads per block: 1024
   Maximum group size (# of threads per block) 1024 x 1024 x 64
   Maximum item sizes (# threads for each dim) 65535 x 65535 x 1
   Maximum memory pitch: 2147483647 bytes
   Texture alignment: 512 bytes
   Clock rate: 1.147 GHz
   Concurrent copy and execution: Yes
   Run time limit on kernels: No
   Integrated: No
   Support host page-locked memory mapping: Yes
   Compute mode: Exclusive
   Concurrent kernel execution: Yes
   Device has ECC support enabled: No

Any help would be appreciated.

Thanks,
Kevin

The error is an invalid device error. Not sure why you should get this. Are
there any non-CUDA-capable GPUs on the nodes?

Another user found a bug in the device selection in the GPU library that
has not yet been patched - it always starts at device 0.

Change

  int my_gpu=node_rank/_procs_per_gpu;

in lib/gpu/pair_gpu_device.cpp to

  int my_gpu=node_rank/_procs_per_gpu+first_gpu;

And see if selecting another device works...
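For context, a paraphrased sketch of what that selection does (the surrounding code in lib/gpu/pair_gpu_device.cpp is more involved, and the variable meanings below are my reading of the names, not documentation):

  // Paraphrased device-selection sketch, not the actual pair_gpu_device.cpp source.
  // Assumed meanings:
  //   node_rank      - this MPI rank's index among the ranks on its node
  //   _procs_per_gpu - how many of those ranks share one GPU
  //   first_gpu      - first device ordinal requested via the fix gpu command
  int select_device(int node_rank, int _procs_per_gpu, int first_gpu) {
    // Original line: int my_gpu = node_rank/_procs_per_gpu;
    // e.g. with 4 ranks per GPU, ranks 0-3 map to device 0, ranks 4-7 to
    // device 1, and so on -- always starting from device 0 regardless of
    // first_gpu, which is why asking for another device had no effect.
    int my_gpu = node_rank / _procs_per_gpu + first_gpu;  // patched line
    return my_gpu;  // ordinal later passed to the CUDA set-device call
  }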

- Mike

Thanks for the help. That change to my_gpu worked. I also had no problem selecting any of the other devices on the node.

> The error is an invalid device error. Not sure why you should get this. Are
> there any non-CUDA-capable GPUs on the nodes?

FYI, only CUDA-capable cards are on the nodes I selected.

Best,
Kevin

Hi Mike,

> The error is an invalid device error. Not sure why you should get this. Are
> there any non-CUDA-capable GPUs on the nodes?
>
> Another user found a bug in the device selection in the GPU library that
> has not yet been patched - it always starts at device 0.

I think I may have the same bug as above, because after testing some more I realized that some runs find the device and some give the same CUDA driver error that I mentioned in my original message. I am pretty confident that if the device number in my fix gpu command is currently busy with another job, it doesn't try to find an open device on the node. Was that the bug you mentioned above? Since we have four GPUs on a node, I have roughly a 1/4 chance that a run will land on the open card.
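One possibly relevant detail from the nvc_get_devices listing earlier in the thread: it reports "Compute mode: Exclusive", meaning only one process can hold a context on a GPU at a time, which would be consistent with runs failing whenever the requested card is already occupied (an assumption, not confirmed in this thread). A small standalone driver-API sketch, separate from LAMMPS, that reports the compute mode of each card:

  // Sketch: report the compute mode of every CUDA device via the driver API.
  // Not LAMMPS code; just an independent check.
  #include <cuda.h>
  #include <cstdio>

  int main() {
    cuInit(0);
    int count = 0;
    cuDeviceGetCount(&count);
    for (int i = 0; i < count; ++i) {
      CUdevice dev;
      cuDeviceGet(&dev, i);
      int mode = 0;
      cuDeviceGetAttribute(&mode, CU_DEVICE_ATTRIBUTE_COMPUTE_MODE, dev);
      std::printf("device %d compute mode: %d (0=default, 1=exclusive, 2=prohibited)\n",
                  i, mode);
    }
    return 0;
  }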

Thanks,
Kevin

Hi Kevin,

The bug I was referring to is fixed by the my_gpu change above. LAMMPS will not try to find a GPU that is not in use; you will have to use an NVIDIA tool such as nvidia-smi to check which cards are busy, and then set the GPU accordingly.
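As a rough illustration of that manual check (not something the GPU library does for you), one way to see which exclusive-mode cards are currently free is to try to create, and immediately release, a context on each:

  // Sketch: probe which devices currently accept a new context.
  // On an exclusive-mode card that is already in use, cuCtxCreate fails,
  // so that card should be skipped when picking the device for fix gpu.
  #include <cuda.h>
  #include <cstdio>

  int main() {
    cuInit(0);
    int count = 0;
    cuDeviceGetCount(&count);
    for (int i = 0; i < count; ++i) {
      CUdevice dev;
      cuDeviceGet(&dev, i);
      CUcontext ctx;
      CUresult err = cuCtxCreate(&ctx, 0, dev);   // fails if the card is busy
      if (err == CUDA_SUCCESS) {
        std::printf("device %d: free\n", i);
        cuCtxDestroy(ctx);                        // release it again right away
      } else {
        std::printf("device %d: unavailable (CUresult %d)\n", i, (int)err);
      }
    }
    return 0;
  }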

- Mike