Cannot run LAMMPS on Tesla K20c

Dear LAMMPS users,

I installed a new Tesla K20c card in my workstation. I compiled LAMMPS for this card with -arch=sm_35 and successfully obtained the executable "lmp_openmpi".
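
For reference, the relevant lines in lib/gpu/Makefile.linux look roughly like the sketch below; the CUDA install path and the precision flag are only illustrative and depend on the system.

# location of the CUDA toolkit (illustrative path)
CUDA_HOME = /usr/local/cuda
NVCC = nvcc
# compute capability 3.5 for the Tesla K20c
CUDA_ARCH = -arch=sm_35
# mixed precision; -D_SINGLE_SINGLE and -D_DOUBLE_DOUBLE are the alternatives
CUDA_PRECISION = -D_SINGLE_DOUBLE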

But whenever I try to launch a simulation on that GPU card, it reports the error:

ERROR: GPU library not compiled for this accelerator (gpu_extra.h:40)
Cuda driver error 4 in call at file ‘geryon/nvd_device.h’ in line 116.

The build did produce libgpu.a in lib/gpu, and the same procedure worked fine on other GPU cards such as the Quadro 2000 and Tesla C2070.

This is the output from deviceQuery:

Detected 2 CUDA Capable device(s)

Device 0: “Tesla K20c”
CUDA Driver Version / Runtime Version 5.0 / 5.0
CUDA Capability Major/Minor version number: 3.5
Total amount of global memory: 4800 MBytes (5032706048 bytes)
(13) Multiprocessors x (192) CUDA Cores/MP: 2496 CUDA Cores
GPU Clock rate: 706 MHz (0.71 GHz)
Memory Clock rate: 2600 Mhz
Memory Bus Width: 320-bit
L2 Cache Size: 1310720 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 3 / 0
Compute Mode:
< Exclusive Process (many threads in one process is able to use ::cudaSetDevice() with this device) >

Device 1: “Quadro NVS 290”
CUDA Driver Version / Runtime Version 5.0 / 5.0
CUDA Capability Major/Minor version number: 1.1
Total amount of global memory: 255 MBytes (267714560 bytes)
( 2) Multiprocessors x ( 8) CUDA Cores/MP: 16 CUDA Cores
GPU Clock rate: 918 MHz (0.92 GHz)
Memory Clock rate: 400 Mhz
Memory Bus Width: 64-bit
Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192) x 512
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per multiprocessor: 768
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 2 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Thank you in advance.

Wei

In your package gpu line, are you selecting only GPU 0 (see the documentation)? It might also be trying to run on the Quadro if you are using multiple MPI ranks. Thanks. - Mike

Thank you for your reply.

Yes, I used the command "package gpu force/neigh 0 0 1".
I also tried "package gpu force/neigh 1 1 1" and "package gpu force 0 0 1".

It always reports the same error.
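
For what it's worth, a minimal test deck along these lines (just the standard LJ melt settings; the values are only illustrative) is enough to exercise the GPU package:

# minimal LJ melt deck using the GPU package (illustrative values)
package gpu force/neigh 0 0 1
units lj
atom_style atomic
lattice fcc 0.8442
region box block 0 10 0 10 0 10
create_box 1 box
create_atoms 1 box
mass 1 1.0
velocity all create 1.44 87287 loop geom
pair_style lj/cut/gpu 2.5
pair_coeff 1 1 1.0 1.0 2.5
neighbor 0.3 bin
fix 1 all nve
run 100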

Best wishes,

Wei

Hi Wei,

I got LAMMPS working just fine on a K20 system. For testing, maybe try removing the other graphics card if that is possible, or disabling it in the BIOS?

The compute mode of the K20 should be "0/Default"; you currently have it set to "Exclusive Process".
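
If I remember correctly, something along these lines switches it (a sketch; assumes the K20 is device index 0 and that you have root rights):

# check the current compute mode
nvidia-smi -q -d COMPUTE
# set the K20 back to the default compute mode
nvidia-smi -i 0 -c DEFAULT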

AFAIK, if you run 1 CPU thread (serial mode, without mpirun), you can only use 1 GPU (unless you start the CUDA proxy for a single thread).
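
So for a first test, running with just one process along these lines should be enough (in.test is only a placeholder for your input file):

./lmp_openmpi -in in.test
mpirun -np 1 ./lmp_openmpi -in in.test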

Greetings, Pim

I synced with the current version of LAMMPS and ran the full set of regression tests on a desktop with a K20. I also downloaded a clean version from the LAMMPS site, and this also worked.

Do you see the problem with only 1 MPI process using the GPU?
If you can't get it to work with a clean version of LAMMPS, please change the variables in your lib/gpu Makefile to read:

CUDR_CPP = mpic++ -g -DUCL_SYNC_DEBUG
CUDR_OPTS =

Clean, remake, relink and then send me the output from the stack trace when the error occurs. Thanks. - Mike
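
Something along these lines should do it (a sketch; assumes the Makefile.linux target in lib/gpu and the openmpi machine makefile in src):

cd lib/gpu
make -f Makefile.linux clean
make -f Makefile.linux
cd ../../src
make clean-openmpi
make openmpi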

Thank you for your reply.

Yes, I see the problem with only 1 MPI process using the GPU. I installed a clean version with the Makefile changes as you suggested; the output is:

LAMMPS (22 Mar 2013)