Some errors while running the sample input: in.rhodo.cuda

Hi,

I encountered errors while running the sample input script in.rhodo.cuda with the USER-CUDA package. The other two sample scripts (in.eam.cuda, in.lj.cuda) run normally.
I only changed the precision to mixed precision and left all other settings at their defaults.

- Command:
mpirun -np 1 ./cuda_mix -sf cuda -v g 1 -v x 1 -v y 1 -v z 1 -v t 100 < in.rhodo.cuda

- Error message:

Using device 0: Quadro 6000

Cuda error: FixShakeCuda_Shake: Kernel execution failed in file ‘fix_shake_cuda.cu’ in line 156 : unspecified launch failure.

- Output:

LAMMPS (4 Jul 2012)

Using LAMMPS_CUDA

USER-CUDA mode is enabled (lammps.cpp:396)
using 1 OpenMP thread(s) per MPI task

CUDA: Activate GPU

Scanning data file …
4 = max bonds/atom
8 = max angles/atom
18 = max dihedrals/atom
2 = max impropers/atom
Reading data file …
orthogonal box = (-27.5 -38.5 -36.3646) to (27.5 38.5 36.3615)
1 by 1 by 1 MPI processor grid
32000 atoms
32000 velocities
27723 bonds
40467 angles
56829 dihedrals
1034 impropers
Finding 1-2 1-3 1-4 neighbors …
4 = max # of 1-2 neighbors
12 = max # of 1-3 neighbors
24 = max # of 1-4 neighbors
26 = max # of special neighbors
Replicating atoms …
orthogonal box = (-27.5 -38.5 -36.3646) to (27.5 38.5 36.3615)
1 by 1 by 1 MPI processor grid
32000 atoms
27723 bonds
40467 angles
56829 dihedrals
1034 impropers
Finding 1-2 1-3 1-4 neighbors …
4 = max # of 1-2 neighbors
12 = max # of 1-3 neighbors
24 = max # of 1-4 neighbors
26 = max # of special neighbors

Finding SHAKE clusters …
1617 = # of size 2 clusters
3633 = # of size 3 clusters
747 = # of size 4 clusters
4233 = # of frozen angles
PPPMCuda initialization …
G vector = 0.248831
grid = 25 32 32
stencil order = 5
absolute RMS force accuracy = 0.025142
relative force accuracy = 7.57143e-05
brick FFT buffer size/proc = 41070 25600 12321
WARNING: # CUDA: You asked for the usage of Coulomb Tables. This is not supported in CUDA Pair forces. Setting is ignored.
(pair_lj_charmm_coul_long_cuda.cpp:171)

CUDA: VerletCuda::setup: Allocate memory on device for maximum of 32000 atoms…

CUDA: Using precision: Global: 4 X: 8 V: 8 F: 4 PPPM: 4

Setting up run …

CUDA: VerletCuda::setup: Upload data…

Test TpA
Test BpA

CUDA: Timing of parallelisation layout with 10 loops:

CUDA: BpA TpA

16.827072 18.359028

CUDA: Total Device Memory useage post setup: 168.070312 MB

Memory usage per processor = 98.3832 Mbytes
---------------- Step 0 ----- CPU = 0.0000 (sec) ----------------
TotEng = -25356.1745 KinEng = 21444.8303 Temp = 299.0397
PotEng = -46801.0048 E_bond = 2537.9940 E_angle = 10921.3742
E_dihed = 5211.7865 E_impro = 213.5116 E_vdwl = -2307.8633
E_coul = 207021.6923 E_long = -270399.5001 Press = -142.5990
Volume = 307995.0335
========= CUDA-MEMCHECK
========= Invalid global read of size 4
========= at 0x00008a60 in FixShakeCuda_Shake_Kernel
========= by thread (0,0,0) in block (82,0,0)
========= Address 0x0000cb74 is out of bounds

Christian can probably answer this.

Steve