unspecified launch failure in USER-CUDA

christian,

as of recently, i have to force the thread-per-atom (TpA)
strategy when using the USER-CUDA package, or else i get
the dreaded "unspecified launch failure".
as a workaround, i am using this:
package cuda gpu/node 4 override/bpa 0

i am posting this to the lammps mailing list, since
i wonder if i am the only one seeing this. perhaps
others can chime in and let us know whether or not
this works for them, with which compilation flags, and
on which hardware they run. it is not that big of
an issue for me right now, since for my current
application, the GPU package is actually
performing a bit faster when oversubscribing the
GPUs 2x, but it is annoying and should be tracked
down and fixed.

this is with CUDA 4.1 and libcuda.a compiled
for mixed precision and happens on different
machines and with both consumer and tesla
generation 2.0 hardware.

let me know if you need anything else.

cheers,
    axel.

Hi Axel

Can you send me your input script? I had one other incident reported and have seen it myself as well, but I can't reproduce it with my current configuration. If you have a reproduction case, I'd be happy to get it.

Best regards
Christian

-------- Original Message --------

i have two, albeit very similar, ones.

i don't want to post them to the list, since the
smaller one is over 20 MB compressed.
you can either grab them from the NCSA forge
machine or i can dump them on our local webserver.

let me know which way of transport you prefer.

thanks,
     axel.

btw: here is the output:

[[email protected]... test] mpirun -np 6 -mca btl sm,self ~akohlmey/compile/lammps-icms/src/lmp_forge-cuda -log none -in in.two_vesicle-gpu >& run.out
[akohlmey@...3255... test] cat run.out
LAMMPS (17 Feb 2012-ICMS)
# Using LAMMPS_CUDA
USER-CUDA mode is enabled (lammps.cpp:396)
# CUDA: Activate GPU
# Using device 1: Tesla M2070
# Using device 1: Tesla M2070
# Using device 0: Tesla M2070
Scanning data file ...
  3 = max bonds/atom
  6 = max angles/atom
Reading data file ...
  orthogonal box = (-121.648 -121.648 -243.297) to (121.648 121.648 243.297)
# Using device 0: Tesla M2070
# Using device 1: Tesla M2070
# Using device 0: Tesla M2070
  1 by 2 by 3 MPI processor grid
  325044 atoms
  325044 velocities
  43056 bonds
  43056 angles
Finding 1-2 1-3 1-4 neighbors ...
  3 = max # of 1-2 neighbors
  3 = max # of 1-3 neighbors
  6 = max # of 1-4 neighbors
  8 = max # of special neighbors
Finding 1-2 1-3 1-4 neighbors ...
  3 = max # of 1-2 neighbors
  3 = max # of 1-3 neighbors
  6 = max # of special neighbors
PPPM initialization ...
  G vector (1/distance)= 0.0660257
  grid = 12 12 24
  stencil order = 3
  RMS precision = 5.6801e-06
  using double precision FFTs
  brick FFT buffer size/proc = 1485 576 990
# CUDA: VerletCuda::setup: Allocate memory on device for maximum of 59591 atoms...
# CUDA: Using precision: Global: 4 X: 8 V: 8 F: 4 PPPM: 4
Setting up run ...
# CUDA: VerletCuda::setup: Upload data...
Test TpA
Test BpA
Cuda error: Cuda_Pair: before updateNmax failed in file 'cuda_pair.cu' in line 876 : unspecified launch failure.
Cuda error: Cuda_Pair: before updateNmax failed in file 'cuda_pair.cu' in line 876 : unspecified launch failure.
Cuda error: Cuda_Pair: before updateNmax failed in file 'cuda_pair.cu' in line 876 : unspecified launch failure.
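
For anyone comparing reports of this failure across machines, a small script can tally the "Cuda error:" messages in a captured run.out by source file, line, and message. The following is only a rough sketch: the function name `summarize_cuda_errors` is made up for illustration, and the pattern assumes the message layout seen in the output above (it is not taken from the USER-CUDA sources).

```python
import re
from collections import Counter

# Assumed message layout, based only on the run.out excerpt in this thread:
#   Cuda error: <context> failed in file '<file>' in line <n> : <message>.
ERROR_RE = re.compile(
    r"Cuda error: (?P<context>.+?) failed in file '(?P<file>[^']+)' "
    r"in line (?P<line>\d+) : (?P<message>[^.]+)\."
)

def summarize_cuda_errors(log_text):
    """Count CUDA error messages by (file, line, message).

    Whitespace is collapsed first, so messages that were wrapped
    across two lines are still matched.
    """
    flat = " ".join(log_text.split())
    counts = Counter()
    for m in ERROR_RE.finditer(flat):
        key = (m["file"], int(m["line"]), m["message"].strip())
        counts[key] += 1
    return counts

# Sample text copied from the tail of the output above.
sample = """\
Test TpA
Test BpA
Cuda error: Cuda_Pair: before updateNmax failed in file 'cuda_pair.cu'
in line 876 : unspecified launch failure.
Cuda error: Cuda_Pair: before updateNmax failed in file 'cuda_pair.cu'
in line 876 : unspecified launch failure.
Cuda error: Cuda_Pair: before updateNmax failed in file 'cuda_pair.cu'
in line 876 : unspecified launch failure.
"""

for (fname, line, msg), n in summarize_cuda_errors(sample).items():
    print(f"{n}x {fname}:{line}: {msg}")
```

With six MPI ranks, the count gives a quick hint of how many ranks reported the failure before the job died; here only three of them got a message out.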