Runtime error using USER-CUDA with pppm solver

Dear all,

  I've been having problems running simulations that use the pppm solver
with the USER-CUDA acceleration package on Linux. Running the example
in.phosphate.cuda, I get:

terminate called after throwing an instance of 'cufftResult_t'

  I googled it and found nothing. This kind of error appears in all my
scripts involving the pppm solver (lj/cut runs fine).

  I compiled the cuda library successfully in both double and single
precision, using cufft=1 (whenever I used cufft=0, I wasn't able to compile
the main program). I added just the packages manybody, kspace and user-cuda.
I also tried both fftw2 and fftw3 for the Fourier transforms. I am using the
current 6 Dec version of the LAMMPS code.

  My cuda toolkit is 5.0 and the driver is up-to-date.

  Any thoughts on this issue would be greatly appreciated.

  Best,
  Luis

Christian will likely have a suggestion.

Steve

I will look into it later today.

Christian

Any news on this issue?

Best,
Luis

I tested the 6 Dec version with CUDA 5 and the 310.19 drivers and could not see anything going wrong. Could you send me the following:
(i) output of the crashing run
(ii) output of "nvidia-smi -a"
(iii) output of "nvcc --version"

Thanks
Christian
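
For anyone who needs to gather the same information, the two commands requested above can be captured in one go. This is only a minimal sketch: the helper name collect_diagnostics is made up for illustration, and it assumes nothing beyond nvidia-smi and nvcc being on the PATH (it reports when they are not).

```python
import shutil
import subprocess

# The two commands requested in the message above; nothing else is assumed.
COMMANDS = {
    "nvidia-smi": ["nvidia-smi", "-a"],
    "nvcc": ["nvcc", "--version"],
}

def collect_diagnostics(commands=COMMANDS):
    """Run each command and return {name: output or a short failure note}."""
    report = {}
    for name, argv in commands.items():
        # Skip tools that are not installed instead of raising.
        if shutil.which(argv[0]) is None:
            report[name] = f"{argv[0]}: not found on PATH"
            continue
        try:
            result = subprocess.run(
                argv, capture_output=True, text=True, timeout=60
            )
        except (OSError, subprocess.TimeoutExpired) as exc:
            report[name] = f"{argv[0]}: failed to run ({exc})"
            continue
        # Keep stderr too, since some tools print warnings there.
        report[name] = result.stdout + result.stderr
    return report

if __name__ == "__main__":
    for name, text in collect_diagnostics().items():
        print(f"===== {name} =====")
        print(text)
```

The output of the crashing LAMMPS run itself still has to be captured separately, since it depends on how the job is launched.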

Hi, Christian

(i)

LAMMPS (6 Dec 2012)
# Using LAMMPS_CUDA
USER-CUDA mode is enabled (lammps.cpp:393)
# CUDA: Activate GPU
Reading data file ...
  orthogonal box = (33.0201 33.0201 33.0201) to (86.9799 86.9799 86.9799)
  1 by 1 by 1 MPI processor grid
  10950 atoms
  10950 velocities
Replicating atoms ...
  orthogonal box = (33.0201 33.0201 33.0201) to (194.899 194.899 194.899)
  1 by 1 by 1 MPI processor grid
  295650 atoms
PPPMCuda initialization ...
  G vector = 0.210111
  grid = 108 108 108
  stencil order = 5
  absolute RMS force accuracy = 0.000126177
  relative force accuracy = 8.76251e-06
  brick FFT buffer size/proc = 1520875 1259712 158700
rank 0 in job 113 ipe05_45050 caused collective abort of all ranks
  exit status of rank 0: killed by signal 9

(ii)

==============NVSMI LOG==============

Timestamp : Fri Dec 14 16:09:08 2012
Driver Version : 304.54

Attached GPUs : 1
GPU 0000:02:00.0
    Product Name : GeForce GTX 580
    Display Mode : N/A
    Persistence Mode : Disabled
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : N/A
    GPU UUID : GPU-4e627f89-c2bc-ea51-73a4-d94aa65f5af4
    VBIOS Version : 70.10.60.00.82
    Inforom Version
        Image Version : N/A
        OEM Object : N/A
        ECC Object : N/A
        Power Management Object : N/A
    GPU Operation Mode
        Current : N/A
        Pending : N/A
    PCI
        Bus : 0x02
        Device : 0x00
        Domain : 0x0000
        Device Id : 0x108010DE
        Bus Id : 0000:02:00.0
        Sub System Id : 0x15803842
        GPU Link Info
            PCIe Generation
                Max : N/A
                Current : N/A
            Link Width
                Max : N/A
                Current : N/A
    Fan Speed : 40 %
    Performance State : N/A
    Clocks Throttle Reasons : N/A
    Memory Usage
        Total : 1535 MB
        Used : 4 MB
        Free : 1531 MB
    Compute Mode : Default

  (the rest is N/A)

(iii)

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2012 NVIDIA Corporation
Built on Fri_Sep_21_17:28:58_PDT_2012
Cuda compilation tools, release 5.0, V0.2.1221

  On the screen, the message "terminate called after throwing an instance
of 'cufftResult_t'" appears after the identification of the GPU card.

  Thanks!
  Luis
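
As a rough sanity check on the numbers in (i) and (ii), the PPPM grid size can be compared with the memory on the card. The sketch below assumes one double-precision complex value (16 bytes) per grid point and counts a single grid-sized buffer only; it is not a statement of how much memory PPPMCuda or cuFFT actually allocate.

```python
# Numbers copied from the output above.
grid = (108, 108, 108)          # "grid = 108 108 108"
fft_buffer_entries = 1259712    # second value of "brick FFT buffer size/proc"
gpu_total_mb = 1535             # "Total : 1535 MB" from nvidia-smi

# Assumption: one double-precision complex value (2 x 8 bytes) per grid point.
BYTES_PER_COMPLEX_DOUBLE = 16

grid_points = grid[0] * grid[1] * grid[2]
buffer_mib = grid_points * BYTES_PER_COMPLEX_DOUBLE / 2**20

print(f"grid points: {grid_points}")
print(f"matches FFT buffer entry: {grid_points == fft_buffer_entries}")
print(f"one complex-double grid buffer: {buffer_mib:.1f} MiB of {gpu_total_mb} MB")
```

Under that assumption a single grid buffer is about 19 MiB, so the grid by itself is small next to the 1535 MB on the card. The atom data for 295650 atoms and any cuFFT work areas are not counted here, so this does not rule out a memory problem.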

Hm,
if it is not too much hassle, could you try updating to the latest drivers (310.19)? I saw at least one bug fixed in it (for another code unrelated to LAMMPS).
If that is not a good option for you, I can try downgrading again, though funnily enough the NVIDIA download site doesn't actually list your 304.54 driver. Maybe it was a buggy beta driver and they removed it again?

Christian
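
The driver comparison in this exchange can be made mechanical. The following is a small sketch, assuming the "Driver Version : X.Y" line format shown in the nvidia-smi log earlier in the thread; parse_driver_version is a hypothetical helper name, and 310.19 is simply the version that was reported to work.

```python
import re

def parse_driver_version(nvidia_smi_text):
    """Return the driver version as a tuple of ints, or None if absent."""
    match = re.search(
        r"Driver Version\s*:\s*([0-9]+(?:\.[0-9]+)*)", nvidia_smi_text
    )
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))

# Line copied from the nvidia-smi output in this thread.
reported = parse_driver_version("Driver Version : 304.54")
# Driver version with which the run was reported to work.
tested = (310, 19)

if reported < tested:
    print(reported, "is older than the tested driver", tested)
else:
    print(reported, "is not older than the tested driver", tested)
```

Comparing integer tuples avoids the pitfalls of comparing version strings or floats directly.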

Ok, I will ask my admin to update. Sorry, it wasn't clear to me that it
could be a driver issue. I'll let you know if the problem persists, ok?

Thank you!
Luis

Hi Christian,

  I updated the driver to 310.19, recompiled the cuda library and lammps
code and got the same error. Maybe it's important to tell you that I'm
using this set of libraries:

LD_PRELOAD='/usr/local/cuda/lib64/libcufft.so.5.0
/usr/local/cuda/lib64/libcudart.so.5.0 /usr/lib64/libstdc++.so.6'

$ ldd --version
ldd (GNU libc) 2.14.1

  Is there a chance that this problem is due to an incompatibility with
Linux libraries (glibc, libstdc++, etc.)?

  Best,
  Luis
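
One way to look into the library question is to check which copies of libcufft, libcudart and libstdc++ the LAMMPS binary actually resolves. The sketch below parses ldd output, assuming the usual glibc format "name => path (address)"; the function names are made up for illustration, and the path of the LAMMPS executable has to be supplied by the user.

```python
import re
import subprocess

# Libraries mentioned in the LD_PRELOAD setting above.
WATCHED = ("libcufft", "libcudart", "libstdc++")

def parse_ldd(text, watched=WATCHED):
    """Map each watched library name to the path ldd resolved it to."""
    resolved = {}
    pattern = re.compile(
        r"\s*(\S+)\s+=>\s+(.+?)(?:\s+\(0x[0-9a-fA-F]+\))?\s*$"
    )
    for line in text.splitlines():
        match = pattern.match(line)
        # Unresolved libraries show up with the value "not found".
        if match and match.group(1).startswith(watched):
            resolved[match.group(1)] = match.group(2)
    return resolved

def resolved_libraries(binary_path):
    """Run ldd on binary_path (user-supplied) and return the watched entries."""
    out = subprocess.run(
        ["ldd", binary_path], capture_output=True, text=True
    ).stdout
    return parse_ldd(out)
```

Calling resolved_libraries with the path to the LAMMPS executable, once with and once without LD_PRELOAD set in the environment, would show whether the preloaded libraries differ from the ones the binary was linked against.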

Hi everyone,

Any news on this matter?

Thanks and happy new year!

Luis

I was still not able to reproduce the error. I don't have a GTX 580 right now (only a C2075 on the Fermi side), so it might be a bug that only occurs on particular hardware. Would it be possible to get a temporary account on your machine, so that I can run some tests there directly?

Thanks
Christian