Cuda driver error 1, Cuda driver error 4

Dear all,

I ran into this problem when trying to run a system of nanoparticles grafted with polymers using the GPU package. It went well until it reached PPPM and began to initialize the GPU, at which point it suddenly stopped. No errors show up in the log file; it just stops (see below). But in the .o#### file many "Cuda driver errors" are printed (also see below). Is this related to a memory problem? When running on a single CPU processor, it uses just 42 MB.

  1. log or screen output

540 atoms in group nano
2400 atoms in group cathead
17280 atoms in group chain
60 rigid bodies with 540 atoms
480 rigid bodies with 2400 atoms
PPPM initialization …
G vector (1/distance)= 0.268778
grid = 36 36 36
stencil order = 5
estimated absolute RMS force accuracy = 0.0179153
estimated relative force accuracy = 5.39513e-05
using double precision FFTs
brick FFT buffer size/proc = 68921 46656 15129

impossible to say. please provide a complete input deck
so that we can re-run your calculation and track down
any issues independently.

also let us know which version of the code you are using
and provide the output of nvc_get_devices.
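for reference, nvc_get_devices is compiled along with the gpu library, so - assuming you built it in lib/gpu - you can run it from there:

  cd lib/gpu
  ./nvc_get_devices    # assumes the gpu library (and this tool) were built here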

axel.

Hi

just one comment: for GPU-related problems, please always provide the CUDA toolkit version number and the GPU driver version.

You can get the first, for example, from "nvcc --version" and the second with "nvidia-smi -a | grep Driver". While it might not be the case here, problems are sometimes caused by old/buggy drivers. At least, I have had this type of issue several times already.

Also, since this looks like a bug somewhere (LAMMPS should usually abort with a LAMMPS-related error message if there is something wrong with the input you provide), you should post the complete input files (script + data file if needed) so that we can rerun your simulation. Otherwise it is almost impossible to find out what is going wrong.

Cheers
Christian

-------- Original Message --------

Hi, Axel and Christian,

Please find all input files attached. The LAMMPS code I'm using is the latest version, 5Mar12. The CUDA toolkit version is "release 3.2, V0.2.1221", and the GPU driver version is 260.19.21. The output of nvc_get_devices is shown below:

Found 1 platform(s).
Using platform: NVIDIA Corporation NVIDIA CUDA Driver
CUDA Driver Version: 3.20

Device 0: “GeForce GTX 480”
Type of device: GPU
Compute capability: 2
Double precision support: Yes
Total amount of global memory: 1.49957 GB
Number of compute units/multiprocessors: 15
Number of cores: 480
Total amount of constant memory: 65536 bytes
Total amount of local/shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum group size (# of threads per block) 1024 x 1024 x 64
Maximum item sizes (# threads for each dim) 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.401 GHz
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Exclusive
Concurrent kernel execution: Yes
Device has ECC support enabled: No

Device 1: “GeForce GTX 480”
Type of device: GPU
Compute capability: 2
Double precision support: Yes
Total amount of global memory: 1.49969 GB
Number of compute units/multiprocessors: 15
Number of cores: 480
Total amount of constant memory: 65536 bytes
Total amount of local/shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum group size (# of threads per block) 1024 x 1024 x 64
Maximum item sizes (# threads for each dim) 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.401 GHz
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Exclusive
Concurrent kernel execution: Yes
Device has ECC support enabled: No

Thanks in advance
Bingbing

Archive.zip (615 KB)

Dear Bingbing

You are using a very old driver, which was known to be buggy even when it was current (at least I got into all kinds of trouble with the 260.xx releases). Your CUDA toolkit version is also rather old; I am not sure that Mike still tests with it to ensure everything works.

So my recommendation is to first update the CUDA toolkit to CUDA 4.1 and also update the driver to 290.xx (whatever the default driver on NVIDIA's homepage is right now). If you are not the admin yourself, you will need to ask your admin to do that.

But chances are good that it really is a bug in the driver or something similar, and that things will work with a more recent CUDA software stack.

Regards
Christian

-------- Original Message --------

Thanks, Christian.

We tried an earlier version of the LAMMPS code (29Jan12). There are no CUDA driver errors anymore, and it runs smoothly. Do any of the updates after 29Jan12 require a newer version of CUDA and drivers, or could there be bugs in the updates?

Best
Bingbing

There was no intentional change that would limit backwards compatibility. In lib/gpu/Makefile.whatever_you_use, can you please add

-DUCL_SYNC_DEBUG

to the CUDR_OPTS line, clean, remake, and relink. Then can you run and send me the error output?

This will force all of the calls to the GPU driver to block until completion so that we can get the actual line number for the error. Thanks.
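For illustration only, the edit would look roughly like the following, assuming lib/gpu/Makefile.linux and that CUDR_OPTS currently holds just optimization flags (keep whatever options are already on that line and simply append the define):

  # lib/gpu/Makefile.linux -- append -DUCL_SYNC_DEBUG to the existing options
  CUDR_OPTS = -O2 -DUCL_SYNC_DEBUG

Then rebuild the GPU library and relink LAMMPS, for example:

  cd lib/gpu
  make -f Makefile.linux clean
  make -f Makefile.linux
  cd ../../src
  make yourmachine    # "yourmachine" is a placeholder for your usual build target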

- Mike

Hi, Mike,

I recompiled the 5Mar12 version following your instructions. There seems to be no difference in the error output before and after adding -DUCL_SYNC_DEBUG. See the following error messages from the .oxxxxxx file.

Cuda driver error 1 in call at file ‘geryon/nvd_memory.h’ in line 466.
[andros-10:05109] MPI_ABORT invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode -1
Cuda driver error 4 in call at file ‘geryon/nvd_device.h’ in line 116.
[andros-10:05109] MPI_ABORT invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode -1
Cuda driver error 4 in call at file ‘geryon/nvd_timer.h’ in line 98.
[andros-10:05109] MPI_ABORT invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode -1
Cuda driver error 4 in file ‘geryon/nvd_timer.h’ in line 98.
[andros-10:05109] MPI_ABORT invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode -1
Cuda driver error 4 in call at file ‘geryon/nvd_timer.h’ in line 99.
[andros-10:05109] MPI_ABORT invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode -1
Cuda driver error 4 in file ‘geryon/nvd_timer.h’ in line 99.
[andros-10:05109] MPI_ABORT invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode -1
Cuda driver error 4 in call at file ‘geryon/nvd_timer.h’ in line 98.
[andros-10:05109] MPI_ABORT invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode -1
Cuda driver error 4 in file ‘geryon/nvd_timer.h’ in line 98.
[andros-10:05109] MPI_ABORT invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode -1
Cuda driver error 4 in call at file ‘geryon/nvd_timer.h’ in line 99.
[andros-10:05109] MPI_ABORT invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode -1
Cuda driver error 4 in file ‘geryon/nvd_timer.h’ in line 99.
[andros-10:05109] MPI_ABORT invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode -1
Cuda driver error 4 in call at file ‘geryon/nvd_timer.h’ in line 98.
[andros-10:05109] MPI_ABORT invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode -1
Cuda driver error 4 in file ‘geryon/nvd_timer.h’ in line 98.
[andros-10:05109] MPI_ABORT invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode -1
Cuda driver error 4 in call at file ‘geryon/nvd_timer.h’ in line 99.
[andros-10:05109] MPI_ABORT invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode -1
Cuda driver error 4 in file ‘geryon/nvd_timer.h’ in line 99.
[andros-10:05109] MPI_ABORT invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode -1
Cuda driver error 4 in call at file ‘geryon/nvd_timer.h’ in line 98.
[andros-10:05109] MPI_ABORT invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode -1
Cuda driver error 4 in file ‘geryon/nvd_timer.h’ in line 98.
[andros-10:05109] MPI_ABORT invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode -1
Cuda driver error 4 in call at file ‘geryon/nvd_timer.h’ in line 99.
[andros-10:05109] MPI_ABORT invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode -1
Cuda driver error 4 in file ‘geryon/nvd_timer.h’ in line 99.
[andros-10:05109] MPI_ABORT invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode -1
Cuda driver error 4 in call at file ‘geryon/nvd_timer.h’ in line 98.
[andros-10:05109] MPI_ABORT invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode -1
Cuda driver error 4 in file ‘geryon/nvd_timer.h’ in line 98.
[andros-10:05109] MPI_ABORT invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode -1
Cuda driver error 4 in call at file ‘geryon/nvd_timer.h’ in line 99.
[andros-10:05109] MPI_ABORT invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode -1
Cuda driver error 4 in file ‘geryon/nvd_timer.h’ in line 99.
[andros-10:05109] MPI_ABORT invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode -1
Cuda driver error 4 in call at file ‘geryon/nvd_timer.h’ in line 98.
[andros-10:05109] MPI_ABORT invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode -1
Cuda driver error 4 in file ‘geryon/nvd_timer.h’ in line 98.
[andros-10:05109] MPI_ABORT invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode -1
Cuda driver error 4 in call at file ‘geryon/nvd_timer.h’ in line 99.
[andros-10:05109] MPI_ABORT invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode -1
Cuda driver error 4 in file ‘geryon/nvd_timer.h’ in line 99.
[andros-10:05109] MPI_ABORT invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode -1

Hi Hong,

I tried the input script and data file you attached the other day and noticed that you are using package gpu force/neigh for pair_style hybrid colloid and lj/cut/coul/long/gpu with pppm/gpu. In this case, the GPU package will throw the error "Cannot use pair hybrid with GPU neighbor builds". Did you see that error message in the output?

I switched to “package gpu force” and the simulation seems to run fine. Can you try that to see if the errors persist?
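For example, only the mode argument of the package command needs to change; the GPU-ID/split arguments below are placeholders, so keep whatever values your script already uses:

  # before: GPU neighbor builds, which pair_style hybrid does not support
  # package gpu force/neigh 0 1 1.0
  # after: neighbor lists built on the CPU, forces still computed on the GPU
  package gpu force 0 1 1.0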

Cheers,
-Trung

Thanks, Trung. I also found this mistake soon after I sent the input files, but correcting the package command does not solve the "cuda driver errors". With "package gpu force", it runs smoothly using the 29Jan12 version of LAMMPS, but I still get the same cuda driver error 1 and cuda driver error 4 with the 5Mar12 version.

i am not saying that there is not a problem
with running the current GPU code and old
driver/toolkit combo, but there are a number
of things that i would like to remark.

why don't you update the nvidia driver?
the one you have is known to be problematic.
you can always use an older toolkit with a
newer driver.

you could install the newer 4.1 toolkit at the
same time and get additional speed as a bonus.
it doesn't make much sense to me that you insist
on having to use the very latest lammps software,
but use an old toolkit/driver.

it also looks as if your system is rather small and
you seem to be heavily oversubscribing the GPUs.
do you actually get a speed benefit, and in particular,
do you get a speed benefit from running pppm on
the GPU? for such a small system, it is possible
that running pppm on the CPU might be even faster.
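just as a sketch (the tolerance below is a placeholder, keep whatever your script uses now), moving pppm back to the cpu while leaving the pair style on the gpu is a one-line change to the kspace_style command:

  # kspace_style pppm/gpu 1.0e-4    <- pppm on the gpu
  kspace_style pppm 1.0e-4          # pppm on the cpu, pair style still on the gpu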

of course you can always stick with the older
version of LAMMPS until somebody has identified
the problem (which may well be the driver), unless
you can make a compelling argument that your
problem needs to be solved right now.

thanks,
      axel.

Hi Hong,

I tried lammps-5Mar12 with "package gpu force", and the simulation
runs fine on both Mac 10.6.8 and Linux 64-bit.

I agree with Axel's and Christian's suggestions on updating CUDA
toolkit/driver on your machine.

Cheers,
-Trung

I think that I know the cause of the problems with cuda 3.2. I would like to have this fixed, but it might not be until next week. I will send you something to test directly.

Thanks for working with us to correct this. – Mike

just FYI,

i compiled the current LAMMPS svn/git code using
the cuda 3.2 toolkit module on our cluster and can
run, e.g., the in.gpu.rhodo input without a problem.

mind you, this is using the newer driver.

cheers,
    axel.