Problem with GPU library

Dear all.

I keep fighting with my GTX 680 and LAMMPS. I changed my Linux distribution to Scientific Linux 6.3 and now I have both libraries compiled (GPU & USER-CUDA). My first problem concerns the GPU library. I am able to run the benchmark problems, but when I enlarge the simulation box, LAMMPS gets stuck after the GPU initialization phase. Here is some info about the context of the problem:

File: lammps/examples/gpu/in.gpu.rhodo

LAMMPS invocation: mpirun -np 8 /home/ekhi/bin/lmp_openmpi.gea.gpu < in.gpu.rhodo > out.gpu.rhodo.8

Here is the content of out.gpu.rhodo.8:

LAMMPS (13 Sep 2012)

Scanning data file ...
  4 = max bonds/atom
  18 = max angles/atom
  40 = max dihedrals/atom
  4 = max impropers/atom
Reading data file ...
  orthogonal box = (-27.5 -38.5 -36.2676) to (27.5 38.5 36.2645)
  2 by 2 by 2 MPI processor grid
  32000 atoms
  32000 velocities
  27723 bonds
  40467 angles
  56829 dihedrals
  1034 impropers
Finding 1-2 1-3 1-4 neighbors ...
  4 = max # of 1-2 neighbors
  12 = max # of 1-3 neighbors
  24 = max # of 1-4 neighbors
  26 = max # of special neighbors
Replicating atoms ...
  orthogonal box = (-27.5 -38.5 -36.2676) to (82.5 115.5 108.797)
  2 by 2 by 2 MPI processor grid
  256000 atoms
  221784 bonds
  323736 angles
  454632 dihedrals
  8272 impropers
Finding 1-2 1-3 1-4 neighbors ...
  4 = max # of 1-2 neighbors
  12 = max # of 1-3 neighbors
  24 = max # of 1-4 neighbors
  26 = max # of special neighbors
Finding SHAKE clusters ...
  12936 = # of size 2 clusters
  29064 = # of size 3 clusters
  5976 = # of size 4 clusters
  33864 = # of frozen angles
PPPM initialization ...
  G vector (1/distance) = 0.245959
  grid = 48 64 60
  stencil order = 5
  estimated absolute RMS force accuracy = 0.0410392
  estimated relative force accuracy = 0.000123588
  using double precision FFTs
  brick FFT buffer size/proc = 37555 24576 11655

--------------------------------------------------------------------------
- Using GPGPU acceleration for pppm:
-  with 8 proc(s) per device.
--------------------------------------------------------------------------
GPU 0: GeForce GTX 680, 1536 cores, 1.6/2 GB, 0.71 GHZ (Double Precision)
--------------------------------------------------------------------------

Initializing GPU and compiling on process 0...Done.
Initializing GPU 0 on core 0...Done.
Initializing GPU 0 on core 1...Done.
Initializing GPU 0 on core 2...Done.
Initializing GPU 0 on core 3...Done.
Initializing GPU 0 on core 4...Done.
Initializing GPU 0 on core 5...Done.
Initializing GPU 0 on core 6...Done.
Initializing GPU 0 on core 7...Done.
--------------------------------------------------------------------------
- Using GPGPU acceleration for lj/charmm/coul/long:
-  with 8 proc(s) per device.
--------------------------------------------------------------------------
GPU 0: GeForce GTX 680, 1536 cores, 1.4/2 GB, 0.71 GHZ (Double Precision)
--------------------------------------------------------------------------

Initializing GPU and compiling on process 0...Done.
Initializing GPU 0 on core 0...Done.
Initializing GPU 0 on core 1...Done.
Initializing GPU 0 on core 2...Done.
Initializing GPU 0 on core 3...Done.
Initializing GPU 0 on core 4...Done.
Initializing GPU 0 on core 5...Done.
Initializing GPU 0 on core 6...Done.
Initializing GPU 0 on core 7...Done.

Setting up run ...

And after that, the run gets stuck while still consuming CPU resources.
This error can be reproduced with the bench/GPU cases as well, but when I make the simulation box small enough, the cases run fine…

Any help?

Thanks a lot.

> Dear all.
>
> I keep fighting with my GTX 680 and LAMMPS. I changed my Linux
> distribution to Scientific Linux 6.3 and now I have both libraries
> compiled (GPU & USER-CUDA). My first problem concerns the GPU library.
> I am able to run the benchmark problems, but when I enlarge the simulation

you are likely running out of memory on the GPU.
attaching 8 MPI tasks to a single GPU is asking for a lot.
also, in your case it may be more efficient (and less memory
demanding) to use only the pair style on the GPU and
run pppm on the CPU. with the GPU package, this can
run concurrently, which may give you an additional advantage.
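
as a minimal sketch (assuming the package/suffix syntax of the 13 Sep 2012 GPU package; keywords may differ in other versions, and the cutoffs/accuracy below are just the rhodo benchmark values), the relevant input-script lines could look like:

# use GPU 0 for all ranks; split = -1 lets LAMMPS balance the
# pair work between CPU and GPU dynamically
package gpu force/neigh 0 0 -1

# run only the pair style on the GPU ...
pair_style      lj/charmm/coul/long/gpu 8.0 10.0

# ... and keep long-range electrostatics on the CPU (no /gpu suffix),
# so the GPU pair computation and the CPU pppm can run concurrently
kspace_style    pppm 1.0e-4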

if you look through the mailing list archives, you should find
some e-mail from me discussing/testing the various parallelization
and acceleration options and their relative performance with a single
GTX 580 on a 4-way 12-core opteron machine.

axel.

> you are likely running out of memory on the GPU.
> attaching 8 MPI tasks to a single GPU is asking for a lot.

Actually, the more MPI tasks you assign to a single GPU, the
smaller the problem (per task) that the GPU sees. I've run fine
with the GPU package using 12 CPUs and 1 GPU. Whether
that is optimal is another question, but you shouldn't be running
out of memory on the GPU until you get up to a few 100,000 atoms
(though it is probably smaller for the rhodo case since it has
big neighbor lists). And that is 100K per MPI task I believe.
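
For reference, a run like that only changes the MPI rank count in the launch command from the original post (a hypothetical variant, reusing the poster's binary and input; the GPU assignment itself presumably comes from the package gpu settings in the input script):

mpirun -np 12 /home/ekhi/bin/lmp_openmpi.gea.gpu < in.gpu.rhodo > out.gpu.rhodo.12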

Steve

>> you are likely running out of memory on the GPU.
>> attaching 8 MPI tasks to a single GPU is asking for a lot.
>
> Actually, the more MPI tasks you assign to a single GPU, the
> smaller the problem (per task) that the GPU sees. I've run fine
> with the GPU package using 12 CPUs and 1 GPU. Whether

your GPU has more memory than the GTX680, has it not?

> that is optimal is another question, but you shouldn't be running
> out of memory on the GPU until you get up to a few 100,000 atoms
> (though it is probably smaller for the rhodo case since it has
> big neighbor lists). And that is 100K per MPI task I believe.

but there is indeed one other possibility that we've not mentioned.
the memory on the GTX 680 may be less than perfect. i would
suggest running a memory test program on the GPU, e.g.:

http://sourceforge.net/projects/cudagpumemtest/

cheers,
     axel.

Which version of the CUDA driver are you using (output from nvc_get_devices in lib/gpu)?
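
For what it's worth, that tool lives alongside the GPU library sources, so (assuming the library build also produced the nvc_get_devices binary there, as the standard lib/gpu makefiles do) it can be run from the LAMMPS tree roughly like this:

cd lammps/lib/gpu
./nvc_get_devices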

Thanks. - Mike

One problem with many MPI ranks per GPU is that each process on a GPU incurs some constant memory overhead. While I think that was reduced recently (it tended to be in the range of 100-300 MB per process), it can eat up a lot of memory. That said, if you run out of memory with the standard rhodopsin simulation using USER-CUDA, I'd follow Axel's advice and check the memory. There might be something wrong there.

Christian

-------- Original Message --------

To address these issues for the mailing list archive:

> (it tended to be in the range of 100-300 MB per process)

This depends on the hardware, the ECC configuration, and for NVIDIA, whether or not the proxy/Hyper-Q is being used. In the screen output the user attached, the free/total memory for the card is shown at p3m initialization: 1.6/2.0 GB. The second initialization shows 1.4/2.0 GB. This means that the memory allocations for p3m, including any overhead for context(s), are less than 0.2 GB.

The maximum memory used per process is output to the screen when using GPU acceleration. Running with a smaller size can be used to get an idea of how much memory is being used if there is a memory allocation error. This can be verified in an alternative way by running nvidia-smi in another window while LAMMPS is running. For the replicated rhodo benchmark (256K atoms), this is about 1.37 GB for 4 processes on my hardware and 1.21 GB for 1 process.
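
One simple way to do that monitoring (a generic suggestion, nothing LAMMPS-specific) is to leave a periodically refreshing nvidia-smi running in a second terminal while the job runs:

# refresh the GPU status, including memory usage, every second
watch -n 1 nvidia-smi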

Axel's suggestion of running p3m concurrently on the CPU is good and will reduce memory usage.

The main issue here, if this is indeed an out-of-memory error, is why LAMMPS did not exit with an appropriate error message.

- Mike

Hi.

Thank you for your answers. They shed some light on my issue.

First of all, my nvc_get_devices output says I'm using the following:

Found 1 platform(s).
Using platform: NVIDIA Corporation NVIDIA CUDA Driver
CUDA Driver Version: 4.20

Device 0: "GeForce GTX 680"
   Type of device: GPU
   Compute capability: 3
   Double precision support: Yes
   Total amount of global memory: 1.99933 GB
  [.............]

I tried the cudagpumemtest suggested by Axel and got an error from it. The error says:

>>$ ./cuda_memtest
[09/17/2012 11:37:29][Gea][0]:Running cuda memtest, version 1.2.2
[09/17/2012 11:37:29][Gea][0]:warning: Getting serial number failed
[09/17/2012 11:37:29][Gea][0]:NVRM version: NVIDIA UNIX x86_64 Kernel Module 295.41 Fri Apr 6 23:18:58 PDT 2012
[09/17/2012 11:37:29][Gea][0]:num_gpus=1
[09/17/2012 11:37:29][Gea][0]:Device name=GeForce GTX 680, global memory size=2146762752
[09/17/2012 11:37:29][Gea][0]:major=3, minor=0
[09/17/2012 11:37:29][Gea][0]:Attached to device 0 successfully.
[09/17/2012 11:37:29][Gea][0]:Allocated 1890 MB
[09/17/2012 11:37:29][Gea][0]:Test0 [Walking 1 bit]
[09/17/2012 11:37:30][Gea][0]:ERROR: CUDA error: the launch timed out and was terminated, line 589, file tests.cu
[09/17/2012 11:37:30][Gea][0]:ERROR: CUDA error: the launch timed out and was terminated, line 589, file tests.cu

Therefore, there is something wrong with my card. Is it a hardware issue, or could it be a problem with the drivers/toolkit/software?

How can I reload the NVIDIA drivers without rebooting my entire machine?

Ah, and for our usual work we use EAM potentials, so the hint about running the pair style on the GPU and pppm on the CPU was great, but not useful for our work.

Thank you very much and sorry for the inconvenience