GPU package error when using kspace_style pppm/gpu

Hello all,

I was able to compile LAMMPS with the GPU package, and I have already run some simulations without any errors, even with pppm/gpu.

Recently, one of my simulations produced a number of CUDA driver errors, but when I disable only pppm/gpu (replacing pppm with ewald) it runs just fine. (Note that I use the "suffix gpu" command in the input file.)

The simulation consists of an isolated, modified cellulose fragment (an octamer) with 579 atoms. I am just testing the setup; more details are in the attached files.

Along with the Makefiles I used to build LAMMPS itself and the GPU library, I am attaching an input file and a data file for the simulations.

I usually run the simulations with "mpirun -np 4" and specify the package command as "package gpu force 0 3 -1".
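
For reference, a minimal sketch of how these pieces fit together (the
executable name lmp_fermi is my assumption, not taken from the
attached Makefiles):

  # shell: launch with 4 MPI ranks
  mpirun -np 4 ./lmp_fermi -in isolada_nve.inp

  # in the input file: GPUs 0-3, one rank per GPU,
  # dynamic CPU/GPU load balancing (split = -1)
  package gpu force 0 3 -1
  suffix gpu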

I'm using the 1/02/2014 version of LAMMPS; the GPU package was compiled with DOUBLE_DOUBLE precision. Hardware: 4 Tesla C2050 GPUs; CUDA version 5.5; NVIDIA driver version 310.49.

Here is the output showing the error:

- Using GPGPU acceleration for lj/charmm/coul/long:
-  with 1 proc(s) per device.

Makefile.fermi (2.62 KB)

Makefile.linux (1.25 KB)

cellulose.lammps_data (107 KB)

isolada_nve.inp (1.01 KB)

> Hello all,
>
> I was able to compile LAMMPS with the GPU package, and I have already
> run some simulations without any errors, even with pppm/gpu.
>
> Recently, one of my simulations produced a number of CUDA driver
> errors, but when I disable only pppm/gpu (replacing pppm with ewald)
> it runs just fine. (Note that I use the "suffix gpu" command in the
> input file.)
>
> The simulation consists of an isolated, modified cellulose fragment
> (an octamer) with 579 atoms. I am just testing the setup; more
> details are in the attached files.

This is an extremely small system, for which GPU acceleration is meaningless.

> Along with the Makefiles I used to build LAMMPS itself and the GPU
> library, I am attaching an input file and a data file for the
> simulations.
>
> I usually run the simulations with "mpirun -np 4" and specify the
> package command as "package gpu force 0 3 -1".

If you have more CPU cores, then you should use 2 or 3 CPU cores per
GPU to achieve better GPU utilization. Of course, this also only makes
sense for a reasonably large system.
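
For example (a hypothetical launch line; the rank count and the
executable name are assumptions):

  # shell: 8 MPI ranks sharing the 4 GPUs -> 2 ranks per GPU
  mpirun -np 8 ./lmp_fermi -in isolada_nve.inp

  # the package command can stay the same; the GPU package maps
  # the 8 ranks onto GPUs 0-3 automatically
  package gpu force 0 3 -1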

> I'm using the 1/02/2014 version of LAMMPS; the GPU package was
> compiled with DOUBLE_DOUBLE precision. Hardware: 4 Tesla C2050 GPUs;
> CUDA version 5.5; NVIDIA driver version 310.49.

> Here is the output showing the error:
>
> - Using GPGPU acceleration for lj/charmm/coul/long:
> -  with 1 proc(s) per device.
> --------------------------------------------------------------------------
> GPU 0: Tesla C2050, 448 cores, 2.2/2.6 GB, 1.1 GHZ (Double Precision)
> GPU 1: Tesla C2050, 448 cores, 2.2/2.6 GB, 1.1 GHZ (Double Precision)
> GPU 2: Tesla C2050, 448 cores, 2.2/2.6 GB, 1.1 GHZ (Double Precision)
> GPU 3: Tesla C2050, 448 cores, 2.2/2.6 GB, 1.1 GHZ (Double Precision)
> --------------------------------------------------------------------------
>
> Initializing GPU and compiling on process 0...Done.
> Initializing GPUs 0-3 on core 0...Done.
>
> Setting up run ...
> Cuda driver error 1 in call at file 'geryon/nvd_kernel.h' in line 364.
>
> [...]


CUDA error messages are often not very helpful; usually, all you get
is "it worked" or "it didn't work". For a modular GPU interface with a
CUDA/OpenCL abstraction layer like the GPU package, it is difficult to
provide more specific hints about the error location without a lot of
additional programming effort. In any case, my guess is that with such
a small system you may run into a problem because there may not be any
atoms on the MPI rank that a GPU is attached to, and that can cause
all kinds of problems that are not easily seen, since people rarely
run tests with such "unreasonable" input decks.

> I'm wondering if this is a known issue and if there are already
> solutions for it.

There are multiple things that you should do:
- Run some GPU stress tests (there is a gpumemtest on SourceForge,
IIRC) and also check the GPU error status to make sure that all of
your GPUs are operating without failure.
- Run without GPU acceleration for pppm. Specifically for double
precision, it should be faster to run pppm on the CPU concurrently
with the pair style on the GPU rather than one after the other (see
the input fragment after this list).
- Run with a (much) larger problem.
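
A minimal sketch of the second point (the pair-style cutoffs below are
assumptions; GPU health can be checked beforehand with, e.g.,
"nvidia-smi -q", which reports ECC error counts):

  # run the pair style on the GPU but keep pppm on the CPU:
  # name the /gpu pair style explicitly instead of using "suffix gpu"
  package gpu force 0 3 -1
  pair_style lj/charmm/coul/long/gpu 8.0 10.0
  kspace_style pppm 1.0e-4    # no /gpu suffix -> pppm runs on the CPU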

axel.

I wanted to add to Axel's suggestions:

- run with 1 MPI task and 1 GPU, without dynamic balancing ("package gpu force 0 0 1" or "package gpu force/neigh 0 0 1")
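
As a concrete sketch (assuming the input file from the first message;
the executable name is an assumption):

  # shell: a single MPI rank
  mpirun -np 1 ./lmp_fermi -in isolada_nve.inp

  # in the input file: GPU 0 only, fixed split of 1.0, i.e. all of the
  # pair work goes to the GPU and no dynamic CPU/GPU balancing is done
  package gpu force 0 0 1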

-Trung

> I wanted to add to Axel's suggestions:
>
> - run with 1 MPI task and 1 GPU, without dynamic balancing ("package
> gpu force 0 0 1" or "package gpu force/neigh 0 0 1")

Good point. This actually makes me wonder whether there are still
scenarios where distributing work between the CPU and GPU this way is
worth doing. With current GPU hardware, you can oversubscribe a GPU
quite efficiently, and I would expect that this would result in better
overall performance.

Or am I missing something?

If not, it might be worth considering removing that code path to
simplify the code base for easier maintenance.

axel.

Although I haven't used dynamic balancing in practice, and didn't see much difference between a full split and a dynamic split in my LJ benchmark, I think keeping dynamic balancing gives users more options to try; we will never know beforehand all of the possible use cases and the CPU and GPU hardware users have access to. One example where dynamic balancing might help is when the number of atoms per MPI task is large enough that the time for host-device data transfers plus kernel execution on the GPU for a fraction of the local atoms equals the time for computing the forces of the remaining local atoms on the CPU; the optimal fraction may also vary over the course of the simulation.
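
To make that balance condition concrete, here is a rough model with
assumed linear per-atom costs (c_x for host-device transfer, c_g for
the GPU kernel, c_c for the CPU force computation, N local atoms, and
f the fraction of atoms sent to the GPU):

  (c_x + c_g) * f * N = c_c * (1 - f) * N
  =>  f* = c_c / (c_c + c_x + c_g)

Since c_c, c_g, and c_x change with system density, neighbor list
sizes, and so on, f* can drift over a run, which is what the dynamic
split tries to track.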

Anyway, I suppose such a change (removing dynamic balancing) to the GPU package in the main repo should be approved or done by its author, Mike Brown. Of course, you can always make that change within your own branch.

Best,

-Trung