Graphics Card crashes when using GPU package

Dear all,

I am currently using the stable_31Mar2017 version of LAMMPS on an Ubuntu 16.04 LTS Linux system.

I run MD simulations of roughly 130000 atoms in a rectangular box of dimensions 200 x 50 x 16, measured in Lennard-Jones units.

The input script as well as the initial configuration file of the system are attached to this email. The script is basically supposed to equilibrate the system at a prescribed temperature. Initially, all particles are placed on an fcc lattice. We have three types of particles, but this does not play a role at this point.
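In case it helps without opening the attachments, a stripped-down sketch of this kind of setup (illustrative only, not the actual attached in.script; the density, temperature, and run length below are placeholders) looks roughly like:

# illustrative equilibration sketch; placeholder values, not the attached in.script
units           lj
atom_style      atomic
boundary        p p p

lattice         fcc 0.8442
region          simbox block 0 200 0 50 0 16 units box
create_box      3 simbox
create_atoms    1 box
mass            * 1.0

pair_style      lj/cut 2.5
pair_coeff      * * 1.0 1.0 2.5

velocity        all create 1.0 12345
fix             1 all nvt temp 1.0 1.0 0.5
thermo          100
run             50000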

To accelerate the simulation, I am using the GPU package of LAMMPS, running the calculations on an NVIDIA Tesla K20m card.

I launch the run with: ./lmp_gpu -sf gpu -pk gpu 1 -in in.script

In the beginning, everything works out great and the system evolves without any issues. However, after roughly 40000 steps, the simulation abruptly stops and I get the error message:

Cuda driver error 702 in call at file ‘geryon/nvd_timer.h’ in line 76.
MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD with errorcode -1

After that, I cannot use my Tesla GPU card unless I restart the computer. I cannot even compile a simple CUDA sample like vectorAdd.

Do you have any idea what causes this error? Maybe some memory issues?

FYI: I have successfully run simulations of even bigger systems using the same GPUs but on a different computer cluster. The simulation runs perfectly when I do not use the GPU package.

Thank you very much!

Best,

Sven

in.script (1.69 KB)

initJanus.dat.tar.gz (999 KB)

    Do you have any idea what causes this error? Maybe some memory issues?

Or a temperature issue. You could try running the CUDA GPU memtest
tool: https://sourceforge.net/projects/cudagpumemtest/files/
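(From what I recall of that tool, it builds into a cuda_memtest binary that you run directly against the card; its README documents the available stress-test options.)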

...and perhaps also monitor your GPU while running LAMMPS. Your input
is missing the definition for ${rc}; I assumed 2.5 and then ran in
parallel on my desktop, which has an original GeForce GTX Titan (i.e.
very similar to your Tesla), with: mpirun -np 8 ./lmp_gpu -in in.script -sf gpu
and it completes your input deck fine (see the log excerpt below).
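In other words, something like this near the top of your input (my guess; 2.5 is just the usual LJ cutoff, substitute whatever value you actually intended):

# define the cutoff variable that the rest of the script references as ${rc}
variable        rc equal 2.5

For the monitoring, you can poll temperature and utilization from a second terminal while LAMMPS is running with, e.g., nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,memory.used --format=csv -l 5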

[...]

   49900   -6.2018016    0.8356875    0.57795905
   50000   -6.203018     0.8356875    0.57885699
Loop time of 365.264 on 8 procs for 50000 steps with 133710 atoms

Performance: 59135.317 tau/day, 136.887 timesteps/s
94.0% CPU use with 8 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 74.32      | 129.58     | 202.52     | 357.7 | 35.48
Bond    | 0.01305    | 0.018413   | 0.024289   |   2.8 |  0.01
Neigh   | 1.5542     | 2.6655     | 4.3788     |  50.9 |  0.73
Comm    | 48.033     | 69.266     | 87.38      | 160.9 | 18.96
Output  | 0.080683   | 0.086414   | 0.087259   |   0.7 |  0.02
Modify  | 72.05      | 128.32     | 167.73     | 274.5 | 35.13
Other   |            | 35.32      |            |       |  9.67

Nlocal: 16713.8 ave 17130 max 15568 min
Histogram: 1 1 0 0 0 0 0 0 1 5
Nghost: 5933 ave 6056 max 5702 min
Histogram: 1 1 0 0 0 1 0 1 1 3
Neighs: 0 ave 0 max 0 min
Histogram: 8 0 0 0 0 0 0 0 0 0

Total # of neighbors = 0
Ave neighs/atom = 0
Ave special neighs/atom = 0
Neighbor list builds = 6138
Dangerous builds = 0

---------------------------------------------------------------------
      Device Time Info (average):
---------------------------------------------------------------------
Neighbor (CPU): 7.2035 s.
Device Overhead: 28.4566 s.
Average split: 1.0000.
Threads / atom: 4.
Max Mem / Proc: 49.34 MB.
CPU Driver_Time: 27.1397 s.
CPU Idle_Time: 81.7215 s.
---------------------------------------------------------------------

For your reference, here is the output from nvidia-smi during the run:

$ nvidia-smi
Thu May 11 17:32:54 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.51                 Driver Version: 375.51                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TITAN   Off  | 0000:03:00.0      On |                  N/A |
| 52%   78C    P0   123W / 250W |   1241MiB /  6081MiB |     62%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1940    G   /usr/libexec/Xorg                              241MiB |
|    0     16515    G   ...el-token=9CDE669DD6C764834F0F18AC70B8DA70   233MiB |
|    0     27340    C   .../akohlmey/compile/lammps-icms/src/lmp_omp    94MiB |
|    0     27341    C   .../akohlmey/compile/lammps-icms/src/lmp_omp    94MiB |
|    0     27342    C   .../akohlmey/compile/lammps-icms/src/lmp_omp    93MiB |
|    0     27343    C   .../akohlmey/compile/lammps-icms/src/lmp_omp    94MiB |
|    0     27344    C   .../akohlmey/compile/lammps-icms/src/lmp_omp    93MiB |
|    0     27345    C   .../akohlmey/compile/lammps-icms/src/lmp_omp    98MiB |
|    0     27346    C   .../akohlmey/compile/lammps-icms/src/lmp_omp    94MiB |
|    0     27347    C   .../akohlmey/compile/lammps-icms/src/lmp_omp    94MiB |
+-----------------------------------------------------------------------------+

So checking whether your GPU is functioning 100% correctly seems like
the right thing to do at this point.

axel.

You were absolutely right – the GPU is almost "burning".

I monitored the GPU temperature during the GPU memtest as well as
during the LAMMPS simulation. Both runs stop as soon as the GPU
reaches a critical temperature of 95 (degrees Celsius, I guess), which
is pretty warm!
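(For reference, nvidia-smi -q -d TEMPERATURE reports the card's slowdown and shutdown temperature thresholds, which should confirm whether 95 °C is exactly the point where the card protects itself.)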

In contrast to our other K20m cards, this particular one is not
integrated into our cluster's cooling system; we just put it into an
ordinary computer tower. I guess that causes the temperature issues.

Thanks very much for your help.

Best,

Sven