a problem with GPU acceleration

Dear LAMMPS developers,
Recently our GPU server has been restarting spontaneously and frequently. I'd like to share two pictures to illustrate the problem (see attachments). One shows a snapshot of nvidia-smi just before the system breaks down while running an MD simulation of 1.1e5 atoms on 8 GPUs and 16 CPU threads; the other shows a snapshot of a steady run of LBM (lattice Boltzmann method) jobs on the same machine, which never breaks down. Notably, the GPU utilization of the MD jobs is lower than that of the LBM jobs, yet the system only breaks down with MD.
Do you have any suggestions as to why this happens? Is there some bottleneck in LAMMPS GPU acceleration that I have overlooked?
Some other points: when I run the same MD case on 12 CPU threads and 6 GPUs, the system never restarts. I use the GPU package for acceleration. The cutoff is 1.5 nm, chosen because of the heterogeneity of the system.
Some hardware info: 64 GB RAM, 4.6 TB of free disk space, Intel Xeon E5-2650 (2.60 GHz) with 32 threads, 4 Tesla K80s (8 GPUs in total), 1400 W power supply.
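In case it helps, here is a minimal logging sketch for capturing what the GPUs were doing right before a reboot (the query fields are standard nvidia-smi properties; the file name and the 1-second interval are arbitrary choices). Every sample is synced to disk so the last readings survive the crash:

# log_gpu.py -- append one nvidia-smi sample per second to a CSV file,
# syncing after every write so the last readings survive a sudden reboot.
# Stop with Ctrl-C.
import os
import subprocess
import time

# standard nvidia-smi query properties; power.draw is the interesting one
# if a weak power supply is suspected
FIELDS = "timestamp,index,utilization.gpu,power.draw,temperature.gpu,memory.used"

with open("gpu_log.csv", "a") as log:
    while True:
        sample = subprocess.run(
            ["nvidia-smi", "--query-gpu=" + FIELDS, "--format=csv,noheader"],
            capture_output=True, text=True).stdout
        log.write(sample)
        log.flush()
        os.fsync(log.fileno())  # force the sample onto the physical disk
        time.sleep(1)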
Thank you very much!
Best wishes,
Huanhuan Tian

LBM_no_breakdown.png

MD_before_breakdown.png

when a machine reboots spontaneously, it is in 99.9% of cases a hardware
issue. however, it is difficult to say *which* hardware issue. my
suspicion of the power supply was an educated guess, since this is often
a problem with many GPUs in the same box. another possibility is a
failure/overload of the PCIe bus, which may be triggered by code (like
the GPU package in LAMMPS) that has to frequently submit kernels and
transfer data. it is impossible to diagnose this remotely, and even with
physical access to the machine, it can be difficult to determine the real
cause. computers are very complex hardware, and many pieces need to work
well together.
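If one wants to probe the PCIe-overload hypothesis directly, a minimal sketch (assuming the installed driver's nvidia-smi supports the "dmon" subcommand; the log file name is arbitrary) is to record per-GPU PCIe throughput while the run is active:

# pcie_watch.py -- rough sketch for watching per-GPU PCIe traffic during a
# run, assuming nvidia-smi supports "dmon". The "-s t" selector requests
# the PCIe throughput columns (rxpci/txpci, in MB/s). Output is appended
# to a file and synced so it survives a crash.
import os
import subprocess

with open("pcie_log.txt", "a") as log:
    proc = subprocess.Popen(["nvidia-smi", "dmon", "-s", "t"],
                            stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:
        log.write(line)
        log.flush()
        os.fsync(log.fileno())  # force the sample onto the physical disk

The rxpci/txpci columns give per-GPU host-to-device and device-to-host traffic in MB/s, which can be compared against the nominal capacity of the shared link.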

axel.

Dear Axel,
I really appreciate your help. We now believe the spontaneous restarting is probably due to overload of the PCIe bus. There are two tests:

  1. 8 cases running simultaneously, with 1 CPU thread and 1 GPU per case: the system does not break down.
  2. 8 cases running simultaneously, with 2 CPU threads and 1 GPU per case: the system breaks down (when I add the 8th case).
    According to the manual (page 69): “When using the GPU package with multiple CPUs assigned to one GPU, its performance depends to some extent on high bandwidth between the CPUs and the GPU”; meanwhile our chipset is Intel C602 + ICH10R with 8 GT/s bandwidth, which may not be sufficient for 4 K80s (8 GPUs in total). Thus overload of the PCIe bus may be causing our problem; a rough estimate of the traffic involved is sketched below.
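For a rough sense of scale (the step rate below is an assumed, illustrative number, not a measurement): the GPU package copies positions to the card and forces back every timestep, so the per-process traffic can be estimated as follows.

# Back-of-envelope PCIe traffic estimate for one GPU-package process.
# The step rate is an ASSUMED value for illustration; read the real one
# from the loop-time summary of your LAMMPS log.
atoms = 1.1e5           # system size quoted above
bytes_per_atom = 3 * 8  # three double-precision coordinates (or forces)
steps_per_sec = 50      # ASSUMPTION, not a measurement

# positions down + forces back, once per timestep
traffic_bytes = atoms * bytes_per_atom * 2 * steps_per_sec
print(f"~{traffic_bytes / 1e6:.0f} MB/s per process")  # ~264 MB/s here

Multiplied by the number of processes sharing a board, and with neighbor-list and per-atom data on top, the aggregate demand grows quickly, which would be consistent with the 2-threads-per-GPU test failing while the 1-thread test survives.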
    Now my question is: how can we verify this speculation? I haven't found a way to output bandwidth occupation from LAMMPS. Do you have any suggestions?
    The attachment is a record of the screen output of three typical tests, together with my notes.
    Many thanks and best wishes,
    Huanhuan Tian

the record of output on screen (7.46 KB)

i repeat from my last e-mail:

it is impossible to diagnose this remotely, and even with physical access
to the machine, it can be difficult to determine the real cause.
computers are very complex hardware, and many pieces need to work well
together.

axel.

Dear Axel,
Actually, I just wanted to confirm whether LAMMPS could output some information that would help judge PCIe bandwidth occupation. It seems the answer is no. I still very much appreciate your help!
Best wishes,
Huanhuan Tian