ELASTIC_T problem (maybe) on GPU

Dear all,

I was trying to adapt the LAMMPS example found in the ELASTIC_T directory and apply it to a different system (LJ potential, PPPM electrostatics; one graphene sheet and nylon chains).

I may have encountered a bug in the GPU package (or maybe it is an effect of the lower precision of the GPU computations, I don't know), but with the GPU package turned on all simulations except one crashed, each time at a different step of the simulation.

I tried the ELASTIC_T procedure in three variants:

  • adiabatic conditions
  • fix NVE+Langevin
  • fix NVT

The three systems were each run on the following configurations (typical launch commands are sketched just after this list):

  • plain CPU (MPI + pkg OMP with 1 thread),
  • MPI + pkg GPU on a P100
  • MPI + pkg GPU on a RTX-2080 super
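
The exact commands are in the attached archive; roughly, assuming the usual -sf/-pk command-line switches and placeholder names for the binary (lmp), rank count, and input file, the configurations were launched along these lines:

  # plain CPU: MPI ranks + the OMP package with 1 thread per rank
  mpirun -np 8 lmp -sf omp -pk omp 1 -in in.elastic

  # MPI + the GPU package on a single GPU (P100 or RTX 2080 Super)
  mpirun -np 8 lmp -sf gpu -pk gpu 1 -in in.elastic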

I’m attaching an archive with the results of every simulation and the data/input files needed to reproduce the issue.
The archive can also be downloaded here ( https://www.dropbox.com/s/9y075qlhecoxv2z/not_working_gpu_ELASTIC_T.tar.xz?dl=0 ).
All simulations in the directory were done with OpenMPI 4.0.3, CUDA 10.2.89, and LAMMPS 18 Feb 2020. The OS was CentOS 7 for the CPU / P100 runs and Ubuntu 18.04 for the RTX 2080 Super runs.

I performed many more runs, with different versions of OpenMPI or MPICH, but the results were the same.

All runs on the CPU finished fine and the results are included (the runs are too short to be scientifically meaningful, although the results are somewhat reproducible; I was still in the getting-familiar phase). The runs on the GPU either failed without any clear error message or aborted with a generic Segmentation Fault or "Address not mapped (1)". Just one GPU run completed (the NVT simulation on the RTX 2080 Super), perhaps by chance, and it reported results similar to the CPU runs.

Maybe I missed something, maybe my system is unstable, or maybe I did something wrong, but I can't understand why this is happening. I hope someone here can guide me.

Have a nice day,
Domenico

not_working_gpu_ELASTIC_T.tar.xz (884 KB)

Segmentation faults can have many reasons. They can be caused by broken RAM, overheating (quite common with GPUs), or programming issues.

There are some known but extremely difficult to reproduce issues in the GPU package when running GPU-accelerated calculations multiple times. Those resulted either in memory leaks or in crashes. After the latest updates, we should have plugged all memory leaks, but it is not clear that all conditions that may cause crashes with multiple run statements (with or without "pre no") have been resolved.
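
By "multiple run statements" I mean input decks that, schematically, contain something like the following (a generic sketch, not your actual input):

  # first segment: full setup (neighbor lists, forces) before running
  run 1000

  # later segments: "pre no" skips the re-initialization and reuses the
  # state from the end of the previous run
  run 1000 pre no
  run 1000 pre no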

One way to rule out other issues would be to run some of the other example inputs that support acceleration with the GPU package, but without using "run" multiple times, and then add some run statements. It is probably also worth checking with the very latest LAMMPS patch release, and you might want to compare to using the KOKKOS package as well.
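
For that comparison, the same input can be driven with either accelerator package from the command line alone; as a sketch, assuming a binary built with both packages, one GPU, and a placeholder input file name:

  # GPU package, 1 GPU shared by the MPI ranks
  mpirun -np 4 lmp -sf gpu -pk gpu 1 -in in.test

  # KOKKOS package with CUDA, 1 GPU
  mpirun -np 4 lmp -k on g 1 -sf kk -in in.test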

If you can confirm that the issue is only with the GPU package and only with the ELASTIC_T example, you should try to simplify the scripts and inputs until you have the most minimal system/input deck that still reproduces the GPU issues, and then report it here (or, better, as an issue on GitHub). Then somebody with experience in programming the GPU package can look into what exactly is causing this and try to resolve it.
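
To give an idea of what "minimal" means here: something on the order of a small LJ melt with a few consecutive run statements, e.g. (a sketch of the kind of input, not a confirmed reproducer):

  # minimal LJ system, similar to the standard melt/bench examples
  units           lj
  atom_style      atomic
  lattice         fcc 0.8442
  region          box block 0 10 0 10 0 10
  create_box      1 box
  create_atoms    1 box
  mass            1 1.0
  velocity        all create 1.44 87287 loop geom
  pair_style      lj/cut 2.5
  pair_coeff      1 1 1.0 1.0 2.5
  neighbor        0.3 bin
  neigh_modify    every 20 delay 0 check no
  fix             1 all nve

  # several consecutive runs, since that is where the suspected problem shows up
  run             200
  run             200 pre no
  run             200 pre no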

Thanks,
Axel.