[lammps-users] Memory leak when using the GPU-package

Hello Everyone,

I am getting a memory-leak issue when using the GPU package to compute the viscosity and thermal conductivity of liquid ethylene.

I am using the LAMMPS patch release version 2 July 2021.

The details of my system are:

Operating system: Ubuntu Linux 18.04 LTS
GPU card: Nvidia GeForce RTX 2080 Ti
Nvidia Driver: 450.119.03

I have tried to diagnose the issue using “valgrind --leak-check=full” in the following three configurations:

  1. With 2 MPI procs (Valgrind Output → memCheckOutMpi-2.txt)

  2. With 1 MPI proc (Valgrind Output → memCheckOutMpi-1.txt)

  3. Without MPI (Valgrind Output → memCheckOut-NoMpi.txt)
The simulation itself runs fine under Valgrind for the first two cases.

However, for the third case, when running Valgrind without the “mpirun” command, Valgrind itself crashes.

Without Valgrind, the simulation runs fine both with and without “mpirun”.
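For reference, the three Valgrind invocations were roughly of the following form (a sketch; executable name and options are the same as in my usual run command):

valgrind --leak-check=full mpirun -np 2 glmp-unstable-gcc7 -nc -sc none -l log.lammps -sf gpu -pk gpu 1 -in in.emdCombo
valgrind --leak-check=full mpirun -np 1 glmp-unstable-gcc7 -nc -sc none -l log.lammps -sf gpu -pk gpu 1 -in in.emdCombo
valgrind --leak-check=full glmp-unstable-gcc7 -nc -sc none -l log.lammps -sf gpu -pk gpu 1 -in in.emdCombo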

The LAMMPS input script, molecule file, and output logfile are also attached.

Please let me know if any more information is required from my side to reproduce, resolve, or shed light on this issue.

I look forward to hearing from you.

Thanks.

Warm regards,
Vaibhav.

log.lammps (179 KB)

memCheckOut-NoMpi.txt (11.5 KB)

memCheckOutMpi-1.txt (8.98 KB)

memCheckOutMpi-2.txt (9.56 KB)

in.emdCombo (3.41 KB)

c2h4_ua2Trappe.molecule (184 Bytes)

Do you have any evidence that those memory leaks are not happening without the GPU package?

The checks that were running with MPI were not checking LAMMPS, but the mpirun executable (that is why you didn’t get the illegal instruction crash).

to check LAMMPS, you need to use the following command line:

mpirun -np 1 valgrind --leak-check=full glmp-unstable-gcc7 -nc -sc none -l log.lammps -sf gpu -pk gpu 1 -in in.emdCombo

which will give you the same crash as you got without mpirun.

OpenMPI uses a lazy allocation scheme and some other tricks that can fool Valgrind’s memcheck tool.
We have some suppressions for that in the tools folder that you may want to try out: https://github.com/lammps/lammps/tree/master/tools/valgrind
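They are regular valgrind suppression files that are added to the command line above with valgrind’s --suppressions option, e.g. (path and file name here only illustrative, use whatever is in that folder):

mpirun -np 1 valgrind --leak-check=full --suppressions=/path/to/lammps/tools/valgrind/OpenMPI.supp glmp-unstable-gcc7 -sf gpu -pk gpu 1 -in in.emdCombo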

But over time this has become a bottomless pit (it already takes enough time and effort to track changes in the OpenMPI runtime libraries just to keep those suppressions working), so these days I do not check for memory leaks on executables compiled with OpenMPI at all, but only on those built with MPICH.

The illegal instruction error suggests that your LAMMPS executable was compiled on a CPU with instructions not known by the valgrind version you are using.

So I suggest you first confirm that you have no memory leaks in LAMMPS itself, without the GPU package, and that you can run valgrind correctly on the LAMMPS executable on the machine that you are testing on.

so far there is no confirmed leak at all.

Hi Axel,

Thanks very much for your response and the pointers on how to go about understanding the issue.

Steps taken to identify the issue:

  1. I reinstalled the latest version of valgrind.

  2. The new installation checks out fine using: “valgrind ls -l | grep ompi”. See output below.

  3. I ran valgrind again on the LAMMPS script with the recommended suppressions but still get 3 errors from 3 contexts.

  4. See attached file “memCheckOut-WithSuppression-NoMpi.txt”.

This issue is real because, when I run my simulations using the command below, after about a million time steps the memory usage on the CPU side (2 MPI procs) balloons to ~1.3 GB from around ~195 MB each before crashing.

“mpirun -np 2 glmp-unstable-gcc7 -nc -sc none -l log.lammps -sf gpu -pk gpu 1 -in in.emdCombo”

A couple of follow-up requests/questions:

  1. Given your remarks related to there being no testing for OpenMPI, would you recommend switching to MPICH?
  2. I am wondering if it’s possible for the LAMMPS Developers to list recommended versions of gcc, OpenMP, OpenMPI/MPICH, CUDA, and graphics drivers for the commonly used hardware, and so on? This could help avoid a lot of confusion and potential issues that arise from the use of non-standard libraries that LAMMPS is not rigorously tested against by the Developers.

As always, grateful for your time, help and kind consideration. 🙏

Warm regards,
Vaibhav.

Valgrind Test Output
/usr/bin$ valgrind ls -l | grep ompi
==7862== Memcheck, a memory error detector
==7862== Copyright (C) 2002-2017, and GNU GPL’d, by Julian Seward et al.
==7862== Using Valgrind-3.18.0.GIT and LibVEX; rerun with -h for copyright info
==7862== Command: ls -l
==7862==
lrwxrwxrwx 1 root root 53 Mar 12 12:27 glib-compile-schemas -> ../lib/x86_64-linux-gnu/glib-2.0/glib-compile-schemas
lrwxrwxrwx 1 root root 10 Feb 5 2018 ompi-clean -> orte-clean
-rwxr-xr-x 1 root root 26624 Feb 5 2018 ompi_info
lrwxrwxrwx 1 root root 7 Feb 5 2018 ompi-ps -> orte-ps
lrwxrwxrwx 1 root root 11 Feb 5 2018 ompi-server -> orte-server
lrwxrwxrwx 1 root root 8 Feb 5 2018 ompi-top -> orte-top
-rwxr-xr-x 1 root root 12119 Oct 25 2018 py3compile
-rwxr-xr-x 1 root root 11895 Apr 16 2018 pycompile
-rwxr-xr-x 1 root root 1973704 Oct 10 2018 teckit_compile
==7862==
==7862== HEAP SUMMARY:
==7862== in use at exit: 749,780 bytes in 3,332 blocks
==7862== total heap usage: 6,860 allocs, 3,528 frees, 2,131,539 bytes allocated
==7862==
==7862== LEAK SUMMARY:
==7862== definitely lost: 0 bytes in 0 blocks
==7862== indirectly lost: 0 bytes in 0 blocks
==7862== possibly lost: 0 bytes in 0 blocks
==7862== still reachable: 749,780 bytes in 3,332 blocks
==7862== suppressed: 0 bytes in 0 blocks
==7862== Rerun with --leak-check=full to see details of leaked memory
==7862==
==7862== For lists of detected and suppressed errors, rerun with: -s
==7862== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

memCheckOut-WithSuppression-NoMpi.txt (6.54 KB)

Hi Axel,

Thanks very much for your response and the pointers on how to go about understanding the issue.

Steps taken to identify the issue:

  1. I reinstalled the latest version of valgrind.

  2. The new installation checks out fine using: “valgrind ls -l | grep ompi”. See output below.

  3. I ran valgrind again on the LAMMPS script with the recommended suppressions but still get 3 errors from 3 contexts.

  4. See attached file “memCheckOut-WithSuppression-NoMpi.txt”

That file has the same illegal instruction error as before and is thus completely useless. All leaks reported there are bogus since LAMMPS didn’t finish in a regular fashion.
I suspect that this may be due to your choice of flags when compiling LAMMPS.
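As an illustration of what I mean, and only a guess about your setup: if the build used something like -march=native on a recent CPU, the binary can contain instructions that valgrind does not decode. A more conservatively configured build, sketched here with an illustrative preset name, avoids that:

cmake ../cmake -C ../cmake/presets/basic.cmake \
      -D PKG_GPU=yes -D GPU_API=opencl \
      -D CMAKE_BUILD_TYPE=RelWithDebInfo \
      -D CMAKE_CXX_FLAGS="-O2 -g -march=x86-64"   # no -march=native, so no instructions valgrind may not know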

This issue is real because, when I run my simulations using the command below, after about a million time steps the memory usage on the CPU side (2 MPI procs) balloons to ~1.3 GB from around ~195 MB each before crashing.

“mpirun -np 2 glmp-unstable-gcc7 -nc -sc none -l log.lammps -sf gpu -pk gpu 1 -in in.emdCombo”

Since debugging anything with a GPU is an order of magnitude more complex than on the CPU, you need to first confirm that there is no leak on the CPU.
Then you need to turn off features and functionality on the GPU side to reduce things to the bare minimum that still shows the issue.

also, if this is something that happens after so many timesteps, a simple workaround is to break down your calculation into multiple parts and then use restarts and the run start/stop/upto keywords to continue seamlessly.
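a minimal sketch of what I mean (the restart/run keywords are standard LAMMPS commands; file names and step counts are just placeholders):

# part 1: write a restart file every 500000 steps and stop at step 1000000
restart        500000 tmp.restart.*
run            1000000 upto
# part 2, in a follow-up input deck: read the last restart and continue seamlessly
# read_restart   tmp.restart.1000000
# run            2000000 upto
# (add "start 0 stop 2000000" to the run commands if any fix ramps values over the whole run)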

A couple of follow-up requests/questions:

  1. Given your remarks related to there being no testing for OpenMPI, would you recommend switching to MPICH?

you are misunderstanding what I said. We are still testing with OpenMPI and, in fact, use OpenMPI exclusively on the HPC facilities here at Temple, but we don’t use it for searching for memory leaks with valgrind.

  2. I am wondering if it’s possible for the LAMMPS Developers to list recommended versions of gcc, OpenMP, OpenMPI/MPICH, CUDA, and graphics drivers for the commonly used hardware, and so on? This could help avoid a lot of confusion and potential issues that arise from the use of non-standard libraries that LAMMPS is not rigorously tested against by the Developers.

there is no real need for that, since LAMMPS is highly portable, and most of the reasons for requiring a specific version of a compiler or library are bugs in that software, which is not something we have control over or the time to track down. if there is a memory leak, it will be seen on all platforms, unless it is due to external software (like the CUDA driver or toolkit runtime). given the complexity of LAMMPS, it is impossible to test for and identify problems for all possible permutations of software and hardware, combinations of compiled-in features, and input script variations. a real problem in LAMMPS will not go away by using recommended hardware and software. there are just too many different versions of hardware and software that LAMMPS runs well on, and so few where it doesn’t, that it is impossible to make such recommendations with more confidence than stating that any version should work, which is what we currently do.

if you want to compile LAMMPS exactly the way we do for our integration, regression, and unit testing, you can use the same singularity containers that we use: https://github.com/lammps/lammps/tree/master/tools/singularity
and the scripts and configurations we use to build and test LAMMPS: https://github.com/lammps/lammps-testing/
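typical use of those containers looks like the following (a sketch only; pick the .def file that matches your distribution, the file names and preset below are just examples):

# build the container image from one of the definition files (needs root or --fakeroot)
sudo singularity build ubuntu18.04.sif tools/singularity/ubuntu18.04.def
# configure and compile LAMMPS inside the container
singularity exec ubuntu18.04.sif bash -c "mkdir build && cd build && cmake -C ../cmake/presets/most.cmake ../cmake && make -j 8"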

axel.

Hi Axel,

Thanks for your response.

Yes, I did realize that these valgrind outputs are useless.

I also do want to figure this out using Valgrind without MPI, as recommended, and I totally agree that it is immensely more complicated to debug with the GPU package.

Although I hadn’t mentioned it earlier, I had checked this without the GPU package as well.

The result is the same as in the other cases: Valgrind crashes.

However, I haven’t run longer simulations without the GPU package, so I am not sure whether the issue is reproduced in that case.

To test this, I have now started a longer simulation run without the GPU package using MPI alone.

Would you recommend running this on a single core as well?

You mention:
"…then you need to turn off features and functionality on the GPU side to reduce it to the bare minimum that still shows the issue. "
What features and functionality are you referring to here? It would help to know for running tests.

In light of your comments and what I have mentioned above, it is possible that there is a compile-flag issue in my LAMMPS build, as you’ve suggested.

I am compiling LAMMPS using:

cmake -C ../cmake/presets/vt_all_on.cmake -C ../cmake/presets/vt_nolib.cmake ../cmake -DKokkos_ARCH_SKX=yes -DKokkos_ARCH_TURING75=yes -DKokkos_ENABLE_CUDA=yes -DKokkos_ENABLE_OPENMP=yes -DDOWNLOAD_SCAFACOS=yes -DCMAKE_CXX_COMPILER=/home/vthakore/ownCloud/Computation/src/pkgs/gitLammps/lib/kokkos/bin/nvcc_wrapper -DBUILD_SHARED_LIBS=on -DCMAKE_Fortran_COMPILER=/usr/bin/gfortran-4.8 -DLAMMPS_SIZES=bigbig

The preset files are attached.

The compilation completes successfully and the executable links properly against the various libraries, as checked with:
ldd lmp (See output below.)

Please let me know if the compilation flags look all right or if I am missing anything.

Learning to use and deploy Singularity containers is something I wish to do, because we are also building LAMMPS for multiple users on a GPU cluster.

I look forward to your response…

Thanks.

Warm regards,
Vaibhav.

Output from "ldd lmp"
/build-unstable-gcc7-ompi4$ ldd lmp
linux-vdso.so.1 (0x00007fff395be000)
libmpi.so.40 => /usr/local/lib/libmpi.so.40 (0x00007f16bdd93000)
libcudart.so.11.0 => /usr/local/cuda-11.0/lib64/libcudart.so.11.0 (0x00007f16bdb15000)
liblammps.so.0 (0x00007f16ad3ef000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f16acfe2000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f16acdca000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f16ac9d9000)
libopen-rte.so.40 => /usr/local/lib/libopen-rte.so.40 (0x00007f16ac722000)
libopen-pal.so.40 => /usr/local/lib/libopen-pal.so.40 (0x00007f16ac409000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f16ac201000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f16abe63000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f16abc44000)
/lib64/ld-linux-x86-64.so.2 (0x00007f16be2ca000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f16aba40000)
libcufft.so.10 => /usr/local/cuda-11.0/lib64/libcufft.so.10 (0x00007f16a1b7c000)
libjpeg.so.8 => /usr/lib/x86_64-linux-gnu/libjpeg.so.8 (0x00007f16a1914000)
libfftw3.so.3 => /usr/local/lib/libfftw3.so.3 (0x00007f16a1604000)
libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007f16a13c1000)
libgsl.so.23 => /usr/lib/x86_64-linux-gnu/libgsl.so.23 (0x00007f16a0f5f000)
libgslcblas.so.0 => /usr/lib/x86_64-linux-gnu/libgslcblas.so.0 (0x00007f16a0d20000)
libmpi_usempi.so.40 => /usr/local/lib/libmpi_usempi.so.40 (0x00007f16a0b1d000)
libmpi_mpifh.so.40 => /usr/local/lib/libmpi_mpifh.so.40 (0x00007f16a08c3000)
libkokkoscore.so.3.4 => /home/vthakore/ownCloud/Computation/src/pkgs/gitLammps/build-unstable-gcc7-ompi4/lib/kokkos/core/src/libkokkoscore.so.3.4 (0x00007f16a056b000)
libgfortran.so.4 => /usr/lib/x86_64-linux-gnu/libgfortran.so.4 (0x00007f16a018c000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f169ff6f000)
libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007f169fd6c000)
libquadmath.so.0 => /usr/lib/x86_64-linux-gnu/libquadmath.so.0 (0x00007f169fb25000)

vt_nolib.cmake (491 Bytes)

vt_all_on.cmake (1.07 KB)

i have no time to debug your compilation approach or train you in how to do things best.
most importantly, since I have no knowledge of the details of the setup of your machine, there are too many unknowns in the procedure.
i have pointed out what we are doing and what is a reasonable approach to narrow down to identifying a problem and its origin.
if you want one of the LAMMPS developers to debug a memory leak, you have to do a much better job at identifying the origin, providing us with suitable evidence of a problem, and putting together an input deck so that the problem can be reproduced with reasonable effort.
i have no more to add and explain to what i wrote in previous emails. the rest is up to you.

Fair enough, point taken…
I will investigate more.

Thanks again for your time and help.

Warm regards,
Vaibhav.

Hi Axel,

I investigated the memory leak further.

There appear to be no leaks in the LAMMPS code when I run the script using just a single MPI rank. Please see attached file “memCheckOut-defMarch.txt”.

The leaks do show up, though, when I run it with the GPU package and no MPI (i.e. without mpirun).
The simulation runs fine when using Valgrind.

However, unlike previously, the Valgrind report now contains information on the leaks.
See attached file “memCheckOut-WithSuppr-defMarch-NoMpi-Gpu-noExtraOpts.txt”

For convenience, I am attaching the script, molecule file and the output logfile again.

Please let me know your thoughts on whether this is still a result of system settings or a faulty compilation on my end, or whether it is real.
Also, please let me know if you need anything else from my side to help debug the issue further.

Thanks again for your time and help.

Warm regards,
Vaibhav.

memCheckOut-WithSuppr-defMarch-NoMpi-Gpu-noExtraOpts.txt (186 KB)

memCheckOut-defMarch.txt (1.8 KB)

in.emdCombo (3.41 KB)

c2h4_ua2Trappe.molecule (184 Bytes)

log.lammps (179 KB)

could you please make another test where you modify the -pk flag from “-pk gpu 1” to “-pk gpu 1 pair/only yes”?
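i.e., roughly (same executable, suppression files, and input as in your last test; only the package flag changes):

valgrind --leak-check=full glmp-unstable-gcc7 -sf gpu -pk gpu 1 pair/only yes -in in.emdCombo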

most of the reported leaks are within the OpenCL libs and thus beyond our control.

I can confirm that the change in this commit:
https://github.com/lammps/lammps/pull/2876/commits/6f46ac57b937bd4ee7f0f2bdd6388eee315749aa

will remove a couple of small “definite” leaks (direct and indirect) from the pppm/gpu style.

I suspect that most of the remaining reports are bogus and at any rate require a GPU programming expert with experience beyond mine to check out.

Hi Axel,

Thank you for the changes to the pppm/gpu to fix the memory leaks.

Please find attached the Valgrind output for the run with the recommended “pair/only on” command line switch.
This is of course without the recent commit to fix the pppm/gpu style that you’ve mentioned in your email.

Question: Will the other aspects of the leaks pertaining to OpenCL be looked into by a LAMMPS developer who is also an expert in GPU programming?

With the code changes from your recent commit/fix to the pppm/gpu style, I will check how much of an improvement there is and let you know.

Again, please do let me know if anything more is still required…

Thanks.

Warm regards,
Vaibhav.

memCheckOut-WithSuppr-defMarch-NoMpi-Gpu-pairOnlyOn.txt (175 KB)

I cannot make any promises right now. I have contacted somebody, but since people are volunteering their time to work on these things (all of the original authors of the GPU package have moved on and have other commitments; they will provide updates to the code when they have time and/or it is commensurate with their jobs or personal needs), we don’t know whether this will be followed up sooner or later.

if you want to make another check, please compile a LAMMPS executable with CUDA instead of OpenCL and check if there is a difference.

Hi Axel,

Thanks very much for your message.

I tried building LAMMPS with the GPU package using the CUDA backend (the -DGPU_API=cuda option).
The compilation runs error free and I get the LAMMPS executable.

However, two things occur:

  1. The linking is not proper for the executable:

       libkokkoscontainers.so.3.4 => not found
       libkokkoscore.so.3.4 => not found

     Notably, this does not occur for the default OpenCL backend.

  2. When I then run the executable with Valgrind, I also get the following error:

       glmp-unstable-cuda: error while loading shared libraries: liblammps.so.0: cannot open shared object file: No such file or directory

I have had trouble using the CUDA backend for the GPU package before as well, but never reported it since the OpenCL backend works fine.
Also, just a note that I compile in the KOKKOS package as well, for both the OpenCL-backend and the CUDA-backend builds of the GPU package.

With the OpenCL backend for the GPU package, I recompiled LAMMPS with the memory-leak fixes you’d suggested, and the leaks reduce to just 24 bytes in one block, which appears to be specific to the GPU-programming aspect you had mentioned.
Please find attached the Valgrind output for the same (file → memCheck-postfix-gpu-ocl.txt) .

I also profiled the KOKKOS package with Valgrind and found some small memory leaks (168 bytes definitely lost) there as well.
See attached file → memCheckOut-kokkos.txt.
The LAMMPS script, logfile output and the molecule file are also attached for convenience.

I must note here that I haven’t so far run longer simulations with the KOKKOS package, so I am not sure about the severity of the issue.

Thanks.

Warm regards,
Vaibhav.

memCheck-postfix-gpu-ocl.txt (181 KB)

memCheckOut-kokkos.txt (176 KB)

log.lammps-memCheck-kokkos (180 KB)

in.emdCombo (3.41 KB)

c2h4_ua2Trappe.molecule (184 Bytes)

sorry, but you have reached the end of my patience. you keep coming up with issues that demonstrate a puzzling lack of basic Linux/Unix knowledge, and you keep following strategies that make little sense (e.g. why compile with KOKKOS when you want to debug the GPU package?).
I don’t have the time to train you in these things. If you can produce meaningful data that is not riddled with beginner’s problems, come back. But don’t expect any more time from me spent on this otherwise. I have already spent too much for too little in return.

Hi Axel,

With the CUDA backend, there are no “definite” leaks in the GPU package. See attached file → memCheckOut-gpu-cuda.txt
You’re right! For debugging purposes, it does not make sense to compile LAMMPS with unnecessary extra packages.

Warm regards,
Vaibhav.

memCheckOut-gpu-cuda.txt (24.5 KB)