Hi Axel,
Thanks for your response.
Yes, I did realize that these valgrind outputs are useless.
I also want to figure this out using valgrind without MPI, as you recommended, and I fully agree that debugging is immensely more complicated with the GPU package.
Although I had not mentioned it earlier, I had also checked this without the GPU package.
The result is the same as in the other cases: Valgrind crashes.
However, I have not yet run longer simulations without the GPU package, so I am not sure whether the issue is reproduced in that case.
To test this, I have now started a longer simulation run without the GPU package using MPI alone.
Would you recommend running this on a single core as well?
You mention:
"…then you need to turn off features and functionality on the GPU side to reduce it to the bare minimum that still shows the issue. "
Which features and functionality are you referring to here? Knowing this would help me set up the tests.
In light of your comments and what I have mentioned above, it is possible that there is a compile flag issue in my LAMMPS build, as you suggested.
I am compiling LAMMPS using:
cmake -C …/cmake/presets/vt_all_on.cmake -C …/cmake/presets/vt_nolib.cmake …/cmake -DKokkos_ARCH_SKX=yes -DKokkos_ARCH_TURING75=yes -DKokkos_ENABLE_CUDA=yes -DKokkos_ENABLE_OPENMP=yes -DDOWNLOAD_SCAFACOS=yes -DCMAKE_CXX_COMPILER=/home/vthakore/ownCloud/Computation/src/pkgs/gitLammps/lib/kokkos/bin/nvcc_wrapper -DBUILD_SHARED_LIBS=on -DCMAKE_Fortran_COMPILER=/usr/bin/gfortran-4.8 -DLAMMPS_SIZES=bigbig
The preset files are attached.
The compilation completes successfully, and "ldd lmp" shows that the binary links against the expected libraries (see output below).
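As a quick cross-check on the ldd output, any library that could not be resolved shows up as a "not found" entry. A minimal sketch of a filter for such entries follows; the helper name missing_libs is made up for illustration, and it assumes only the usual "name => path" ldd output format.

```shell
#!/bin/sh
# Hypothetical helper: list shared libraries that ldd could not resolve.
# It reads ldd output on stdin, so it is meant to be used in a pipe, e.g.:
#   ldd lmp | missing_libs
missing_libs() {
    # Unresolved entries look like "libfoo.so.1 => not found";
    # keep only those lines and print the library name (first field).
    grep 'not found' | awk '{print $1}'
}
```

If the function prints nothing, every library listed by ldd was resolved to a path.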
Please let me know whether the compilation flags look all right or whether I am missing anything.
I would also like to learn about Singularity containers and how to deploy them, because we are building LAMMPS for multiple users on a GPU cluster.
I look forward to your response.
Thanks.
Warm regards,
Vaibhav.
Output from "ldd lmp":
/build-unstable-gcc7-ompi4$ ldd lmp
linux-vdso.so.1 (0x00007fff395be000)
libmpi.so.40 => /usr/local/lib/libmpi.so.40 (0x00007f16bdd93000)
libcudart.so.11.0 => /usr/local/cuda-11.0/lib64/libcudart.so.11.0 (0x00007f16bdb15000)
liblammps.so.0 (0x00007f16ad3ef000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f16acfe2000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f16acdca000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f16ac9d9000)
libopen-rte.so.40 => /usr/local/lib/libopen-rte.so.40 (0x00007f16ac722000)
libopen-pal.so.40 => /usr/local/lib/libopen-pal.so.40 (0x00007f16ac409000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f16ac201000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f16abe63000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f16abc44000)
/lib64/ld-linux-x86-64.so.2 (0x00007f16be2ca000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f16aba40000)
libcufft.so.10 => /usr/local/cuda-11.0/lib64/libcufft.so.10 (0x00007f16a1b7c000)
libjpeg.so.8 => /usr/lib/x86_64-linux-gnu/libjpeg.so.8 (0x00007f16a1914000)
libfftw3.so.3 => /usr/local/lib/libfftw3.so.3 (0x00007f16a1604000)
libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007f16a13c1000)
libgsl.so.23 => /usr/lib/x86_64-linux-gnu/libgsl.so.23 (0x00007f16a0f5f000)
libgslcblas.so.0 => /usr/lib/x86_64-linux-gnu/libgslcblas.so.0 (0x00007f16a0d20000)
libmpi_usempi.so.40 => /usr/local/lib/libmpi_usempi.so.40 (0x00007f16a0b1d000)
libmpi_mpifh.so.40 => /usr/local/lib/libmpi_mpifh.so.40 (0x00007f16a08c3000)
libkokkoscore.so.3.4 => /home/vthakore/ownCloud/Computation/src/pkgs/gitLammps/build-unstable-gcc7-ompi4/lib/kokkos/core/src/libkokkoscore.so.3.4 (0x00007f16a056b000)
libgfortran.so.4 => /usr/lib/x86_64-linux-gnu/libgfortran.so.4 (0x00007f16a018c000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f169ff6f000)
libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007f169fd6c000)
libquadmath.so.0 => /usr/lib/x86_64-linux-gnu/libquadmath.so.0 (0x00007f169fb25000)
vt_nolib.cmake (491 Bytes)
vt_all_on.cmake (1.07 KB)