CUDA illegal memory access (Kokkos) when using multiple GPUs

Dear all,

I have encountered a CUDA illegal memory access (in lib Kokkos) when using multiple GPUs. The system is a mixture of 2-bead, 3-bead, and 100-bead chains with harmonic bond and angle potential styles. The atom style is set to angle and the pair style is lj/expand. The exact same system runs fine on a multicore CPU machine with the OMP acceleration package. However, on a multi-GPU machine with the KOKKOS acceleration package, the following error occurs when I enable the KOKKOS package with more than one GPU using this command:

mpirun -np 4 lmp -in in.relax -k on g 4 -sf kk -pk kokkos neigh half newton on

ERROR message:

cudaStreamSynchronize(stream) error( cudaErrorIllegalAddress): an illegal memory access was encountered /home/brunowong/Documents/lammps/lammps-28Mar2023/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:172

It runs without error on 1 GPU but fails with multiple GPUs. I also checked that the in.lj and in.chain benchmarks run without error on either one or multiple GPUs.

The configuration of LAMMPS and my machine are as follows.

LAMMPS version: 28Mar2023
Installed packages:
CUDA version: cuda 12.1
MPI library: openmpi (4.1.5)
OS: Ubuntu 20.04
GPUs: 4 x NVIDIA GeForce GTX 1080 Ti

Any help is appreciated.

Please attach a simple and complete input script which can trigger the error.


This looks like a bug. As @mkanski said, we need an as-small-as-possible input script that reproduces the issue; we should be able to fix it quickly with that.


I find that the timestep at which the error is triggered depends on the size of my system. For a system of 400K atoms, the error occurs after a few tens of thousands of steps. For a system of 300K atoms, it occurs around 100K steps. Please find the input script attached. Thank you.

Data file for 400K atoms

Data file for 300K atoms

in.relax.reduced (1.1 KB)

Hmmm, I ran the 400k atom case out to 300k timesteps on 4 V100 GPUs but it didn’t crash.

I ran this on 2 different machines with V100s but can't get it to crash. Can you try recompiling with Kokkos bounds checking: `Kokkos_ENABLE_DEBUG_BOUNDS_CHECK=ON` for CMake or `KOKKOS_DEBUG=yes` for the traditional Makefile build? That should show if a view steps out of bounds.
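For reference, enabling bounds checking on an existing CMake build could look something like this (a sketch; the architecture option and paths are assumptions based on your GTX 1080 Ti setup, so adjust them to match your existing configuration):

```shell
# Reconfigure an existing KOKKOS/CUDA CMake build with bounds checking enabled.
# Kokkos_ARCH_PASCAL61 corresponds to the GTX 1080 Ti (compute capability 6.1).
cd build
cmake -D PKG_KOKKOS=ON \
      -D Kokkos_ENABLE_CUDA=ON \
      -D Kokkos_ARCH_PASCAL61=ON \
      -D Kokkos_ENABLE_DEBUG_BOUNDS_CHECK=ON \
      ../cmake
cmake --build . -j
```

With bounds checking on, an out-of-bounds view access should abort with a message naming the offending view, instead of a generic illegal memory access.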

Also, can you try the latest develop branch from GitHub? That is what I am using. Thanks.

I recompiled LAMMPS using the latest develop branch from GitHub with Kokkos bounds checking. The following error message showed up after 50000 steps.

cudaStreamSynchronize(stream) error( cudaErrorIllegalAddress): an illegal memory access was encountered /home/brunowong/Documents/lammps/lammps-develop/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:172
Kokkos::Impl::save_stacktrace() [0x7fe99de0a109]
Kokkos::Impl::traceback_callstack(std::ostream&) [0x7fe99ddfeb3e]
Kokkos::Impl::host_abort(char const*) [0x7fe99ddfeb6f]
Kokkos::Impl::cuda_internal_error_abort(cudaError, char const*, char const*, int) [0x7fe99de11374]
Kokkos::Impl::cuda_stream_synchronize(CUstream_st*, Kokkos::Impl::CudaInternal const*, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) [0x7fe99de116c3]
Kokkos::Impl::CudaParallelLaunchImpl<Kokkos::Impl::ParallelFor<LAMMPS_NS::NPairKokkosBuildFunctor<Kokkos::Cuda, 1, 1, 0>, Kokkos::TeamPolicyKokkos::Cuda, Kokkos::Cuda>, Kokkos::LaunchBounds<0u, 0u>, (Kokkos::Impl::Experimental::CudaLaunchMechanism)1>::launch_kernel(Kokkos::Impl::ParallelFor<LAMMPS_NS::NPairKokkosBuildFunctor<Kokkos::Cuda, 1, 1, 0>, Kokkos::TeamPolicyKokkos::Cuda, Kokkos::Cuda> const&, dim3 const&, dim3 const&, int, Kokkos::Impl::CudaInternal const*) [0x7fe99cab2cec]
LAMMPS_NS::NPairKokkos<Kokkos::Cuda, 1, 1, 0, 0, 0>::build(LAMMPS_NS::NeighList*) [0x7fe99cad7e88]
void LAMMPS_NS::NeighborKokkos::build_kokkosKokkos::Cuda(int) [0x7fe99ca28467]
LAMMPS_NS::VerletKokkos::run(int) [0x7fe99d05c595]
LAMMPS_NS::Run::command(int, char**) [0x7fe99c25571b]
LAMMPS_NS::Input::execute_command() [0x7fe99c0db61f]
LAMMPS_NS::Input::file() [0x7fe99c0db946]
__libc_start_main [0x7fe99b1a6083]
[lfl-SYS-7049GP-TRT:149339] *** Process received signal ***
[lfl-SYS-7049GP-TRT:149339] Signal: Aborted (6)
[lfl-SYS-7049GP-TRT:149339] Signal code: (-6)
[lfl-SYS-7049GP-TRT:149339] [ 0] /lib/x86_64-linux-gnu/[0x7fe99b5c5420]
[lfl-SYS-7049GP-TRT:149339] [ 1] /lib/x86_64-linux-gnu/[0x7fe99b1c500b]
[lfl-SYS-7049GP-TRT:149339] [ 2] /lib/x86_64-linux-gnu/[0x7fe99b1a4859]
[lfl-SYS-7049GP-TRT:149339] [ 3] /home/brunowong/Documents/lammps/lammps-develop/build/[0x7fe99ddfeb74]
[lfl-SYS-7049GP-TRT:149339] [ 4] /home/brunowong/Documents/lammps/lammps-develop/build/[0x7fe99de11374]
[lfl-SYS-7049GP-TRT:149339] [ 5] /home/brunowong/Documents/lammps/lammps-develop/build/[0x7fe99de116c3]
[lfl-SYS-7049GP-TRT:149339] [ 6] /home/brunowong/Documents/lammps/lammps-develop/build/[0x7fe99cab2cec]
[lfl-SYS-7049GP-TRT:149339] [ 7] /home/brunowong/Documents/lammps/lammps-develop/build/[0x7fe99cad7e88]
[lfl-SYS-7049GP-TRT:149339] [ 8] /home/brunowong/Documents/lammps/lammps-develop/build/[0x7fe99ca28467]
[lfl-SYS-7049GP-TRT:149339] [ 9] /home/brunowong/Documents/lammps/lammps-develop/build/[0x7fe99d05c595]
[lfl-SYS-7049GP-TRT:149339] [10] /home/brunowong/Documents/lammps/lammps-develop/build/[0x7fe99c25571b]
[lfl-SYS-7049GP-TRT:149339] [11] /home/brunowong/Documents/lammps/lammps-develop/build/[0x7fe99c0db61f]
[lfl-SYS-7049GP-TRT:149339] [12] /home/brunowong/Documents/lammps/lammps-develop/build/[0x7fe99c0db946]
[lfl-SYS-7049GP-TRT:149339] [13] /home/brunowong/Documents/lammps/lammps-develop/build/lmp_develop(+0xb081)[0x559213476081]
[lfl-SYS-7049GP-TRT:149339] [14] /lib/x86_64-linux-gnu/[0x7fe99b1a6083]
[lfl-SYS-7049GP-TRT:149339] [15] /home/brunowong/Documents/lammps/lammps-develop/build/lmp_develop(+0xb23e)[0x55921347623e]
[lfl-SYS-7049GP-TRT:149339] *** End of error message ***

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.

mpirun noticed that process rank 1 with PID 0 on node lfl-SYS-7049GP-TRT exited on signal 6 (Aborted).

Looks like it is failing in the neighbor-list build and getting an out-of-bounds view error (that is the signal 6 abort). Can you try rebuilding with the -G flag so that device symbols are included in the stack trace? Unfortunately, I don't have access to any GeForce GTX 1080 Ti GPUs, and it seems to run fine on V100. I could try on P100.
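For the CMake build, one way to pass -G through (an assumption: this relies on the nvcc_wrapper-based KOKKOS/CUDA build, where extra C++ flags are forwarded to nvcc) would be:

```shell
# Rebuild with CUDA device debug symbols (-G) so kernel names appear
# in stack traces. Note: -G disables device optimizations, so the run
# will be much slower; use it only to reproduce the crash.
cd build
cmake -D CMAKE_CXX_FLAGS="-G" .
cmake --build . -j
```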

Since GTX 1080 Ti GPUs are consumer-grade GPUs, there is also a non-zero chance that after a while the GPUs become somewhat unreliable due to sustained heat generation. This would be consistent with the observation that multiple GPUs are required to trigger the error. If the error also happens with just 2 GPUs, you could use the CUDA_VISIBLE_DEVICES environment variable to try different combinations and identify whether it happens in general with two GPUs or whether perhaps one of the GPUs is "weaker" than the others. After all, it is not easy to have proper cooling in a case with those 4 GPUs. In general, the thermal and clock-rate adjustments in the GPU's BIOS should prevent overheating, but the throttling profiles are often tuned for a single GPU rather than multiple.
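As an illustration (the device IDs and rank counts here are just examples), testing different pairs of GPUs could look like:

```shell
# Expose only devices 0 and 1 to CUDA and run on those two GPUs.
export CUDA_VISIBLE_DEVICES=0,1
mpirun -np 2 lmp -in in.relax -k on g 2 -sf kk -pk kokkos neigh half newton on

# Then try another pair, e.g. devices 2 and 3, to see whether the
# failure follows one particular card.
export CUDA_VISIBLE_DEVICES=2,3
mpirun -np 2 lmp -in in.relax -k on g 2 -sf kk -pk kokkos neigh half newton on
```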

Another point to consider is whether you can use the GPU package instead of KOKKOS. Specifically for consumer GPUs, it can result in much faster computation if you are willing to switch to mixed-precision computations (and your required accuracy is compatible with that), since KOKKOS only uses the double-precision floating-point units, which are significantly cut down on consumer GPUs compared to data-center GPUs. Using the GPU package could also help to identify whether there is a reliability issue due to the generated heat or whether there could be a bug in the KOKKOS package. (It may also trigger bugs in the GPU package, but those more often show up right at the initial kernel launches.) Consumer GPUs lean a bit more toward higher performance than reliability: the gaming market is very competitive, and a few bad pixels usually do not create a problem in computer games, while some badly computed data can be a disaster in a computer simulation.
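As a sketch of what that could look like (the build options shown are the standard GPU package CMake settings; adjust the architecture and paths to your setup):

```shell
# Build-time: configure the GPU package for CUDA with mixed precision.
# sm_61 corresponds to the GTX 1080 Ti.
cmake -D PKG_GPU=ON -D GPU_API=cuda -D GPU_PREC=mixed -D GPU_ARCH=sm_61 ../cmake
cmake --build . -j

# Run-time: one MPI rank per GPU, using the GPU package on all 4 GPUs.
mpirun -np 4 lmp -in in.relax -sf gpu -pk gpu 4
```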


I ran the 400k atom problem on 4x P100 GPUs for 200k timesteps but didn’t see the crash.

As Axel said, it is possible that one of your four GPUs is producing bad numerics. As was mentioned, you could try excluding certain GPUs with CUDA_VISIBLE_DEVICES and running on only 2 or 3 GPUs. If it consistently fails only when the same GPU is included, then that GPU is the suspect.

I used CUDA_VISIBLE_DEVICES to exclude certain GPUs and found that with only devices 0, 2, and 3, the problem runs with no crash. Therefore, the source of the error is likely device 1 producing bad numerics.

Thank you @stamoor and @akohlmey for your help and guidance!
