CUDA illegal memory access (KOKKOS) with multiple MPI ranks per GPU

I have encountered a CUDA illegal memory access (in lib/kokkos) when using multiple MPI ranks per GPU. I have tried an older version of Open MPI, and the same error appears. I would really appreciate your help.

LAMMPS (Sep 15)
CUDA 11.7.1
GCC 11.3.0
Open MPI (4.0.5/4.1.4)

terminate called after throwing an instance of 'std::runtime_error'
what(): cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access was encountered /home/lammps-new/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:151
[ASUS:682130] *** Process received signal ***
[ASUS:682130] Signal: Aborted (6)
[ASUS:682130] Signal code: (-6)
[ASUS:682130] [ 0] /lib64/[0x14a945feddc0]
[ASUS:682130] [ 1] /lib64/[0x14a94603a56c]
[ASUS:682130] [ 2] /lib64/[0x14a945fedd16]
[ASUS:682130] [ 3] /lib64/[0x14a945fc17f3]
[ASUS:682130] [ 4] /home/superroot/nfsshare/easybuild/software/GCCcore/11.3.0/lib64/[0x14a94634196a]
[ASUS:682130] [ 5] /home/superroot/nfsshare/easybuild/software/GCCcore/11.3.0/lib64/[0x14a94634cf9a]
[ASUS:682130] [ 6] /home/superroot/nfsshare/easybuild/software/GCCcore/11.3.0/lib64/[0x14a94634d005]
[ASUS:682130] [ 7] /home/superroot/nfsshare/easybuild/software/GCCcore/11.3.0/lib64/[0x14a94634d259]
[ASUS:682130] [ 8] /home/ruisi/lammps-new/build_gpu/build_gpu_KOKKOS_gnu_openmpi_MKL_OPT_OMP_INTEL/lmp[0x583cd1]
[ASUS:682130] [ 9] /home/lammps-new/build_gpu/build_gpu_KOKKOS_gnu_openmpi_MKL_OPT_OMP_INTEL/lmp[0x3ee3b15]
[ASUS:682130] [10] /home/lammps-new/build_gpu/build_gpu_KOKKOS_gnu_openmpi_MKL_OPT_OMP_INTEL/lmp[0x3ee44ea]
[ASUS:682130] [11] /home/lammps-new/build_gpu/build_gpu_KOKKOS_gnu_openmpi_MKL_OPT_OMP_INTEL/lmp[0x3eb52ed]
[ASUS:682130] [12] /home/lammps-new/build_gpu/build_gpu_KOKKOS_gnu_openmpi_MKL_OPT_OMP_INTEL/lmp[0xbf9397]
[ASUS:682130] [13] /home/lammps-new/build_gpu/build_gpu_KOKKOS_gnu_openmpi_MKL_OPT_OMP_INTEL/lmp[0xbf9c86]
[ASUS:682130] [14] /home/lammps-new/build_gpu/build_gpu_KOKKOS_gnu_openmpi_MKL_OPT_OMP_INTEL/lmp[0x1f7512f]
[ASUS:682130] [15] /home/lammps-new/build_gpu/build_gpu_KOKKOS_gnu_openmpi_MKL_OPT_OMP_INTEL/lmp[0x20c2334]
[ASUS:682130] [16] /home/lammps-new/build_gpu/build_gpu_KOKKOS_gnu_openmpi_MKL_OPT_OMP_INTEL/lmp[0x71ba16]
[ASUS:682130] [17] /home/lammps-new/build_gpu/build_gpu_KOKKOS_gnu_openmpi_MKL_OPT_OMP_INTEL/lmp[0x63bcef]
[ASUS:682130] [18] /home/lammps-new/build_gpu/build_gpu_KOKKOS_gnu_openmpi_MKL_OPT_OMP_INTEL/lmp[0x63c1de]
[ASUS:682130] [19] /home/lammps-new/build_gpu/build_gpu_KOKKOS_gnu_openmpi_MKL_OPT_OMP_INTEL/lmp[0x5b040d]
[ASUS:682130] [20] /lib64/[0x14a945fd8eb0]
[ASUS:682130] [21] /lib64/[0x14a945fd8f60]
[ASUS:682130] [22] /home/lammps-new/build_gpu/build_gpu_KOKKOS_gnu_openmpi_MKL_OPT_OMP_INTEL/lmp[0x6173c5]

With KOKKOS, you should have only one MPI rank per GPU. There is no benefit to oversubscribing (unlike with the GPU package). If you need additional parallelism, you should also enable OpenMP support when compiling LAMMPS with KOKKOS and then also use OpenMP threads.
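As a rough sketch of that advice, the launch line below uses one MPI rank per GPU and adds OpenMP threads on the host. It assumes a single node with 8 GPUs, a binary named lmp built with both the CUDA and OpenMP backends of KOKKOS, and the in.rhodo benchmark input; the thread count is only a placeholder. The command is printed rather than executed so that it can be checked first.

```shell
# Hypothetical values: adjust to your node and build.
NGPUS=8        # GPUs per node (assumption)
NTHREADS=2     # OpenMP threads per MPI rank (assumption)
LMP=lmp        # LAMMPS binary built with KOKKOS (CUDA + OpenMP), path assumed

# One MPI rank per GPU: -np matches the GPU count passed to "-k on g".
CMD="mpirun -np ${NGPUS} ${LMP} -k on g ${NGPUS} t ${NTHREADS} -sf kk -in in.rhodo"

# Dry run: print the command. To execute it, use: eval "$CMD"
echo "$CMD"
```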

Also, I have observed that KOKKOS with CUDA performs worse than the GPU package alone (with one MPI rank per GPU, i.e. 1 MPI task and 2 OpenMP threads). Do you have any suggestions for achieving the claimed acceleration of the KOKKOS package? I would really appreciate your help.

Performance depends on many factors. The KOKKOS and GPU packages use different strategies for utilizing the GPU, and each works better or worse depending on the system and environment. The GPU package accelerates only the parts of the calculation that accelerate very well, and it can overlap computation on the GPU and the CPU; this requires moving position data to the GPU and retrieving force data from the GPU at every step. The KOKKOS package tries to keep data on the GPU as much as possible and thus minimizes the cost of transfers, up to the point of enabling GPU-to-GPU transfer with supported MPI libraries. Thus KOKKOS is faster when all styles in use (for forces, fixes, and computes) support KOKKOS, so that data transfers between host and GPU are minimized, and when there are many atoms per GPU.

In addition, please note that the default setting for the GPU package is to use mixed-precision floating point (i.e. compute forces in single precision but accumulate them in double precision, and similar), while KOKKOS only supports double precision. Depending on your GPU hardware, that can result in significantly different performance for the same settings (1 MPI rank per GPU, no threads). So to do an “apples-to-apples” comparison of the performance, you would have to compile the GPU package and library with double precision. Of course, if the physics of your system is such that mixed precision is sufficiently accurate, then you should take advantage of the additional speedup.
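For the double-precision comparison build, a CMake configure line along the following lines should work. This is a sketch under assumptions: it presumes a CUDA build of the GPU package, a build directory inside the LAMMPS source tree, and an A100-class card (hence sm_80); other settings of your existing build would need to be added.

```shell
# Hypothetical configure flags for a double-precision GPU package build.
# GPU_ARCH=sm_80 assumes an NVIDIA A100; change it for other hardware.
CMAKE_ARGS="-D PKG_GPU=on -D GPU_API=cuda -D GPU_ARCH=sm_80 -D GPU_PREC=double"

# Dry run: print the configure command (run from a build directory
# inside the LAMMPS source tree, so that ../cmake exists).
echo "cmake ${CMAKE_ARGS} ../cmake"
```

GPU_PREC also accepts mixed (the default) and single, so the same line can be reused to build the variants being compared.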

Bottom line: which package is the better choice depends very much on the details. You have to observe and test carefully. There is no general “do this, not that” kind of advice that will always result in the best performance and required accuracy.

Please also note the discussion of improving LAMMPS performance in this chapter of the manual: 7. Accelerate performance — LAMMPS documentation

@rose what pair style are you using? The KOKKOS package does support using multiple MPI ranks/GPU, and while generally this is not faster if everything is already on the GPU, an exception is when some kernels are running on the host CPU. Except for mixed precision, the Kokkos package can also mimic the GPU package by using the “pair/only” package option, see package command — LAMMPS documentation.
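To illustrate the “pair/only” option, here is a sketch of a launch line that keeps only the pair style on the GPU and shares each GPU between two MPI ranks. The GPU count, the ranks-per-GPU value, the binary name lmp, and the input file are all assumptions; the command is printed rather than executed.

```shell
# Hypothetical values: adjust to your node and build.
NGPUS=8            # GPUs per node (assumption)
RANKS_PER_GPU=2    # MPI ranks sharing each GPU (assumption)
LMP=lmp            # LAMMPS binary built with KOKKOS/CUDA, path assumed

# Total ranks = GPUs x ranks per GPU.
NP=$((NGPUS * RANKS_PER_GPU))

# "pair/only on" runs only the pair style on the GPU; other styles stay on the host.
CMD="mpirun -np ${NP} ${LMP} -k on g ${NGPUS} -sf kk -pk kokkos pair/only on -in in.rhodo"

# Dry run: print the command. To execute it, use: eval "$CMD"
echo "$CMD"
```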

Also note: if you are using multiple MPI/GPU, with either the GPU package or KOKKOS package, you will want to enable CUDA MPS (multi-process service), which can give a significant speedup in this case.
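A minimal sketch of enabling MPS around a job, assuming the NVIDIA tool nvidia-cuda-mps-control is installed and that you are allowed to start the daemon on the node (on shared clusters this is often handled by the scheduler instead). The status variable is purely illustrative.

```shell
# Start the CUDA MPS control daemon if the NVIDIA tool is available.
if command -v nvidia-cuda-mps-control >/dev/null 2>&1; then
    # -d starts the control daemon in the background.
    if nvidia-cuda-mps-control -d; then
        MPS_STATUS=started
    else
        MPS_STATUS=failed
    fi
else
    MPS_STATUS=unavailable
fi
echo "CUDA MPS: ${MPS_STATUS}"

# ... launch the LAMMPS job with multiple MPI ranks per GPU here ...

# Shut the daemon down afterwards:
#   echo quit | nvidia-cuda-mps-control
```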

Hi, I am using lj/charmm/coul/long and am wondering whether it is supported. I have tried pair/only, which does help in mimicking the GPU package.

You can see whether a pair style supports acceleration either from the individual pair style page or from the overview here: 5.8. Pair_style potentials — LAMMPS documentation
If a pair style supports one of the accelerator packages, it is indicated with a letter g, k, i, o, t.

Yes, I can see from the documentation that this pair style is supported in the KOKKOS package, but I still failed to run it with multiple MPI tasks per GPU (as stamoor mentioned), so I am quite confused about this. I appreciate your help.

If it is supported by KOKKOS then multiple MPI ranks per GPU should work (however inadvisable that may be). To track down what may be a cause, Stan will need a minimal (but complete) input deck so he can debug it. Also you should report what GPU hardware you are using.

I don’t really understand your obsession with running KOKKOS this way. As already mentioned, the design of KOKKOS is different and thus the gains are likely minimal or negative.

Hi, the reason I want to try multiple MPI ranks per GPU is that I haven’t fully utilized the GPU memory and communication bandwidth. For debugging, I am using 8 A100s. The command I used is:

OMP_PROC_BIND=spread OMP_PLACES=threads mpirun -np 16 --report-bindings --map-by socket --bind-to core /home/ruisi/lammps-new/build_gpu/test/lmp -sf gpu -pk gpu 8 -k on g 8 -sf kk -pk kokkos newton on neigh half -in in.rhodo

This command line makes no sense. Using the GPU package and the KOKKOS package at the same time is a very, VERY bad idea.
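To make the problem concrete, the sketch below splits that command line into two separate alternatives, one per package. The rank counts and the binary name lmp are assumptions for an 8-GPU node, and the binding and reporting options of the original command are left out for brevity. Both commands are only printed.

```shell
# Hypothetical values: adjust to your node and build.
NGPUS=8    # GPUs per node (assumption)
LMP=lmp    # LAMMPS binary, path assumed

# Alternative A: KOKKOS package only, one MPI rank per GPU.
KOKKOS_CMD="mpirun -np ${NGPUS} ${LMP} -k on g ${NGPUS} -sf kk -pk kokkos newton on neigh half -in in.rhodo"

# Alternative B: GPU package only; here 16 ranks share the 8 GPUs.
GPU_CMD="mpirun -np 16 ${LMP} -sf gpu -pk gpu ${NGPUS} -in in.rhodo"

# Dry run: print both and pick ONE of them, never a mix of the two.
echo "$KOKKOS_CMD"
echo "$GPU_CMD"
```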

Also, using RAM or memory bandwidth to the maximum should not be the optimization goal; the goal should be maximizing GPU utilization, or rather absolute speed. Since oversubscribing GPUs comes with overhead, it can even be faster not to maximize utilization.

I have to advise you again to carefully study the section in the LAMMPS manual about optimizing performance.

In addition, while in.rhodo is a standard benchmark, ultimately what you should try to optimise is performance for your system. The settings which work best for in.rhodo, or even multiple copies of in.rhodo (using replicate), will not necessarily be the same as the settings for your system.

In addition, the time you spend optimising simulations is time you are spending not actually running your simulation in production. There is always a trade-off between the two, and while it is always worth experimenting a little bit, you need to check if your simulations might already be fast enough for you to Just Run Things and get your work done.

Thank you for your advice, now I can successfully run it. I would also like to suggest indicating which accelerator packages can be used concurrently in the manual. This would be a great help.

Appreciate your reply.

Even if this were possible, your command line would have been incompatible with it, since you can only have one suffix setting at a time and you would have needed to use the hybrid syntax there: suffix command — LAMMPS documentation

I think nobody has even considered that somebody would get the crazy idea to use two different GPU accelerated packages at the same time.

Just to confirm: for the CPU packages (OMP, OPT, INTEL), are they allowed to be used concurrently?

Yes, a typical use is INTEL as the primary suffix and then falling back to OPENMP if the style is not available in INTEL.
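That fallback can be requested with the hybrid suffix syntax. A sketch, assuming a binary named lmp built with both the INTEL and OPENMP packages; the rank count and input file are placeholders, and the command is only printed.

```shell
# Hypothetical values: adjust to your machine and build.
NP=4       # number of MPI ranks (assumption)
LMP=lmp    # LAMMPS binary built with INTEL and OPENMP packages, path assumed

# "hybrid intel omp": try the /intel variant of each style first,
# then fall back to the /omp variant if no /intel variant exists.
CMD="mpirun -np ${NP} ${LMP} -sf hybrid intel omp -in in.rhodo"

# Equivalent input-script command:  suffix hybrid intel omp
echo "$CMD"
```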

Understood, thank you for your answer.