Hello,
I compiled LAMMPS using the following command:
cmake ../cmake \
-D PKG_KOKKOS=ON \
-D PKG_DIFFRACTION=ON \
-D PKG_ML-PACE=ON \
-D Kokkos_ENABLE_CUDA=ON \
-D Kokkos_ARCH_VOLTA70=ON \
-D CMAKE_CXX_COMPILER=$(which mpicxx) \
-D CMAKE_CXX_STANDARD=17 \
-D BUILD_MPI=ON
I am using Intel Xeon 8268 CPUs on a single node of a supercomputer, and I requested 8 cores and 4 GPUs (NVIDIA Tesla V100 32 GB).
Now I have this at the beginning of my log file:
LAMMPS (17 Apr 2024)
KOKKOS mode with Kokkos version 4.3.0 is enabled (src/KOKKOS/kokkos.cpp:72)
will use up to 4 GPU(s) per node
using 1 OpenMP thread(s) per MPI task
package kokkos
package kokkos gpu/aware on newton on neigh half
But at the end of my run, I don't get any information on how much the GPUs were used; I just get this:
Loop time of 25457.1 on 8 procs for 100000 steps with 316008 atoms
Performance: 0.170 ns/day, 141.428 hours/ns, 3.928 timesteps/s, 1.241 Matom-step/s
98.1% CPU use with 8 MPI tasks x 1 OpenMP threads
MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total
I have the impression the GPUs weren't used, but I can't tell for sure.
I ran my script with the following command:
mpirun -np $nprocs ~/mylammps2024/build/lmp -pk kokkos gpu/aware on newton on neigh half -k on g 4 -sf kk -in test.in.md > out_GPU.txt
Having 8 MPI processes was probably a mistake, but I thought I would at least get some information on GPU usage, and there was nothing.
Moreover, I used to run this job on 400 cores with 400 MPI processes. I have 300000 atoms and a complicated many-body potential. If I understand correctly, I should have one MPI process per GPU. How can I get information on how much the GPUs are used during the run?
There is no internal support for querying GPU utilization, since that is very hardware- and driver-specific. But you can use external tools like nvidia-smi for that.
The rules from the LAMMPS side are straightforward (see the example command after this list):
- you need 1 MPI task per GPU
- you need to tell LAMMPS how many GPUs per node you want to use
- it is theoretically possible to use multiple MPI tasks per GPU, but that is only beneficial if significant parts of the calculation are not supported by KOKKOS, and it requires the CUDA MPS daemon to be efficient
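For instance, with the 4 GPUs requested above, a launch line along these lines would satisfy both rules (binary path, input file, and package options are taken from the command quoted earlier in this thread and are assumptions about your setup):
mpirun -np 4 ~/mylammps2024/build/lmp -k on g 4 -sf kk \
    -pk kokkos gpu/aware on neigh half newton on \
    -in test.in.md
Here -np 4 gives one MPI task per GPU, and "-k on g 4" tells LAMMPS to use 4 GPUs per node.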
A simple way to gauge the GPU speedup is to run the same (short, test) job on the same hardware with the same number of MPI tasks, once with KOKKOS enabled and once without, and then compare the timings.
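As a sketch, the CPU-only reference run for the KOKKOS command above would simply drop the KOKKOS switches (same assumed paths):
mpirun -np 4 ~/mylammps2024/build/lmp -in test.in.md
Comparing the "Loop time" lines of the two log files then gives the effective speedup.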
Please keep in mind that GPUs require many more work units per processor than CPUs to be efficient and thus they require more atoms per MPI task if you want to use them efficiently. It is generally advisable (for CPU or GPU based jobs) to make strong scaling tests before starting any long running production calculation to determine how to best utilize the available resources. The “more is always better” paradigm does not apply and there are many factors that may need to be considered.
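A minimal sketch of such a strong scaling test, assuming the paths and input file from this thread, could be a short loop over GPU counts:
for n in 1 2 4; do
    # one MPI task per GPU; the log file name records the GPU count
    mpirun -np $n ~/mylammps2024/build/lmp -k on g $n -sf kk \
        -pk kokkos gpu/aware on neigh half newton on \
        -in test.in.md -log log.scaling_${n}gpu
done
Comparing the Loop time (or Matom-step/s) numbers across the runs shows where adding GPUs stops paying off.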
Thanks so much! So I should specify 4 GPUs per node and 4 MPI processes, for instance. Will LAMMPS then automatically assign one MPI task per GPU, or do I need another command to specify that too?
I have already answered this. If you need more details, please study the LAMMPS documentation. There is a long section with lots of details about running with KOKKOS.
If I read correctly, you told me that I need 1 MPI task per GPU, but you did not specify whether LAMMPS will detect this automatically or whether I need to specify it on the command line.
Thanks for the help!
As I already stated, please read the documentation. That is why we write it. I don’t like being used as a quasi-chatbot.
The number of MPI tasks (in total and per node) is determined by your batch system and/or your mpirun/mpiexec/srun command. The rest follows logically.
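As a sketch, on a SLURM cluster a batch script requesting one node with 4 GPUs and 4 MPI tasks might look like this (partition, walltime, and the exact GPU request syntax are cluster-specific assumptions; some sites use --gpus-per-node instead of --gres):
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:4
#SBATCH --time=01:00:00
mpirun -np 4 ~/mylammps2024/build/lmp -k on g 4 -sf kk \
    -pk kokkos gpu/aware on neigh half newton on -in test.in.md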
If you use the SLURM queuing system (and the admins allow it), you can log in to a node where your job is running and check GPU utilization.
Command for logging in to a job: srun --pty --overlap --jobid $JOBID bash
where $JOBID is the id of your job.
Then you can use nvidia-smi command to check GPU utilization: nvidia-smi -l 1 --format=csv --query-gpu=memory.used,utilization.gpu
This will poll GPU utilization and used GPU memory each second.
This approach has a disadvantage: you will only see the GPUs you have access to on a single node, so for a multi-node job you can only check a subset of the GPUs.
FWIW, the same exists for PBS/Torque clusters, too. It is even simpler, because you can ssh without a password into any node on which your job is running, while it is running.
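A sketch of that workflow (the node name is a placeholder; the nodes assigned to your job can be found with qstat -f $JOBID or, from within the job, in $PBS_NODEFILE):
ssh <node_name> nvidia-smi -l 1 --format=csv --query-gpu=memory.used,utilization.gpu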