Hello,
I compiled LAMMPS using the following command:
cmake ../cmake \
-D PKG_KOKKOS=ON \
-D PKG_DIFFRACTION=ON \
-D PKG_ML-PACE=ON \
-D Kokkos_ENABLE_CUDA=ON \
-D Kokkos_ARCH_VOLTA70=ON \
-D CMAKE_CXX_COMPILER=$(which mpicxx) \
-D CMAKE_CXX_STANDARD=17 \
-D BUILD_MPI=ON
I am using Intel Xeon 8268 CPUs on a single node of a supercomputer, and I requested 8 cores and 4 GPUs (NVIDIA Tesla V100 32 GB).
Now I have this at the beginning of my log file:
LAMMPS (17 Apr 2024)
KOKKOS mode with Kokkos version 4.3.0 is enabled (src/KOKKOS/kokkos.cpp:72)
will use up to 4 GPU(s) per node
using 1 OpenMP thread(s) per MPI task
package kokkos
package kokkos gpu/aware on newton on neigh half
But at the end of my run, I don't get any information on how much the GPUs were used; I just get this:
Loop time of 25457.1 on 8 procs for 100000 steps with 316008 atoms
Performance: 0.170 ns/day, 141.428 hours/ns, 3.928 timesteps/s, 1.241 Matom-step/s
98.1% CPU use with 8 MPI tasks x 1 OpenMP threads
MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total
I have the impression the GPUs weren't used, but I can't tell for sure.
I ran my script with the following command:
mpirun -np $nprocs ~/mylammps2024/build/lmp -pk kokkos gpu/aware on newton on neigh half -k on g 4 -sf kk -in test.in.md > out_GPU.txt
Having 8 MPI processes was probably a mistake, but I thought I would at least get some information on GPU usage, and there was nothing.
Moreover, I used to run this job on 400 cores with 400 MPI processes. I have 300000 atoms and a complicated many-body potential. If I understand correctly, I should have one MPI process per GPU. How can I get information on how much the GPUs are used during the run?
There is no internal support for querying GPU utilization, since that is very hardware- and driver-specific. But you can use external tools like nvidia-smi for that.
The rules from the LAMMPS side are straightforward (see the example command after this list):
- you need 1 MPI task per GPU
- you need to tell LAMMPS how many GPUs per node you want to use
- it is theoretically possible to use multiple MPI tasks per GPU, but that is only beneficial if significant parts of the calculation are not supported by KOKKOS, and it requires the CUDA MPS daemon to be efficient
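For instance, with the 4 GPUs requested above, a launch line along these lines would satisfy both rules (binary path, input file, and package options are taken from the command quoted earlier in this thread and are assumptions about your setup):
mpirun -np 4 ~/mylammps2024/build/lmp -k on g 4 -sf kk \
    -pk kokkos gpu/aware on neigh half newton on \
    -in test.in.md
Here -np 4 gives one MPI task per GPU, and "-k on g 4" tells LAMMPS to use 4 GPUs per node.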
A simple way to gauge the GPU speedup is to run the same (short, test) job on the same hardware with the same number of MPI tasks, once with KOKKOS enabled and once without, and then compare the timings.
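As a sketch, the CPU-only reference run for the KOKKOS command above would simply drop the KOKKOS switches (same assumed paths):
mpirun -np 4 ~/mylammps2024/build/lmp -in test.in.md
Comparing the "Loop time" lines of the two log files then gives the effective speedup.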
Please keep in mind that GPUs require many more work units per processor than CPUs to be efficient and thus they require more atoms per MPI task if you want to use them efficiently. It is generally advisable (for CPU or GPU based jobs) to make strong scaling tests before starting any long running production calculation to determine how to best utilize the available resources. The “more is always better” paradigm does not apply and there are many factors that may need to be considered.
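A minimal sketch of such a strong scaling test, assuming the paths and input file from this thread, could be a short loop over GPU counts:
for n in 1 2 4; do
    # one MPI task per GPU; the log file name records the GPU count
    mpirun -np $n ~/mylammps2024/build/lmp -k on g $n -sf kk \
        -pk kokkos gpu/aware on neigh half newton on \
        -in test.in.md -log log.scaling_${n}gpu
done
Comparing the Loop time (or Matom-step/s) numbers across the runs shows where adding GPUs stops paying off.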
Thanks so much! So I should specify 4 GPUs per node and 4 MPI processes, for instance. Will LAMMPS then automatically assign one MPI task per GPU, or do I need another command to specify that too?
I have already answered this. If you need more details, please study the LAMMPS documentation. There is a long section with lots of details about running with KOKKOS.
If I read correctly, you told me that I need 1 MPI task per GPU, but you did not specify whether LAMMPS will detect this automatically or whether I need to specify it on the command line.
Thanks for the help!
As I already stated, please read the documentation. That is why we write it. I don’t like being used as a quasi-chatbot.
The number of MPI tasks (in total and per node) is determined by your batch system and/or your mpirun/mpiexec/srun command. The rest follows logically.
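As a sketch, on a SLURM cluster a batch script requesting one node with 4 GPUs and 4 MPI tasks might look like this (partition, walltime, and the exact GPU request syntax are cluster-specific assumptions; some sites use --gpus-per-node instead of --gres):
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:4
#SBATCH --time=01:00:00
mpirun -np 4 ~/mylammps2024/build/lmp -k on g 4 -sf kk \
    -pk kokkos gpu/aware on neigh half newton on -in test.in.md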
If you use the SLURM queuing system (and the admins allow it), you can log in to a node where your job is running and check GPU utilization.
Command for logging in to a job: srun --pty --overlap --jobid $JOBID bash
where $JOBID is the id of your job.
Then you can use nvidia-smi command to check GPU utilization: nvidia-smi -l 1 --format=csv --query-gpu=memory.used,utilization.gpu
This will poll GPU utilization and used GPU memory each second.
This approach has a disadvantage: you will only see the GPUs you have access to on a single node, so for a multi-node job you can only check a subset of the GPUs.
FWIW, the same exists for PBS/Torque clusters, too. It is even simpler, because you can ssh without a password into any node on which your job is running, while it is running.
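A sketch of that workflow (the node name is a placeholder; the nodes assigned to your job can be found with qstat -f $JOBID or, from within the job, in $PBS_NODEFILE):
ssh <node_name> nvidia-smi -l 1 --format=csv --query-gpu=memory.used,utilization.gpu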