I am simulating a system of LiTFSI + urea with an Allegro machine learning model (https://github.com/mir-group/allegro). I have compiled LAMMPS with the pair_allegro style (https://github.com/mir-group/pair_allegro) and with KOKKOS to get a performance increase. The LAMMPS version is 29 Aug 2024, and I run the simulations on one Nvidia H100 GPU. When I try to use two or more H100 GPUs, there is no performance increase (or even a decrease), which I find a bit surprising. Am I doing something wrong, or is this expected?
Here is the Slurm job script for a test run with 3 GPUs (minimal example):
#!/bin/sh
#SBATCH --partition=GPUQ
#SBATCH --gres=gpu:h100:3 # specify number and type of GPUs
#SBATCH --time=0-00:25:00
# Set environment variables for KOKKOS
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
cd ${SLURM_SUBMIT_DIR}
mkdir ${SLURM_JOBID}
module purge
source ~/.virtualenvs/allegro/bin/activate
module load foss/2024a
module load CUDA/12.4.0
export PYTORCH_JIT_USE_NNC_NOT_NVFUSER=1
# Run with installed version of KOKKOS lammps
srun /cluster/home/oystegul/lammps_allegro_test/build/lmp -sf kk -k on g 3 -pk kokkos newton on neigh full -in test.lmp
I use #SBATCH --gres=gpu:h100:1 when I run on one GPU, and of course -k on g 1 in the srun line.
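Concretely, the single-GPU job differs only in these two lines:
#SBATCH --gres=gpu:h100:1   # request a single H100 instead of three
srun /cluster/home/oystegul/lammps_allegro_test/build/lmp -sf kk -k on g 1 -pk kokkos newton on neigh full -in test.lmp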
The output from the top of log.lammps:
LAMMPS (29 Aug 2024)
KOKKOS mode with Kokkos version 4.3.1 is enabled (src/KOKKOS/kokkos.cpp:72)
will use up to 3 GPU(s) per node
WARNING: When using a single thread, the Kokkos Serial backend (i.e. Makefile.kokkos_mpi_only) gives better performance than the OpenMP backend (src/KOKKOS/kokkos.cpp:202)
using 1 OpenMP thread(s) per MPI task
package kokkos
package kokkos newton on neigh full
...
Reading data file ...
triclinic box = (0 0 0) to (40.326 40.326 40.326) with tilt (0 0 0)
1 by 1 by 1 MPI processor grid
reading atoms ...
5632 atoms
reading velocities ...
5632 velocities
read_data CPU = 0.032 seconds
The end of log.lammps:
Loop time of 1011.52 on 1 procs for 10000 steps with 5632 atoms
Performance: 0.427 ns/day, 56.196 hours/ns, 9.886 timesteps/s, 55.678 katom-step/s
99.2% CPU use with 1 MPI tasks x 1 OpenMP threads
MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total
---------------------------------------------------------------
Pair | 1004.9 | 1004.9 | 1004.9 | 0.0 | 99.34
Neigh | 0.17765 | 0.17765 | 0.17765 | 0.0 | 0.02
Comm | 0.54798 | 0.54798 | 0.54798 | 0.0 | 0.05
Output | 0.032691 | 0.032691 | 0.032691 | 0.0 | 0.00
Modify | 5.7022 | 5.7022 | 5.7022 | 0.0 | 0.56
Other | | 0.1867 | | | 0.02
Nlocal: 5632 ave 5632 max 5632 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Nghost: 9794 ave 9794 max 9794 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Neighs: 517881 ave 517881 max 517881 min
Histogram: 1 0 0 0 0 0 0 0 0 0
FullNghs: 1.03585e+06 ave 1.03585e+06 max 1.03585e+06 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Total # of neighbors = 1035848
Ave neighs/atom = 183.92188
Neighbor list builds = 236
Dangerous builds = 0
This is a tiny number of atoms. GPUs cannot be efficient for this since there are not enough work units to run concurrently.
Please see the LAMMPS Benchmarks page for some (rather old) benchmark numbers comparing CPU and GPU performance. As you can see, you need a system with two to three orders of magnitude more atoms to get a significant speedup with a GPU versus a CPU. With current “Hopper” generation GPUs, compared to the “Kepler” GPUs in the benchmark, the system size required to properly utilize the GPUs is even larger. The same goes for running across multiple GPUs.
Yes, I agree that the system is small. However, the simulation speed is still quite slow (0.427 ns/day), and pretty much all of the GPU resources go to evaluating the machine learning (ML) potential of the system (Pair: 99.34%). So I thought that using several GPUs might still speed up the ML model evaluation. But this is apparently not the case?
You can monitor the GPU utilization with the nvidia-smi command (you may need to request an interactive queue slot or log in a second time to the node running the job). Only if the GPU utilization is high is there a chance of improving performance by using a second GPU.
Additionally, you should check with your local cluster administrators which settings are best to use. You may even be incurring large parallelization overheads by trying to spread your system across too many GPUs. (This will not show up in the LAMMPS timing breakdown because, from LAMMPS’s point of view, all the calculations occur inside the ML potential evaluation, and LAMMPS has no additional visibility into data transfer between GPUs or other possible overheads.)
If still in doubt, you should be able to SSH into the node during the calculation and use the nvidia-smi utility to inspect GPU utilization during your run. You should definitely get advice from your cluster administrators on how to do this and how to interpret the results.
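If your cluster allows it, one way to do this is to open a shell on the compute node, or attach an interactive step to the running job; the node name and job ID below are placeholders you need to fill in:
# Option 1: SSH to the compute node running the job (if the cluster permits it)
ssh <nodename>
watch -n 2 nvidia-smi
# Option 2: attach an interactive step to the running job (needs a Slurm version with --overlap)
srun --jobid=<jobid> --overlap --pty watch -n 2 nvidia-smi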
The pair style is called allegro; pair_style allegro.
I am running a job now, and from the nvidia-smi output it is very easy to see why the performance does not increase when using two GPUs: only one of them is actually running! The GPU-Util is 0% for the second GPU. Do you have any idea why this is the case?
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06 Driver Version: 555.42.06 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:19:00.0 Off | 0 |
| N/A 62C P0 495W / 700W | 37140MiB / 81559MiB | 99% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H100 80GB HBM3 On | 00000000:CB:00.0 Off | 0 |
| N/A 37C P0 74W / 700W | 4MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 99022 C ...tegul/lammps_allegro_test/build/lmp 37130MiB |
+-----------------------------------------------------------------------------------------+
I might check the pair_allegro GitHub page as well, srtee.
LAMMPS will not use more than one GPU per MPI rank (note the message will use **up to** 3 GPU(s) per node), so you will need to use at least 2 MPI processes to use 2 GPUs.
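In your Slurm script that would look roughly like the following; the exact flags may depend on how your cluster is configured:
#SBATCH --gres=gpu:h100:2   # two GPUs on the node
#SBATCH --ntasks=2          # one MPI rank per GPU
srun /cluster/home/oystegul/lammps_allegro_test/build/lmp -sf kk -k on g 2 -pk kokkos newton on neigh full -in test.lmp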
This turned out to be a bit more involved than I thought. I had already tried to start a 2-GPU job with 2 tasks (MPI processes), but for some reason two parallel jobs were started (two singletons), with this error message at the top of the output:
No PMIx server was reachable, but a PMI1/2 was detected.
If srun is being used to launch application, 2 singletons will be started.
After some googling, I found https://github.com/open-mpi/ompi/issues/10286, which discusses this issue. The fix is to add --mpi=pmix to the srun line. When I did that, the job crashed with a segmentation fault:
Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x42024bc80)
This issue is discussed on the KOKKOS package page of the LAMMPS documentation under “CUDA and MPI library compatibility”; the reason is that the MPI library on the cluster I am using is not GPU-aware. The fix is to add gpu/aware off to the -pk kokkos options on the command line.
Finally, for a 2-GPU job I ended up with this command line for running LAMMPS, and now it works:
srun --mpi=pmix /cluster/home/oystegul/lammps_allegro_test/build/lmp -sf kk -k on g 2 -pk kokkos gpu/aware off newton on neigh full -in test.lmp
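For completeness, here is roughly what the full 2-GPU job script looks like now (the partition, module versions and paths are of course specific to my cluster):
#!/bin/sh
#SBATCH --partition=GPUQ
#SBATCH --gres=gpu:h100:2   # two GPUs on the node
#SBATCH --ntasks=2          # one MPI rank per GPU
#SBATCH --time=0-00:25:00
# Set environment variables for KOKKOS
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
cd ${SLURM_SUBMIT_DIR}
mkdir ${SLURM_JOBID}
module purge
source ~/.virtualenvs/allegro/bin/activate
module load foss/2024a
module load CUDA/12.4.0
export PYTORCH_JIT_USE_NNC_NOT_NVFUSER=1
# --mpi=pmix starts proper MPI ranks instead of singletons;
# gpu/aware off is needed because the cluster MPI library is not GPU-aware
srun --mpi=pmix /cluster/home/oystegul/lammps_allegro_test/build/lmp -sf kk -k on g 2 -pk kokkos gpu/aware off newton on neigh full -in test.lmp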
Preliminary testing suggests that the scaling to several GPUs is not half bad: the performance was 0.427 ns/day with 1 GPU, 0.800 ns/day with 2 GPUs, and 1.137 ns/day with 3 GPUs (H100). Beyond 3 GPUs the scaling seems to start plateauing.
Thanks for all the help!