I am simulating a system of LiTFSI + urea with an Allegro machine learning model (https://github.com/mir-group/allegro). I have compiled LAMMPS with the pair_allegro style (https://github.com/mir-group/pair_allegro) and with KOKKOS to get a performance increase. The LAMMPS version is 29 Aug 2024, and I run the simulations on one Nvidia H100 GPU. When I try to use two or more H100 GPUs, there is no performance increase (or even a decrease), which I find a bit surprising. Am I doing something wrong, or is this expected?
Here is the Slurm job script for a test run with 3 GPUs (minimal example):
#!/bin/sh
#SBATCH --partition=GPUQ
#SBATCH --gres=gpu:h100:3 # specify number and type of GPUs
#SBATCH --time=0-00:25:00
# Set environment variables for KOKKOS
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
cd ${SLURM_SUBMIT_DIR}
mkdir ${SLURM_JOBID}
module purge
source ~/.virtualenvs/allegro/bin/activate
module load foss/2024a
module load CUDA/12.4.0
export PYTORCH_JIT_USE_NNC_NOT_NVFUSER=1
# Run with installed version of KOKKOS lammps
srun /cluster/home/oystegul/lammps_allegro_test/build/lmp -sf kk -k on g 3 -pk kokkos newton on neigh full -in test.lmp
I use #SBATCH --gres=gpu:h100:1 when I run on one GPU, and of course -k on g 1 in the srun line.
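Concretely, the single-GPU job differs only in these two lines:
#SBATCH --gres=gpu:h100:1   # request a single H100 instead of three
srun /cluster/home/oystegul/lammps_allegro_test/build/lmp -sf kk -k on g 1 -pk kokkos newton on neigh full -in test.lmp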
The output from the top of log.lammps:
LAMMPS (29 Aug 2024)
KOKKOS mode with Kokkos version 4.3.1 is enabled (src/KOKKOS/kokkos.cpp:72)
will use up to 3 GPU(s) per node
WARNING: When using a single thread, the Kokkos Serial backend (i.e. Makefile.kokkos_mpi_only) gives better performance than the OpenMP backend (src/KOKKOS/kokkos.cpp:202)
using 1 OpenMP thread(s) per MPI task
package kokkos
package kokkos newton on neigh full
...
Reading data file ...
triclinic box = (0 0 0) to (40.326 40.326 40.326) with tilt (0 0 0)
1 by 1 by 1 MPI processor grid
reading atoms ...
5632 atoms
reading velocities ...
5632 velocities
read_data CPU = 0.032 seconds
The end of log.lammps:
Loop time of 1011.52 on 1 procs for 10000 steps with 5632 atoms
Performance: 0.427 ns/day, 56.196 hours/ns, 9.886 timesteps/s, 55.678 katom-step/s
99.2% CPU use with 1 MPI tasks x 1 OpenMP threads
MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total
---------------------------------------------------------------
Pair | 1004.9 | 1004.9 | 1004.9 | 0.0 | 99.34
Neigh | 0.17765 | 0.17765 | 0.17765 | 0.0 | 0.02
Comm | 0.54798 | 0.54798 | 0.54798 | 0.0 | 0.05
Output | 0.032691 | 0.032691 | 0.032691 | 0.0 | 0.00
Modify | 5.7022 | 5.7022 | 5.7022 | 0.0 | 0.56
Other | | 0.1867 | | | 0.02
Nlocal: 5632 ave 5632 max 5632 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Nghost: 9794 ave 9794 max 9794 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Neighs: 517881 ave 517881 max 517881 min
Histogram: 1 0 0 0 0 0 0 0 0 0
FullNghs: 1.03585e+06 ave 1.03585e+06 max 1.03585e+06 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Total # of neighbors = 1035848
Ave neighs/atom = 183.92188
Neighbor list builds = 236
Dangerous builds = 0
This is a tiny number of atoms. GPUs cannot be efficient for this since there are not enough work units to run concurrently.
Please see the LAMMPS Benchmarks page for some (rather old) benchmark numbers comparing CPU and GPU performance. As you can see, you need a system with two to three orders of magnitude more atoms to get a significant speedup with a GPU versus a CPU. With current “Hopper” generation GPUs, compared to the “Kepler” GPUs in the benchmark, the system size required to properly utilize the GPUs is even larger. The same goes for running across multiple GPUs.
Yes, I agree that the system is small. However, the simulation speed is still quite slow (0.427 ns/day), and pretty much all of the GPU resources go to evaluating the machine learning (ML) potential of the system (Pair: 99.34%). So I thought that using several GPUs might still speed up the ML model evaluation. But this is apparently not the case?
You can monitor the GPU utilization with the nvidia-smi command (you may need to request an interactive queue slot or log in a second time to the node running the job). Only if the GPU utilization is high is there a chance of improving performance by using a second GPU.
Additionally, you should check with your local cluster administrators which settings are best to use. You may even be incurring large parallelization overheads by trying to spread your system across too many GPUs. (This will not show up in the LAMMPS timing breakdown because, from LAMMPS’s point of view, all the calculations occur inside the ML potential evaluation, and LAMMPS has no additional visibility into data transfer between GPUs or other possible overheads.)
If still in doubt, you should be able to SSH into the node during the calculation and use the nvidia-smi utility to inspect GPU utilization during your run. You should definitely get advice from your cluster administrators on how to do this and how to interpret the results.
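If your cluster allows it, one way to do this is to open a shell on the compute node, or attach an interactive step to the running job; the node name and job ID below are placeholders you need to fill in:
# Option 1: SSH to the compute node running the job (if the cluster permits it)
ssh <nodename>
watch -n 2 nvidia-smi
# Option 2: attach an interactive step to the running job (needs a Slurm version with --overlap)
srun --jobid=<jobid> --overlap --pty watch -n 2 nvidia-smi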
The pair style is called allegro; pair_style allegro.
I am running a job now, and from the nvidia-smi output it is very easy to see why the performance does not increase when using two GPUs: only one of them is actually running! The GPU-Util is 0% for the second GPU. Do you have any idea why this is the case?
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06 Driver Version: 555.42.06 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:19:00.0 Off | 0 |
| N/A 62C P0 495W / 700W | 37140MiB / 81559MiB | 99% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H100 80GB HBM3 On | 00000000:CB:00.0 Off | 0 |
| N/A 37C P0 74W / 700W | 4MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 99022 C ...tegul/lammps_allegro_test/build/lmp 37130MiB |
+-----------------------------------------------------------------------------------------+
I might check the pair_allegro GitHub page as well, srtee.
LAMMPS will not use more than one GPU per MPI rank (note the message will use **up to** 3 GPU(s) per node), so you will need to use at least 2 MPI processes to use 2 GPUs.
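In your Slurm script that would look roughly like the following; the exact flags may depend on how your cluster is configured:
#SBATCH --gres=gpu:h100:2   # two GPUs on the node
#SBATCH --ntasks=2          # one MPI rank per GPU
srun /cluster/home/oystegul/lammps_allegro_test/build/lmp -sf kk -k on g 2 -pk kokkos newton on neigh full -in test.lmp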
This turned out to be a bit more involved than I thought. I had already tried to start a 2-GPU job with 2 tasks (MPI processes), but for some reason two parallel jobs were started (two singletons), with this error message at the top of the output:
No PMIx server was reachable, but a PMI1/2 was detected.
If srun is being used to launch application, 2 singletons will be started.
After some googling, I found https://github.com/open-mpi/ompi/issues/10286, which discusses this issue. The fix is to add --mpi=pmix to the srun line. When I did that, the job crashed with a segmentation fault:
Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x42024bc80)
This issue is discussed on the KOKKOS package page of the LAMMPS documentation under “CUDA and MPI library compatibility”; the reason is that the MPI library on the cluster I am using is not GPU-aware. The fix is to add gpu/aware off to the -pk kokkos options on the command line.
Finally, for a 2-GPU job I ended up with this command line for running LAMMPS, and now it works:
srun --mpi=pmix /cluster/home/oystegul/lammps_allegro_test/build/lmp -sf kk -k on g 2 -pk kokkos gpu/aware off newton on neigh full -in test.lmp
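For completeness, here is roughly what the full 2-GPU job script looks like now (the partition, module versions and paths are of course specific to my cluster):
#!/bin/sh
#SBATCH --partition=GPUQ
#SBATCH --gres=gpu:h100:2   # two GPUs on the node
#SBATCH --ntasks=2          # one MPI rank per GPU
#SBATCH --time=0-00:25:00
# Set environment variables for KOKKOS
export OMP_PROC_BIND=spread
export OMP_PLACES=threads
cd ${SLURM_SUBMIT_DIR}
mkdir ${SLURM_JOBID}
module purge
source ~/.virtualenvs/allegro/bin/activate
module load foss/2024a
module load CUDA/12.4.0
export PYTORCH_JIT_USE_NNC_NOT_NVFUSER=1
# --mpi=pmix starts proper MPI ranks instead of singletons;
# gpu/aware off is needed because the cluster MPI library is not GPU-aware
srun --mpi=pmix /cluster/home/oystegul/lammps_allegro_test/build/lmp -sf kk -k on g 2 -pk kokkos gpu/aware off newton on neigh full -in test.lmp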
Preliminary testing suggests that the scaling to several GPUs is not half bad: the performance was 0.427 ns/day with 1 GPU, 0.800 ns/day with 2 GPUs, and 1.137 ns/day with 3 GPUs (H100). Beyond 3 GPUs the scaling seems to start plateauing.
Thanks for all the help!