Reproducing NVIDIA benchmark

martin_12345 · July 18, 2024, 10:24am

Hi,
I am working on a GPU implementation of molecular dynamics using LJ potentials, and I want to use LAMMPS for comparison. As a first step, I want to make sure that I built LAMMPS correctly and that I am getting the best performance. I am using nodes with 8 A100 GPUs (40GB) and two EPYC CPUs. The GPUs are interconnected all-to-all (NV12).

I am looking at “LAMMPS [LJ 2.5] ATOM-Time Steps/s” in this table:

The expected performance is 4.05e9 for 8xA100 (80GB).

I have executed the benchmark as described here:

But the performance is only 2.3e9. Could this be due to the different GPU model (40GB vs. 80GB)? The 80GB model has more memory bandwidth.
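A rough way to judge whether the memory difference alone could explain the gap is to scale the published number by the ratio of peak memory bandwidths. The sketch below assumes the LJ benchmark is memory-bandwidth bound (an assumption, not something confirmed in this thread) and uses the peak bandwidth figures from NVIDIA's A100 SXM datasheets (1555 GB/s for 40GB, 2039 GB/s for 80GB). The function and variable names are made up for illustration.

```python
# Peak memory bandwidth in GB/s, taken from NVIDIA's A100 SXM datasheets.
BW_A100_40GB = 1555.0
BW_A100_80GB = 2039.0

published = 4.05e9  # atom-time steps/s for 8x A100 80GB (from the table)
measured = 2.3e9    # atom-time steps/s for 8x A100 40GB (this run)


def bandwidth_scaled(rate, bw_from, bw_to):
    """Scale a rate linearly with memory bandwidth.

    Only meaningful if the workload is bandwidth bound (assumption).
    """
    return rate * bw_to / bw_from


# Best-case estimate for the 40GB cards under the bandwidth-bound assumption.
expected_40gb = bandwidth_scaled(published, BW_A100_80GB, BW_A100_40GB)

print(f"bandwidth-scaled estimate: {expected_40gb:.2e}")
print(f"measured / published:      {measured / published:.2f}")
print(f"measured / estimate:       {measured / expected_40gb:.2f}")
```

Under that assumption the 40GB cards would be expected to reach roughly 3.1e9, so the bandwidth difference by itself would not account for a result of 2.3e9.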

I am using one MPI process per GPU, with each process bound to the correct GPU, CPU cores, and network device.
OpenMPI and UCX are built with CUDA support, and I’m sure that GPU-aware MPI is working. However, LAMMPS complains: “WARNING: Turning off GPU-aware MPI since it is not detected, use ‘-pk kokkos gpu/aware on’ to override (src/KOKKOS/kokkos.cpp:291)”

If I set “-pk kokkos gpu/aware off”, performance drops drastically.

The configure line for LAMMPS was:
cmake -C …/cmake/presets/basic.cmake -D PKG_ADIOS=off -D PKG_GPU=on -D PKG_KOKKOS=on -D GPU_API=cuda -D GPU_ARCH=sm_80 -D Kokkos_ARCH_ZEN2=yes -D Kokkos_ARCH_AMPERE80=yes -D Kokkos_ENABLE_CUDA=yes -D Kokkos_ENABLE_OPENMP=yes …/cmake

Is there something I am doing wrong?
Greetings,
Martin

stamoor · July 19, 2024, 9:51am

Performance is highly dependent on system size. How many atoms are you running? I believe the NVIDIA benchmarks used 8 million atoms (strong scaled) across the GPUs. Can you reproduce the 1 GPU benchmark number? That should be much easier because it doesn’t depend on MPI. I would make that the first step, and we can go from there.
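To compare a run against the table, the figure of merit can be computed from the "Loop time" summary line that LAMMPS prints at the end of a run. The sketch below assumes the metric is defined as atoms times steps divided by the loop time in seconds. The helper name is hypothetical, and the sample line is made up to show the format; it is not output from the runs discussed here.

```python
import re

# Summary line LAMMPS prints after a run, for example:
# "Loop time of 10.0 on 8 procs for 1000 steps with 8000000 atoms"
LOOP_RE = re.compile(
    r"Loop time of\s+(?P<time>[0-9.eE+-]+)\s+on\s+(?P<procs>\d+)\s+procs\s+"
    r"for\s+(?P<steps>\d+)\s+steps\s+with\s+(?P<atoms>\d+)\s+atoms"
)


def atom_steps_per_second(log_text):
    """Return atoms * steps / loop time for the first 'Loop time' line found."""
    m = LOOP_RE.search(log_text)
    if m is None:
        raise ValueError("no 'Loop time' line found in log text")
    return int(m["atoms"]) * int(m["steps"]) / float(m["time"])


# Made-up sample line, for illustration only.
sample = "Loop time of 10.0 on 8 procs for 1000 steps with 8000000 atoms"
rate = atom_steps_per_second(sample)
print(f"{rate:.2e} atom-time steps/s")
```

Running the same calculation on the 1 GPU log and on the 8 GPU log, with the same atom count as the published benchmark, gives numbers that can be compared directly with the table.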