Hi,
I am working on a GPU implementation of molecular dynamics using Lennard-Jones (LJ) potentials, and I want to use LAMMPS for comparison. As a first step, I want to make sure that I built LAMMPS correctly and that I am getting the best possible performance. I am using nodes with 8 A100 GPUs (40GB) and two EPYC CPUs. The GPUs are interconnected all-to-all via NVLink (NV12).
I am looking at “LAMMPS [LJ 2.5] ATOM-Time Steps/s” in this table:
The expected performance is 4.05e9 atom-time steps/s for 8xA100 (80GB).
I have executed the benchmark as described here:
However, I only get 2.3e9 atom-time steps/s. Could this be due to the different GPU model (40GB vs. 80GB)? The 80GB model has more memory bandwidth.
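To put a rough number on the bandwidth question, here is a back-of-the-envelope sketch. It assumes the LJ benchmark is purely memory-bandwidth bound (which may not hold) and uses peak bandwidths of 1555 GB/s (A100 40GB) and 2039 GB/s (A100 80GB SXM) from the vendor datasheets as I remember them; these are not values I measured, and the function name is only illustrative.

```python
# Rough estimate: scale the published 8xA100 (80GB) rate by the ratio of
# peak memory bandwidths. Assumes a purely bandwidth-bound benchmark.

# Assumed peak memory bandwidths in GB/s (datasheet values, not measured).
BW_A100_40GB = 1555.0
BW_A100_80GB = 2039.0

REFERENCE_RATE = 4.05e9  # atom-time steps/s, 8xA100 (80GB), from the table
MEASURED_RATE = 2.3e9    # atom-time steps/s, my run on 8xA100 (40GB)


def bandwidth_scaled_rate(reference_rate, bw_reference, bw_target):
    """Scale a reference rate linearly with memory bandwidth."""
    return reference_rate * bw_target / bw_reference


expected_40gb = bandwidth_scaled_rate(REFERENCE_RATE, BW_A100_80GB, BW_A100_40GB)
fraction_reached = MEASURED_RATE / expected_40gb

print(f"bandwidth-scaled expectation for 40GB: {expected_40gb:.2e} atom-time steps/s")
print(f"measured / expected: {fraction_reached:.2f}")
```

Under these assumptions the 40GB model should still reach roughly 3.1e9, so the bandwidth difference alone would not seem to explain my 2.3e9.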
I am using one MPI process per GPU, with each process bound to the correct GPU, CPU cores, and network device.
OpenMPI and UCX are built with CUDA support, and I’m sure that GPU-aware MPI is working. However, LAMMPS complains: “WARNING: Turning off GPU-aware MPI since it is not detected, use '-pk kokkos gpu/aware on' to override (src/KOKKOS/kokkos.cpp:291)”
If I set “-pk kokkos gpu/aware off”, performance drops drastically.
The configure line for LAMMPS was:
cmake -C …/cmake/presets/basic.cmake -D PKG_ADIOS=off -D PKG_GPU=on -D PKG_KOKKOS=on -D GPU_API=cuda -D GPU_ARCH=sm_80 -D Kokkos_ARCH_ZEN2=yes -D Kokkos_ARCH_AMPERE80=yes -D Kokkos_ENABLE_CUDA=yes -D Kokkos_ENABLE_OPENMP=yes …/cmake
Is there something I am doing wrong?
Greetings,
Martin