LAMMPS with AMD MI300X GPUs

I am trying to install LAMMPS on a node of an HPC cluster with HIP for running on AMD MI300X GPUs. I can build the code (single node, with HIP) with the Clang compiler and MPI enabled. The code runs fine in serial and seems quick; however, it only runs on one GPU, even with -sf gpu -pk gpu 8 and the relevant parameter(s) in the input file. If I run with MPI I can utilise more than one GPU, but then the code slows down greatly. Has anyone seen a similar issue?
I don’t personally like the Clang compiler, but I understand it is required for HIP. Is there a way to build LAMMPS with, say, the g++ or mpicc compiler (and the Clang compiler only for the files that need it)?
Many thanks

@brianc87 As I already wrote in response to the email that you sent to the LAMMPS developers directly, we need to know exactly which commands you used to configure and build LAMMPS with CMake, or which commands and makefile settings you used to build with the legacy build method. We also need to know exactly which version of LAMMPS you are trying to compile.

It would also be very helpful to see the input and system that you are trying to run with GPU acceleration and what performance you get.

Some general information:

  • there are two ways to include GPU acceleration in LAMMPS, the GPU package and the KOKKOS package. Both support (some) AMD GPUs.
  • you cannot use more than one GPU per MPI task, so using all 8 GPUs of a node requires at least 8 MPI tasks (see the example launch commands after this list)
  • with the GPU package it can be helpful to use multiple MPI tasks for each GPU, since it accelerates only the pair style, while for KOKKOS ideally all computations happen on the GPU and thus there are rarely benefits from oversubscribing GPUs, if at all.
  • how much GPU acceleration you get strongly depends on the size of the system, the choice of cutoff (if available), and the general computational effort of the selected pair style. You may need several thousand atoms per GPU to get good GPU utilization, while with CPU-only code you often scale out with only a few hundred atoms per CPU.
  • for the GPU package, it is quite possible to compile LAMMPS mostly with GCC and compile only the GPU support library with hipcc
  • for KOKKOS, there is a CMake preset file in cmake/presets/kokkos-hip.cmake that would need to be edited for your GPU architecture according to 3.7. Packages with extra build options — LAMMPS documentation
  • The GPU package can be compiled for all single precision, mixed precision, or all double precision. The KOKKOS package currently only supports all double precision.
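
To illustrate the points about MPI tasks and GPUs above, here are example launch commands. This is only a sketch: the task and GPU counts assume a single node with 8 GPUs, and in.lj stands in for your actual input file.

GPU package, 16 MPI tasks sharing 8 GPUs (i.e. two tasks per GPU):
mpirun -np 16 ./lmp -sf gpu -pk gpu 8 -in in.lj

KOKKOS package, one MPI task per GPU on all 8 GPUs:
mpirun -np 8 ./lmp -k on g 8 -sf kk -in in.lj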

As a point of reference: my desktop has an AMD CPU with an embedded GPU (AMD Ryzen 7 7840HS w/ Radeon 780M Graphics). I use CMake exclusively to configure and compile LAMMPS. For the GPU package, it is quite possible to compile with GCC and then only use hipcc for the GPU library. This is done by using the gcc preset together with -D PKG_GPU=on -D GPU_API=hip -D GPU_PREC=mixed.
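
A minimal sketch of that configure command, assuming a build directory created inside the LAMMPS source tree (on an MI300X you will likely also need to tell the GPU library which HIP architecture to build for, I believe via -D HIP_ARCH=..., but please check the GPU package build documentation):

cmake -C ../cmake/presets/gcc.cmake -D PKG_GPU=on -D GPU_API=hip -D GPU_PREC=mixed ../cmake
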
My CMake summary for that is:

-- <<< Build configuration >>>
   LAMMPS Version:   20250204 patch_4Feb2025-3-g04cad88b55-modified
   Operating System: Linux Fedora 41
   CMake Version:    3.30.7
   Build type:       RelWithDebInfo
   Install path:     /home/akohlmey/.local
   Generator:        Unix Makefiles using /usr/bin/gmake
-- Enabled packages: AMOEBA;ASPHERE;AWPMD;BOCS;BODY;BPM;BROWNIAN;CG-DNA;CG-SPICA;CLASS2;COLLOID;COLVARS;COMPRESS;CORESHELL;DIELECTRIC;DIFFRACTION;DIPOLE;DPD-BASIC;DPD-MESO;DPD-REACT;DPD-SMOOTH;DRUDE;EFF;ELECTRODE;EXTRA-COMMAND;EXTRA-COMPUTE;EXTRA-DUMP;EXTRA-FIX;EXTRA-MOLECULE;EXTRA-PAIR;FEP;GPU;GRANULAR;H5MD;INTEL;INTERLAYER;KIM;KOKKOS;KSPACE;LATBOLTZ;LEPTON;MACHDYN;MANIFOLD;MANYBODY;MC;MDI;MEAM;MESONT;MGPT;MISC;ML-HDNNP;ML-IAP;ML-PACE;ML-POD;ML-QUIP;ML-RANN;ML-SNAP;ML-UF3;MOFFF;MOLECULE;MOLFILE;NETCDF;OPENMP;OPT;ORIENT;PERI;PHONON;PLUGIN;PLUMED;POEMS;PTM;PYTHON;QEQ;QTB;REACTION;REAXFF;REPLICA;RHEO;RIGID;SHOCK;SMTBQ;SPH;SPIN;SRD;TALLY;UEF;VORONOI;YAFF
-- <<< Compilers and Flags: >>>
-- C++ Compiler:     /usr/lib64/ccache/g++
      Type:          GNU
      Version:       14.2.1
      C++ Standard:  17
      C++ Flags:     -g -O2 -DNDEBUG -Wvla
      Defines:       LAMMPS_SMALLBIG;LAMMPS_MEMALIGN=64;LAMMPS_OMP_COMPAT=4;LAMMPS_JPEG;LAMMPS_PNG;LAMMPS_GZIP;LAMMPS_FFMPEG;FFT_KISS;LMP_PYTHON;MLIAP_PYTHON;LMP_MDI;LMP_HAS_NETCDF;LMP_HAS_PNETCDF;NC_64BIT_DATA=0x0020;EIGEN_NO_CUDA;LMP_KIM_CURL;LAMMPS_ZSTD;LAMMPS_CURL;LAMMPS_ASYNC_IMD;LMP_OPENMP;$<BUILD_INTERFACE:LMP_KOKKOS>;FFT_KOKKOS_KISS;LMP_INTEL;LMP_INTEL_USELRT;LMP_GPU;LMP_PLUGIN
-- Fortran Compiler: /usr/bin/gfortran
      Type:          GNU
      Version:       14.2.1
      Fortran Flags: -g -O2 -DNDEBUG -std=f2003
-- C compiler:       /usr/lib64/ccache/gcc
      Type:          GNU
      Version:       14.2.1
      C Flags:       -g -O2 -DNDEBUG
-- <<< Linker flags: >>>
-- Executable name:  lmp
-- Linker options:   -fuse-ld=mold
-- Shared library flags:    
-- <<< MPI flags >>>
-- MPI_defines:      MPICH_SKIP_MPICXX;OMPI_SKIP_MPICXX;_MPICC_H
-- MPI includes:     /usr/include/mpich-x86_64
-- MPI libraries:    /usr/lib64/mpich/lib/libmpicxx.so;/usr/lib64/mpich/lib/libmpi.so;
-- <<< GPU package settings >>>
-- GPU API:                  HIP
-- HIP platform:     amd
-- HIP architecture: gfx1103
-- HIP GPU sorting: off
-- GPU precision:            MIXED
-- Kokkos Devices: OPENMP;SERIAL
-- Kokkos Architecture: AMDAVX
-- <<< FFT settings >>>
-- Primary FFT lib:  KISS
-- Using double precision FFTs
-- Using threaded FFTs
-- Using builtin distributed FFT algorithms
-- Kokkos FFT: KISS
-- <<< Building Tools >>>
-- <<< Building LAMMPS-GUI >>>
-- Linking LAMMPS library at compile time
-- <<< Building WHAM >>>
-- <<< Building Unit Tests >>>
-- Configuring done (16.2s)
-- Generating done (1.6s)

But I can also compile for KOKKOS/HIP using the aforementioned preset, edited for my specific CPU/GPU to select -D Kokkos_ARCH_AMDAVX=on -D Kokkos_ARCH_AMD_GFX1103=on.
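
As a rough sketch, assuming the preset has already been edited to enable the architecture settings mentioned above (on an MI300X that would be a different Kokkos_ARCH flag, I believe Kokkos_ARCH_AMD_GFX942, but please verify against the KOKKOS build documentation), the configure step from a build directory inside the source tree is simply:

cmake -C ../cmake/presets/kokkos-hip.cmake ../cmake

Here the resulting CMake summary is: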

-- <<< Build configuration >>>
   LAMMPS Version:   20250204 patch_4Feb2025-3-g04cad88b55-modified
   Operating System: Linux Fedora 41
   CMake Version:    3.30.7
   Build type:       RelWithDebInfo
   Install path:     /home/akohlmey/.local
   Generator:        Ninja using /usr/bin/ninja-build
-- Enabled packages: AMOEBA;ASPHERE;ATC;AWPMD;BOCS;BODY;BPM;BROWNIAN;CG-DNA;CG-SPICA;CLASS2;COLLOID;COLVARS;COMPRESS;CORESHELL;DIELECTRIC;DIFFRACTION;DIPOLE;DPD-BASIC;DPD-MESO;DPD-REACT;DPD-SMOOTH;DRUDE;EFF;ELECTRODE;EXTRA-COMMAND;EXTRA-COMPUTE;EXTRA-DUMP;EXTRA-FIX;EXTRA-MOLECULE;EXTRA-PAIR;FEP;GPU;GRANULAR;H5MD;INTERLAYER;KIM;KOKKOS;KSPACE;LATBOLTZ;LEPTON;MACHDYN;MANIFOLD;MANYBODY;MC;MDI;MEAM;MESONT;MGPT;MISC;ML-IAP;ML-PACE;ML-POD;ML-RANN;ML-SNAP;ML-UF3;MOFFF;MOLECULE;NETCDF;OPENMP;OPT;ORIENT;PERI;PHONON;PLUGIN;POEMS;PTM;PYTHON;QEQ;QTB;REACTION;REAXFF;REPLICA;RHEO;RIGID;SHOCK;SMTBQ;SPH;SPIN;SRD;TALLY;UEF;VORONOI;YAFF
-- <<< Compilers and Flags: >>>
-- C++ Compiler:     /usr/bin/hipcc
      Type:          Clang
      Version:       18.1.8
      C++ Standard:  17
      C++ Flags:    -I/usr/lib/clang/17/include -Wall -Wextra -g -O2 -DNDEBUG -Wno-unused-parameter
      Defines:       LAMMPS_SMALLBIG;LAMMPS_MEMALIGN=64;LAMMPS_OMP_COMPAT=4;LAMMPS_JPEG;LAMMPS_PNG;LAMMPS_GZIP;LAMMPS_FFMPEG;FFT_FFTW3;FFT_FFTW_THREADS;LMP_PYTHON;LMP_MDI;LMP_HAS_NETCDF;LMP_HAS_PNETCDF;NC_64BIT_DATA=0x0020;EIGEN_NO_CUDA;LMP_KIM_CURL;LAMMPS_ZSTD;LAMMPS_CURL;LMP_OPENMP;$<BUILD_INTERFACE:LMP_KOKKOS>;FFT_KOKKOS_HIPFFT;LMP_GPU;LMP_PLUGIN
-- Fortran Compiler: /usr/bin/gfortran
      Type:          GNU
      Version:       14.2.1
      Fortran Flags: -Wall -Wextra -g -O2 -DNDEBUG -std=f2003
-- C compiler:       /usr/lib64/ccache/gcc
      Type:          GNU
      Version:       14.2.1
      C Flags:       -Wall -Wextra -g -O2 -DNDEBUG -Wno-unused-parameter
-- <<< Linker flags: >>>
-- Executable name:  lmp
-- Linker options:   -fuse-ld=mold
-- Shared library flags:    
-- <<< MPI flags >>>
-- MPI_defines:      MPICH_SKIP_MPICXX;OMPI_SKIP_MPICXX;_MPICC_H
-- MPI includes:     /usr/include/mpich-x86_64
-- MPI libraries:    /usr/lib64/mpich/lib/libmpicxx.so;/usr/lib64/mpich/lib/libmpi.so;
-- <<< GPU package settings >>>
-- GPU API:                  HIP
-- HIP platform:     amd
-- HIP architecture: gfx1103
-- HIP GPU sorting: on
-- GPU precision:            MIXED
-- Kokkos Devices: HIP;OPENMP;SERIAL
-- Kokkos Architecture: AMDAVX;AMD_GFX1103;NAVI1103
-- <<< FFT settings >>>
-- Primary FFT lib:  FFTW3
-- Using double precision FFTs
-- Using threaded FFTs
-- Using builtin distributed FFT algorithms
-- Kokkos FFT: HIPFFT
-- <<< Building Tools >>>
-- <<< Building LAMMPS-GUI >>>
-- Loading LAMMPS library as plugin at run time
-- <<< Building WHAM >>>
-- <<< Building Unit Tests >>>
-- Configuring done (10.4s)
-- Generating done (0.3s)

Finally, some benchmark numbers. Since I have an embedded GPU, the GPU acceleration is limited. When I run on a single CPU core with mpirun -np 1 ./lmp -in ../bench/in.lj -v x 4 -v y 4 -v z 4, the simulation time is

Loop time of 57.7539 on 1 procs for 100 steps with 2048000 atoms

With all 8 CPU cores mpirun -np 8 ./lmp -in ../bench/in.lj -v x 4 -v y 4 -v z 4 it comes down to:

Loop time of 10.3253 on 8 procs for 100 steps with 2048000 atoms

With one CPU and one GPU mpirun -np 1 ./lmp -in ../bench/in.lj -v x 4 -v y 4 -v z 4 -sf gpu I get:

Loop time of 6.79718 on 1 procs for 100 steps with 2048000 atoms

Using the KOKKOS package with the corresponding command mpirun -np 1 ./lmp -in ../bench/in.lj -v x 4 -v y 4 -v z 4 -kokkos on g 1 -sf kk I get:

Loop time of 8.54916 on 1 procs for 100 steps with 2048000 atoms

In this particular case, the GPU package is a bit faster because it was compiled in mixed precision and thus trades off a bit of accuracy for faster computation, while KOKKOS runs in double precision.