Enabling SIMD in AMD CPUs

Dear LAMMPS users,
I have installed the 23 Jun 2022 version of LAMMPS on two different machines. The first one has an Intel CPU and the second one has an AMD CPU.
Running my simulation with the same number of threads, the Intel machine is four times faster. I suspect this is due to SIMD instructions in the INTEL package.
I was wondering if there is any way to reach the same speed-up on the machine with the AMD CPU?

Any help is highly appreciated in advance.
Mahdi

These are the specifications of the CPUs and how I install LAMMPS on each machine:

Machine-1
CPU: Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz
LAMMPS version: 23 Jun 2022
Installation:

install.sh {
#!/bin/sh
rm -rf build-intel-most5
mkdir build-intel-most5
cd build-intel-most5
cmake -C ../cmake/presets/intel.cmake -C ../cmake/presets/most4.cmake -D BUILD_MPI=yes -D FFT=MKL -D FFT_SINGLE=yes -D INTEL_LRT_MODE=c++11 ../cmake
cmake --build . --parallel
}
most4.cmake {

# preset that turns on a wide range of packages, some of which require
# external libraries. Compared to all_on.cmake some more unusual packages
# are removed. The resulting binary should be able to run most inputs.

set(ALL_PACKAGES
CORESHELL
INTEL
GPU
ASPHERE
BODY
BROWNIAN
EXTRA-PAIR
DIELECTRIC
DIPOLE
DRUDE
FEP
GRANULAR
INTERLAYER
KSPACE
MANYBODY
MISC
MOLECULE
QEQ
REACTION
REAXFF
REPLICA
RIGID
OPENMP
EXTRA-FIX
EXTRA-DUMP)

foreach(PKG ${ALL_PACKAGES})
  set(PKG_${PKG} ON CACHE BOOL "" FORCE)
endforeach()

set(BUILD_TOOLS ON CACHE BOOL "" FORCE)
}

intel.cmake {

# preset that will enable Intel compilers with support for MPI and OpenMP (on Linux boxes)

set(CMAKE_CXX_COMPILER "icpc" CACHE STRING "" FORCE)
set(CMAKE_C_COMPILER "icc" CACHE STRING "" FORCE)
set(CMAKE_Fortran_COMPILER "ifort" CACHE STRING "" FORCE)
set(CMAKE_CXX_FLAGS_DEBUG "-Wall -Wextra -g" CACHE STRING "" FORCE)
set(CMAKE_CXX_FLAGS_RELWITHDEBINFO "-Wall -Wextra -g -O2 -DNDEBUG" CACHE STRING "" FORCE)
set(CMAKE_CXX_FLAGS_RELEASE "-O3 -DNDEBUG" CACHE STRING "" FORCE)
set(CMAKE_Fortran_FLAGS_DEBUG "-Wall -Wextra -g" CACHE STRING "" FORCE)
set(CMAKE_Fortran_FLAGS_RELWITHDEBINFO "-Wall -Wextra -g -O2 -DNDEBUG" CACHE STRING "" FORCE)
set(CMAKE_Fortran_FLAGS_RELEASE "-O3 -DNDEBUG" CACHE STRING "" FORCE)
set(CMAKE_C_FLAGS_DEBUG "-Wall -Wextra -g" CACHE STRING "" FORCE)
set(CMAKE_C_FLAGS_RELWITHDEBINFO "-Wall -Wextra -g -O2 -DNDEBUG" CACHE STRING "" FORCE)
set(CMAKE_C_FLAGS_RELEASE "-O3 -DNDEBUG" CACHE STRING "" FORCE)

set(MPI_CXX "icpc" CACHE STRING "" FORCE)
set(MPI_CXX_COMPILER "mpicxx" CACHE STRING "" FORCE)

unset(HAVE_OMP_H_INCLUDE CACHE)
set(OpenMP_C "icc" CACHE STRING "" FORCE)
set(OpenMP_C_FLAGS "-qopenmp" CACHE STRING "" FORCE)
set(OpenMP_C_LIB_NAMES "omp" CACHE STRING "" FORCE)
set(OpenMP_CXX "icpc" CACHE STRING "" FORCE)
set(OpenMP_CXX_FLAGS "-qopenmp" CACHE STRING "" FORCE)
set(OpenMP_CXX_LIB_NAMES "omp" CACHE STRING "" FORCE)
set(OpenMP_Fortran_FLAGS "-qopenmp" CACHE STRING "" FORCE)
set(OpenMP_omp_LIBRARY "libiomp5.so" CACHE PATH "" FORCE)
}

Machine-2
CPU: AMD EPYC 7702
LAMMPS version: 22 Jun 2023

install.sh {
#!/bin/sh
module load oneapi
module load intelmpi/2021.6
module load mkl/2019.6

rm -rf build-most9
mkdir build-most9
cd build-most9
cmake -C ../cmake/presets/kokkos-amd.cmake -C ../cmake/presets/most4.cmake -D BUILD_MPI=yes -D FFT_SINGLE=yes ../cmake
cmake --build . --parallel
}

kokkos-amd.cmake {
set(PKG_KOKKOS ON CACHE BOOL "" FORCE)
set(Kokkos_ARCH_ZEN2 ON CACHE BOOL "" FORCE)
set(BUILD_OMP ON CACHE BOOL "" FORCE)

# hide deprecation warnings temporarily for stable release
set(Kokkos_ENABLE_DEPRECATION_WARNINGS OFF CACHE BOOL "" FORCE)

# Enable OpenMP execution space
set(Kokkos_ENABLE_OPENMP ON CACHE BOOL "" FORCE)
}

most4.cmake {

# preset that turns on a wide range of packages, some of which require
# external libraries. Compared to all_on.cmake some more unusual packages
# are removed. The resulting binary should be able to run most inputs.

set(ALL_PACKAGES
CORESHELL
INTEL
ASPHERE
BODY
BROWNIAN
EXTRA-PAIR
DIELECTRIC
DIPOLE
DRUDE
FEP
GRANULAR
INTERLAYER
KSPACE
MANYBODY
MISC
MOLECULE
QEQ
REACTION
REAXFF
REPLICA
RIGID
OPENMP
EXTRA-FIX
EXTRA-DUMP)

foreach(PKG ${ALL_PACKAGES})
  set(PKG_${PKG} ON CACHE BOOL "" FORCE)
endforeach()

set(BUILD_TOOLS ON CACHE BOOL "" FORCE)

}

It is very difficult to provide any useful assistance with the incomplete, badly formatted, and overall inconsistent information provided.

In order to understand what is going on (it is a very bad idea to speculate without having any suitable data), you first need to have a “baseline” and have data that is easily reproducible and comparable. Then you can incrementally make changes (and consistently so on both systems) in order to determine the origin of the unexpected performance anomaly (if there is any).

To get a baseline, you can download the pre-compiled static executable from the LAMMPS Static Linux Binary Download Repository.
With that you can run the benchmark inputs in the bench folder with standard settings, e.g.
lmp -in in.lj, lmp -in in.rhodo, or lmp -in in.eam. That will give you base timings for serial execution with the same executable on both machines.
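As an illustration, this baseline step could look like the sketch below. The script name is made up; it assumes the downloaded static executable has been renamed to lmp, made executable, and placed next to the script in the bench folder of the LAMMPS source tree.

baseline.sh {
#!/bin/sh
# serial baseline runs with the pre-compiled static binary; run this
# script unchanged on both machines and compare the reported timings
./lmp -in in.lj
./lmp -in in.eam
./lmp -in in.rhodo
}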

Then you should compile LAMMPS on both machines with the exact same minimal settings, e.g. configure it with the gcc (default) compiler, only include the “basic” preset, and -DCMAKE_BUILD_TYPE=Release. With this binary, repeat the same serial runs as with the static executable and compare how the performance changes (if at all).
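A minimal sketch of such a build is shown below. The script and build directory names are arbitrary; it assumes the default GNU compilers are installed and that the commands are run from the top of the LAMMPS source tree.

build-baseline.sh {
#!/bin/sh
# identical minimal build on both machines: default (gcc) compilers,
# only the "basic" package preset, optimized Release build
rm -rf build-baseline
mkdir build-baseline
cd build-baseline
cmake -C ../cmake/presets/basic.cmake -D CMAKE_BUILD_TYPE=Release ../cmake
cmake --build . --parallel
}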

Then you should compile LAMMPS with the Intel compiler on both machines and compare the (serial) results again.
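For example (again a sketch with made-up names; it assumes the Intel compilers and their runtime libraries are available in the environment on both machines), this step could reuse the intel.cmake compiler preset on top of the same minimal package selection:

build-intelcc.sh {
#!/bin/sh
# same minimal package set as before, but built with the Intel compilers
rm -rf build-intelcc
mkdir build-intelcc
cd build-intelcc
cmake -C ../cmake/presets/basic.cmake -C ../cmake/presets/intel.cmake -D CMAKE_BUILD_TYPE=Release ../cmake
cmake --build . --parallel
}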

Then you should add the INTEL package on both machines (and nothing else) and run the bench examples with only -sf intel added to the command line, and record the performance again and compare.
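A possible sketch for this step (names are made up; it only adds PKG_INTEL on top of the previous configuration and runs one benchmark with the suffix):

build-intelpkg.sh {
#!/bin/sh
# previous Intel-compiler build plus the INTEL package, nothing else
rm -rf build-intelpkg
mkdir build-intelpkg
cd build-intelpkg
cmake -C ../cmake/presets/basic.cmake -C ../cmake/presets/intel.cmake -D PKG_INTEL=yes -D CMAKE_BUILD_TYPE=Release ../cmake
cmake --build . --parallel
# run a benchmark input with only the suffix added to the command line
./lmp -in ../bench/in.lj -sf intel
}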

This is all serial performance. Once you add threading to the picture, you have to watch out for whether and how threads are bound to processor cores and whether any of the hardware has hyper-threading enabled. Some MPI libraries may be configured to bind processes (and thus their threads) to a specific core to optimize for MPI-only execution (which is the most common use case). So you may have to use mpirun -np 1 lmp (if MPI is enabled in the executable) with additional settings. E.g. for OpenMPI with 2 threads, I generally use something like mpirun -np 1 -x OMP_NUM_THREADS=2 --bind-to socket lmp -in in.input -sf intel. You have to consult your MPI library documentation to figure out the details that apply to your MPI variant.
With threads you can then do a scaling test with 1, 2, 4, 8 threads. LAMMPS, however, generally performs better on dense bulk systems with MPI parallelization than with threads, and the thread overhead grows with the number of threads, so it is usually best to use only a moderate number of threads, if at all.
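A simple scaling sketch with OpenMPI (adjust the binding and environment options to your MPI library; the executable and input paths follow the build sketch above) could be:

scaling.sh {
#!/bin/sh
# thread scaling test with a single MPI rank bound to one socket;
# OpenMPI command-line syntax is assumed here
for t in 1 2 4 8
do
  mpirun -np 1 -x OMP_NUM_THREADS=$t --bind-to socket ./lmp -in ../bench/in.lj -sf intel
done
}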

If you have collected all the information as outlined and can thus provide a more comprehensive and consistent picture of what the performance of those two machines looks like, we can start discussing which settings/flags can be used or changed in addition to improve the performance.

As Axel has said, it’s hard to tell the root cause of a performance difference without more information. Here are some comments from my own experience:

  1. If everything you’re going to use is supported by the INTEL package, you may want to use the same compile procedure (not the same binary) on the AMD CPU as on the Intel CPU. It will not perform as well as on an Intel CPU, but could still be faster than other build procedures.
  2. When compiling on an AMD CPU, you may want to replace the default -xHost compile option with -xCORE-AVX or -xCORE-AVX2 (see the sketch after this list). Sometimes -xHost cannot correctly identify the instruction set supported by an AMD CPU, resulting in SIMD not being enabled or in a segfault.
  3. MKL is known to perform worse on AMD CPUs, and depending on the version you’re using the difference can be huge (you can search the Internet for details). It may or may not be an issue, depending on what you’re doing with LAMMPS.
  4. I believe the AMD EPYC 7702 does not support AVX-512, and I’m not sure that AVX2 would work properly (I’ve heard that Intel and AMD have slightly different implementations of AVX2). This alone could result in a large difference in performance. I won’t be surprised if there is still a 2x difference with the same number of cores, even if everything is set up as well as possible on both machines.
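Regarding point 2, one possible way to try this on the AMD machine is sketched below. It assumes a LAMMPS version whose CMake option CMAKE_TUNE_FLAGS controls the machine-specific optimization flags (recent versions do), and the directory name follows the build sketches above.

tune-avx2.sh {
#!/bin/sh
# reconfigure the INTEL-package build on the AMD machine, replacing the
# default Intel tuning flags (which include -xHost) with an explicit
# instruction-set flag; -march=core-avx2 is a possible alternative
# spelling if the -x form is not accepted on AMD hardware
# note: setting CMAKE_TUNE_FLAGS replaces the whole default tuning flag
# string, so you may want to add back the other defaults reported by your
# original configure run
cd build-intelpkg
cmake -D CMAKE_TUNE_FLAGS="-xCORE-AVX2" .
cmake --build . --parallel
}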