USER-INTEL underperforming on Cascade Lake

I have gotten significant speed-ups on Skylake CPUs with USER-INTEL. However, I'm finding that performance on Cascade Lake CPUs is about 40% slower for a Lennard-Jones fluid and a large solvated peptide system. These are single-node simulations on different Linux clusters running RHEL 7.8 and Slurm.

The host processor on the login node is Broadwell. Here is my CMake build (LAMMPS adds -xHost -qopenmp -restrict):

wget https://github.com/lammps/lammps/archive/patch_4Feb2020.tar.gz

module purge
module load intel/18.0/64/18.0.3.222
module load intel-mpi/intel/2018.3/64

cmake3 -D CMAKE_INSTALL_PREFIX=$HOME/.local -D LAMMPS_MACHINE=perseus_uintel -D ENABLE_TESTING=yes \
  -D BUILD_MPI=yes -D BUILD_OMP=yes -D CMAKE_CXX_COMPILER=icpc -D CMAKE_BUILD_TYPE=Release \
  -D CMAKE_CXX_FLAGS_RELEASE="-Ofast -axCORE-AVX512 -DNDEBUG" \
  -D PKG_USER-OMP=yes -D PKG_MOLECULE=yes -D PKG_RIGID=yes -D PKG_MISC=yes \
  -D PKG_KSPACE=yes -D FFT=MKL -D FFT_SINGLE=yes \
  -D PKG_USER-INTEL=yes -D INTEL_ARCH=cpu -D INTEL_LRT_MODE=threads ../cmake

make -j 10
make test
make install
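
One sanity check, for what it's worth (the binary name follows from LAMMPS_MACHINE above; this is only a heuristic), is to confirm that -axCORE-AVX512 actually emitted an AVX-512 code path by counting zmm-register instructions in the disassembly:

$ objdump -d $HOME/.local/bin/lmp_perseus_uintel | grep -c zmm

A non-zero count means AVX-512 code is present in the binary; whether the dispatcher selects that path at runtime is a separate question.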

Here is a different build using make:

wget https://github.com/lammps/lammps/archive/stable_3Mar2020.tar.gz
module load intel/19.1/64/19.1.1.217
module load intel-mpi/intel/2019.7/64

SHELL = /bin/sh

CC = mpicxx -std=c++11
OPTFLAGS = -xCORE-AVX512 -O3 -fp-model fast=2 -no-prec-div -qoverride-limits \
           -qopt-zmm-usage=high
CCFLAGS = -qopenmp -qno-offload -ansi-alias -restrict \
          -DLMP_INTEL_USELRT -DLMP_USE_MKL_RNG $(OPTFLAGS) \
          -I$(MKLROOT)/include
SHFLAGS = -fPIC
DEPFLAGS = -M

LINK = mpicxx -std=c++11
LINKFLAGS = -qopenmp $(OPTFLAGS) -L$(MKLROOT)/lib/intel64/
LIB = -ltbbmalloc -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
SIZE = size

ARCHIVE = ar
ARFLAGS = -rc
SHLIBFLAGS = -shared
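
For reference, a build with this makefile would go roughly as follows (assuming the settings above are saved as src/MAKE/MINE/Makefile.intel_cpu_intelmpi; the package list mirrors the CMake build):

$ cd lammps-stable_3Mar2020/src
$ make yes-user-intel yes-user-omp yes-molecule yes-rigid yes-misc yes-kspace
$ make -j 10 intel_cpu_intelmpi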

Here is a sample Slurm script:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --mem=10G
#SBATCH --time=00:02:00

module load intel/19.1/64/19.1.1.217
module load intel-mpi/intel/2019.7/64

srun $HOME/.local/bin/lmp_cascade -sf omp -sf intel -in in.melt

Any thoughts on why the newer-generation Intel processors are underperforming here?

Jon

CCing Mike Brown at Intel who may have ideas or suggestions.

Steve

Hi Jon,

I haven't seen any issues like this (assuming the core-count x frequency ratio for the two processor SKUs is comparable; within a generation there are many different SKUs with very different performance).

For using multiple suffix styles, you might want to do "-sf hybrid intel omp" rather than stacking two separate -sf switches as in your job script, if the intent is to use both.
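
For example (a sketch only; the thread count passed via -pk is a placeholder you would tune for your node layout):

$ srun $HOME/.local/bin/lmp_cascade -sf hybrid intel omp -pk intel 0 omp 2 -in in.melt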

Comparing the timing breakdown at the end of the log files can help. If there is a big difference between the min and max timings for compute routines such as "Pair" (in a Cascade Lake log), there might be a CPU core-affinity issue. If there is a big increase in "Comm" time from Skylake to Cascade Lake, there might be issues with the MPI setup.
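
As a sketch of how to rule out an affinity problem (the exact pinning options are an assumption for your Slurm and Intel MPI versions), you can make the binding explicit and have Intel MPI report its pinning map:

$ export I_MPI_DEBUG=4    # Intel MPI prints the rank-to-core pinning at startup
$ srun --cpu-bind=cores $HOME/.local/bin/lmp_cascade -sf intel -in in.melt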

If you add the MANYBODY package to the build and run from USER-INTEL/TEST:

$ ../../lmp_intel_cpu_intelmpi -in in.intel.water -log none -sf intel

You should see output to the screen verifying an AVX-512 enabled build.

I would definitely suggest checking whether other software behaves as expected on the two kinds of hardware, and/or checking the processor frequency.
We have repeatedly had issues on our local HPC cluster with newer CPU generations on CentOS 7.x because the power-management kernel module would not let the frequency rise above the minimum. It took some tweaking of BIOS settings and kernel command-line parameters to get the desired behavior (power saving when idle, and maximum turbo-boost performance within the allowed thermal limits).
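
A minimal check, assuming the standard Linux cpufreq sysfs interface is available, would be something like:

$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
$ grep MHz /proc/cpuinfo | sort | uniq -c    # compare against the SKU's rated base/turbo frequency
$ cpupower frequency-info                    # if the cpupower utility is installed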

axel.

Thanks Mike and Axel. The code is now giving the timings that one would expect on Cascade Lake versus Skylake with comparable specs. I won’t state here what was done for fear of confusing others.

Mike, thank you for your work on the package. It’s a great benefit to many.

Jon