LAMMPS/Kokkos: Performance discrepancy between src-built binary and NGC container

Hello,

According to NVIDIA, the nominal performance of the HNS benchmark on 8 x A100-SXM4 GPUs is 2.44E+07 ATOM-Time Steps/s. (https://developer.nvidia.com/hpc-application-performance)

I was able to replicate this result using NGC containers, as follows:

mpirun \
  -np 8 \
  singularity \
    run \
      --nv \
      ./lammps_4May2022.sif \
      /usr/local/lammps/sm80/bin/lmp \
        -k on g 8 \
        -sf kk \
        -pk kokkos cuda/aware on neigh half comm device neigh/qeq full newton on \
        -var x 16 \
        -var y 16 \
        -var z 16 \
        -var steps 1000 \
        -nocite \
        -in in.reaxc.hns \
        -log reacx-ngc.log

This resulted in 2.66E+07 ATOM-Time Steps/s; the small difference can likely be attributed to the use of a different host CPU.

I then tried to replicate this result with a LAMMPS package built via Spack (0.19.1):

spack install lammps@20220504 %[email protected] target=zen3 +kspace +manybody +molecule +opt +openmp-package +openmp +reaxff +kokkos ^[email protected] +aggressive_vectorization +cuda cuda_arch=80 +cuda_lambda +cuda_ldg_intrinsic +cuda_uvm +wrapper

Here, Kokkos 3.7 was built with the CUDA backend and additional features such as UVM.
Other dependencies such as OpenMPI and UCX were also built as CUDA-aware libraries (+cuda):

^[email protected]%[email protected]~atomics+cuda~cxx~cxx_exceptions~gpfs~internal-hwloc~java+legacylaunchers~lustre~memchecker+romio+rsh~singularity+static+vt+wrapper-rpath build_system=autotools cuda_arch=none fabrics=ucx schedulers=none arch=linux-centos7-zen3

^[email protected]%[email protected]~assertions~backtrace_detail~cma+cuda+dc~debug~dm+examples~gdrcopy~ib_hw_tm~java+knem~logging~mlx5_dv+openmp+optimizations~parameter_checking+pic+rc~rdmacm~rocm+thread_multiple~ucg~ud~verbs~vfs~xpmem build_system=autotools cuda_arch=none libs=shared,static opt=3 patches=32fce32 simd=auto arch=linux-centos7-zen3
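(The concretized variants above can be double-checked against the installed packages with Spack itself; one typical query, though not the only one, is:

spack find -vd lammps@20220504

where -v lists variants and -d includes dependencies.)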

The command to run the benchmark was the same as in the NGC case.

mpirun \
  -np 8 \
  lmp \
    -k on g 8 \
    -sf kk \
    -pk kokkos cuda/aware on neigh half comm device neigh/qeq full newton on \
    -var x 16 \
    -var y 16 \
    -var z 16 \
    -var steps 1000 \
    -nocite \
    -in in.reaxc.hns \
    -log reacx-spack.log

This time, I obtained only 1.74E+07 ATOM-Time Steps/s.

Since I am not yet allowed to upload files, here are links to the two output files:
reacx-ngc.log
reacx-spack.log

Kokkos is open source, and I don’t think NVIDIA has made additional optimizations that would create such a large performance gap. Any insights and suggestions are much appreciated.

Regards.

The Pair timings appear to be the same; the difference comes from Comm and Modify. I suspect the issue is related to fix qeq. Can you try redoing the benchmarks without it? You’ll need to add checkqeq no to the pair_style reaxff line.

You should be able to extract the compiler and OpenMPI library NVIDIA used by running lmp -h.
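For example, reusing the container and binary paths from your first command:

singularity run --nv ./lammps_4May2022.sif /usr/local/lammps/sm80/bin/lmp -h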

In general UVM “on” is slower than “off”, though I doubt that would explain the whole performance difference.
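If you want to test that, it should be enough to flip the corresponding variant in your original Spack spec (a sketch based on your command above; ~cuda_uvm disables the variant you enabled with +cuda_uvm):

spack install lammps@20220504 %[email protected] target=zen3 +kspace +manybody +molecule +opt +openmp-package +openmp +reaxff +kokkos ^[email protected] +aggressive_vectorization +cuda cuda_arch=80 +cuda_lambda +cuda_ldg_intrinsic ~cuda_uvm +wrapper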

Thanks for pointing out the difference in MPI timing.

The input file, per your suggestion, is now:

pair_style        reax/c NULL checkqeq no

However, it doesn’t have much impact on performance in either case.
Since I use the same input in both tests, I don’t think the above setting affects one but not the other.

The NGC LAMMPS was built with the following parameters, per the -h output:

OS: Linux "Ubuntu 20.04.4 LTS" 3.10.0-1160.36.2.el7.x86_64 x86_64

Compiler: GNU C++ 10.3.0 with OpenMP 4.5
C++ standard: C++14
MPI v3.1: Open MPI v4.1.3rc2, package: Open MPI root@6055a100160c Distribution, ident: 4.1.3rc2, repo rev: v4.1.3, Unreleased developer copy

Accelerator configuration:

KOKKOS package API: CUDA Serial
KOKKOS package precision: double
OPENMP package API: OpenMP
OPENMP package precision: double

Active compile time flags:

-DLAMMPS_GZIP
-DLAMMPS_SMALLBIG
sizeof(smallint): 32-bit
sizeof(imageint): 32-bit
sizeof(tagint):   32-bit
sizeof(bigint):   64-bit

Installed packages:

ASPHERE DPD-BASIC KOKKOS KSPACE MANYBODY MISC ML-SNAP MOLECULE MPIIO OPENMP
REAXFF REPLICA RIGID

For Spack, the build parameters are:

OS: Linux "CentOS Linux 7 (Core)" 3.10.0-1160.36.2.el7.x86_64 x86_64

Compiler: GNU C++ 11.3.0 with OpenMP 4.5
C++ standard: C++14
MPI v3.1: Open MPI v4.1.4, package: Open MPI optpar01@glogin01 Distribution, ident: 4.1.4, repo rev: v4.1.4, May 26, 2022

Accelerator configuration:

KOKKOS package API: CUDA Serial
KOKKOS package precision: double
OPENMP package API: OpenMP
OPENMP package precision: double

Active compile time flags:

-DLAMMPS_GZIP
-DLAMMPS_PNG
-DLAMMPS_JPEG
-DLAMMPS_FFMPEG
-DLAMMPS_SMALLBIG
sizeof(smallint): 32-bit
sizeof(imageint): 32-bit
sizeof(tagint):   32-bit
sizeof(bigint):   64-bit

Installed packages:

KIM KOKKOS KSPACE MANYBODY MOLECULE OPENMP OPT REAXFF

So with the exception of the GCC version, the build parameters are the same.
Based on your comment, I will double-check the performance of Open MPI between NGC and Spack.
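One way to do that outside LAMMPS, I think, is a CUDA-aware point-to-point test such as the OSU micro-benchmarks, which Spack also packages with a cuda variant (the invocation below is my assumption of the usual usage, not something verified in this thread):

# build the benchmark against the same CUDA-aware MPI stack
spack install osu-micro-benchmarks +cuda ^openmpi +cuda

# device-to-device bandwidth between two ranks (D = device buffers)
mpirun -np 2 osu_bw D D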

Sorry for the confusion. To run simulations without fix qeq, you also need to comment out its line in the input file. checkqeq no is an additional requirement.

@mkanski
Thanks. My lack of experience has caused confusion.
I have now disabled the qeq-related fix style:

pair_style        reax/c NULL checkqeq no

fix               1 all nve
#fix               2 all qeq/reax 1 0.0 10.0 1e-6 reax/c 

The above indeed leads to better performance in both cases. For simplicity, I just list the timesteps/s here:
NGC: 21.2 (original), 47.6 (no fix qeq)
Spack: 13.4 (original), 38.83 (no fix qeq)

The culprit, as stamoor suggested, is the inclusion of UVM.

@stamoor
Thanks for your suggestion. Indeed, building Kokkos without UVM recovers the performance lost relative to the NGC container. I would not have expected such a large penalty in the first place.

For reference, here is the full command to build LAMMPS/Kokkos with Spack:

spack install lammps@20220504 %[email protected] target=zen3 +kspace +manybody +molecule +opt +openmp-package +openmp +reaxff +kokkos \
    ^kokkos +wrapper +cuda cuda_arch=80 +cuda_lambda \
    ^openmpi +cuda +internal-hwloc +legacylaunchers fabrics=ucx \
        ^ucx +cuda +dc +dm +knem +rc
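Note that cuda_uvm is simply omitted here; if I read the kokkos package recipe correctly, it defaults to off. This can be confirmed after installation with:

spack find -v kokkos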

The performance is now 21.3 timesteps/s, in good agreement with the NGC result.

If there is another comment related to LAMMPS/Kokkos, please let us know.
