LAMMPS/Kokkos: Performance discrepancy between src-built binary and NGC container

Hello,

According to NVIDIA, the nominal performance of the HNS benchmark on 8 x A100-SXM4 GPUs is 2.44E+07 ATOM-Time Steps/s. (https://developer.nvidia.com/hpc-application-performance)

I was able to replicate this result using NGC containers, as follows:

mpirun \
  -np 8 \
  singularity \
    run \
      --nv \
      ./lammps_4May2022.sif \
      /usr/local/lammps/sm80/bin/lmp \
        -k on g 8 \
        -sf kk \
        -pk kokkos cuda/aware on neigh half comm device neigh/qeq full newton on \
        -var x 16 \
        -var y 16 \
        -var z 16 \
        -var steps 1000 \
        -nocite \
        -in in.reaxc.hns \
        -log reacx-ngc.log

This resulted in 2.66E+07 ATOM-Time Steps/s; the small difference can likely be attributed to the use of a different host CPU.

I then tried to replicate this result with a LAMMPS package built via Spack (0.19.1):

spack install lammps@20220504 %[email protected] target=zen3 +kspace +manybody +molecule +opt +openmp-package +openmp +reaxff +kokkos ^[email protected] +aggressive_vectorization +cuda cuda_arch=80 +cuda_lambda +cuda_ldg_intrinsic +cuda_uvm +wrapper

Here, Kokkos 3.7 was built with the CUDA backend and additional features such as UVM.
Other dependencies such as OpenMPI and UCX were also built as CUDA-aware libraries (+cuda):

^[email protected]%[email protected]~atomics+cuda~cxx~cxx_exceptions~gpfs~internal-hwloc~java+legacylaunchers~lustre~memchecker+romio+rsh~singularity+static+vt+wrapper-rpath build_system=autotools cuda_arch=none fabrics=ucx schedulers=none arch=linux-centos7-zen3

^[email protected]%[email protected]~assertions~backtrace_detail~cma+cuda+dc~debug~dm+examples~gdrcopy~ib_hw_tm~java+knem~logging~mlx5_dv+openmp+optimizations~parameter_checking+pic+rc~rdmacm~rocm+thread_multiple~ucg~ud~verbs~vfs~xpmem build_system=autotools cuda_arch=none libs=shared,static opt=3 patches=32fce32 simd=auto arch=linux-centos7-zen3
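(The concretized variants above can be double-checked against the installed packages with Spack itself; one typical query, though not the only one, is:

spack find -vd lammps@20220504

where -v lists variants and -d includes dependencies.)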

The command to run the benchmark was the same as in the NGC case.

mpirun \
  -np 8 \
  lmp \
    -k on g 8 \
    -sf kk \
    -pk kokkos cuda/aware on neigh half comm device neigh/qeq full newton on \
    -var x 16 \
    -var y 16 \
    -var z 16 \
    -var steps 1000 \
    -nocite \
    -in in.reaxc.hns \
    -log reacx-spack.log

This time, I obtained only 1.74E+07 ATOM-Time Steps/s.

Since I am not yet allowed to upload files, here are links to the two output files:
reacx-ngc.log
reacx-spack.log

Kokkos is open source, and I don’t think NVIDIA has made additional optimizations that would create such a large performance gap. Any insights and suggestions are much appreciated.

Regards.

The Pair timings appear to be the same; the difference comes from Comm and Modify. I suspect the issue is related to fix qeq. Can you try redoing the benchmarks without it? You’ll need to add checkqeq no to the pair_style reaxff line.

You should be able to extract the compiler and OpenMPI library NVIDIA used by running lmp -h.
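For example, reusing the container and binary paths from your first command:

singularity run --nv ./lammps_4May2022.sif /usr/local/lammps/sm80/bin/lmp -h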

In general UVM “on” is slower than “off”, though I doubt that would explain the whole performance difference.
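If you want to test that, it should be enough to flip the corresponding variant in your original Spack spec (a sketch based on your command above; ~cuda_uvm disables the variant you enabled with +cuda_uvm):

spack install lammps@20220504 %[email protected] target=zen3 +kspace +manybody +molecule +opt +openmp-package +openmp +reaxff +kokkos ^[email protected] +aggressive_vectorization +cuda cuda_arch=80 +cuda_lambda +cuda_ldg_intrinsic ~cuda_uvm +wrapper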

Thanks for pointing out the difference in MPI timing.

The input file, per your suggestion, is now:

pair_style        reax/c NULL checkqeq no

However, it doesn’t have much impact on performance in either case.
Since I use the same input in both tests, I don’t think the above setting affects one but not the other.

The NGC LAMMPS was built with the following parameters, per the -h output:

OS: Linux "Ubuntu 20.04.4 LTS" 3.10.0-1160.36.2.el7.x86_64 x86_64

Compiler: GNU C++ 10.3.0 with OpenMP 4.5
C++ standard: C++14
MPI v3.1: Open MPI v4.1.3rc2, package: Open MPI root@6055a100160c Distribution, ident: 4.1.3rc2, repo rev: v4.1.3, Unreleased developer copy

Accelerator configuration:

KOKKOS package API: CUDA Serial
KOKKOS package precision: double
OPENMP package API: OpenMP
OPENMP package precision: double

Active compile time flags:

-DLAMMPS_GZIP
-DLAMMPS_SMALLBIG
sizeof(smallint): 32-bit
sizeof(imageint): 32-bit
sizeof(tagint):   32-bit
sizeof(bigint):   64-bit

Installed packages:

ASPHERE DPD-BASIC KOKKOS KSPACE MANYBODY MISC ML-SNAP MOLECULE MPIIO OPENMP
REAXFF REPLICA RIGID

For Spack, the build parameters are:

OS: Linux "CentOS Linux 7 (Core)" 3.10.0-1160.36.2.el7.x86_64 x86_64

Compiler: GNU C++ 11.3.0 with OpenMP 4.5
C++ standard: C++14
MPI v3.1: Open MPI v4.1.4, package: Open MPI optpar01@glogin01 Distribution, ident: 4.1.4, repo rev: v4.1.4, May 26, 2022

Accelerator configuration:

KOKKOS package API: CUDA Serial
KOKKOS package precision: double
OPENMP package API: OpenMP
OPENMP package precision: double

Active compile time flags:

-DLAMMPS_GZIP
-DLAMMPS_PNG
-DLAMMPS_JPEG
-DLAMMPS_FFMPEG
-DLAMMPS_SMALLBIG
sizeof(smallint): 32-bit
sizeof(imageint): 32-bit
sizeof(tagint):   32-bit
sizeof(bigint):   64-bit

Installed packages:

KIM KOKKOS KSPACE MANYBODY MOLECULE OPENMP OPT REAXFF

So with the exception of the GCC version, the build parameters are the same.
Based on your comment, I will double-check the performance of Open MPI between NGC and Spack.
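One way to do that outside LAMMPS, I think, is a CUDA-aware point-to-point test such as the OSU micro-benchmarks, which Spack also packages with a cuda variant (the invocation below is my assumption of the usual usage, not something verified in this thread):

# build the benchmark against the same CUDA-aware MPI stack
spack install osu-micro-benchmarks +cuda ^openmpi +cuda

# device-to-device bandwidth between two ranks (D = device buffers)
mpirun -np 2 osu_bw D D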

Sorry for the confusion. To run simulations without fix qeq, you also need to comment out its line in the input file. checkqeq no is an additional requirement.

@mkanski
Thanks. My lack of experience has caused confusion.
I have now disabled the qeq-related fix style:

pair_style        reax/c NULL checkqeq no

fix               1 all nve
#fix               2 all qeq/reax 1 0.0 10.0 1e-6 reax/c 

The above indeed leads to better performance in both cases. For simplicity, I just list the timesteps/s here:
NGC: 21.2 (original), 47.6 (no fix qeq)
Spack: 13.4 (original), 38.83 (no fix qeq)

The culprit, as stamoor suggested, is the inclusion of UVM.

@stamoor
Thanks for your suggestion. Indeed, building Kokkos without UVM recovers the performance lost relative to the NGC container. I would not have expected such a large penalty in the first place.

For reference, here is the full command to build LAMMPS/Kokkos with Spack:

spack install lammps@20220504 %[email protected] target=zen3 +kspace +manybody +molecule +opt +openmp-package +openmp +reaxff +kokkos \
    ^kokkos +wrapper +cuda cuda_arch=80 +cuda_lambda \
    ^openmpi +cuda +internal-hwloc +legacylaunchers fabrics=ucx \
        ^ucx +cuda +dc +dm +knem +rc
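Note that cuda_uvm is simply omitted here; if I read the kokkos package recipe correctly, it defaults to off. This can be confirmed after installation with:

spack find -v kokkos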

The performance is now 21.3 timesteps/s, in good agreement with the NGC result.

If there is another comment related to LAMMPS/Kokkos, please let us know.
