LAMMPS-SNAP performance drop with LLVM/Clang on Nvidia A100

Hello, can someone help with this, please? 🙏 I am running LAMMPS-SNAP (23 June 2022 release, from the NERSC N10 Benchmarks / Materials by Design Workflow on GitLab), built on Kokkos 3.6.1 with the CUDA 12.2 backend (Nvidia A100 on Perlmutter), as a test application. I notice a ~20% difference in loop time when Clang (LLVM 17, 18, 19, and 20 all behave similarly) is used as the device compiler instead of nvcc 12.2, keeping the same MPI implementation. Can someone guide me to where the issue might be or how to start debugging?
I am new to both compilers and LAMMPS, so please bear with some naive questions.
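For context, the two builds differ only in the device compiler, roughly like this (a sketch; the nvcc_wrapper path is an assumption and the Perlmutter module setup is omitted):

# nvcc as device compiler, via the nvcc_wrapper script bundled with Kokkos
cmake -D PKG_KOKKOS=yes -D Kokkos_ENABLE_CUDA=yes -D Kokkos_ARCH_AMPERE80=ON \
  -D CMAKE_CXX_COMPILER=$HOME/lammps/lib/kokkos/bin/nvcc_wrapper ../cmake

# clang as device compiler (Clang compiles CUDA sources directly, no wrapper)
cmake -D PKG_KOKKOS=yes -D Kokkos_ENABLE_CUDA=yes -D Kokkos_ARCH_AMPERE80=ON \
  -D CMAKE_CXX_COMPILER=clang++ ../cmake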

Is there a reason you need to compile with Clang?

Would it not be more appropriate to use the vendor's compiler for their own hardware, since Nvidia is more likely to have a compiler highly optimized for the A100 (sometimes using undocumented tricks)? My guess is the Nvidia compiler knows how to produce optimized code for the latest A100 tensor cores while Clang does not.

It's also time to update your LAMMPS to the latest version (the Kokkos library in the develop branch has been upgraded to 4.3.01).

git clone https://github.com/lammps/lammps.git

then read:

https://docs.lammps.org/Build_cmake.html


Thank you so much for the response, really appreciate it. The only reason for using Clang on the device is that my organization is focusing on adopting open-source implementations instead of vendor-provided ones. The goal is to get the difference below 10%. I would also like to mention a few insights:

  1. The ComputeZi SNAP kernel shows the biggest difference in loop time of all kernels during profiling (timings gathered roughly as sketched below this list).
  2. The current LAMMPS stable version (commit: 46265e3) has a flag that is set only when Clang is the device compiler: KOKKOS_IMPL_CUDA_CLANG_WORKAROUND. I didn't find any documentation about why it exists or what it does.
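
For reference, the per-kernel timings came from runs like these (a sketch: the input name in.snap and the tool path are assumptions; the kernel timer is from https://github.com/kokkos/kokkos-tools):

# Per-kernel timings via the Kokkos Tools kernel timer
# (newer Kokkos also accepts the KOKKOS_TOOLS_LIBS variable)
export KOKKOS_PROFILE_LIBRARY=/path/to/kokkos-tools/kp_kernel_timer.so
srun -n 4 ./lmp -k on g 4 -sf kk -in in.snap

# Alternatively, with Nsight Systems (one report per rank)
srun -n 4 nsys profile -o snap.%q{SLURM_PROCID} ./lmp -k on g 4 -sf kk -in in.snap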

Please let me know what you think.

Please note that the SNAP code has been heavily optimized with the help of software engineers from both Nvidia and AMD to give the best performance possible with the “native” tools on the top supercomputers.

If you don't use those tools (for whatever reason), you must expect degraded performance.

Kokkos is still a rapidly moving target. By using the stable branch you are significantly behind current development. Please note that good performance comes from both Kokkos and the LAMMPS code.

These kinds of KOKKOS_IMPL flags are part of Kokkos, so you need to consult the Kokkos documentation and the Kokkos developers to find out more: https://kokkos.org/
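
If you just want to see what the macro guards, you can search the Kokkos sources bundled with LAMMPS directly (path relative to the LAMMPS source tree):

grep -rn "KOKKOS_IMPL_CUDA_CLANG_WORKAROUND" lib/kokkos/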

Thank you so much for the quick response. I do understand your point, and it does make sense that spending the time to optimize away the last 20% will not be worth it.

Also, regarding LAMMPS in the develop branch: it was mentioned somewhere in the build docs that the Kokkos version that ships with LAMMPS is heavily tested before every release, so do you think a naive replacement with the latest Kokkos version will work? (The two variants I have in mind are sketched below.)
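
# (a) swap out the Kokkos tree bundled with LAMMPS
cd lammps/lib
mv kokkos kokkos.bundled
git clone --branch 4.4.00 https://github.com/kokkos/kokkos.git

# (b) build against a separately installed Kokkos -- whether the
# EXTERNAL_KOKKOS CMake option exists in a given LAMMPS version is an
# assumption to verify in its build docs
cmake -D PKG_KOKKOS=yes -D EXTERNAL_KOKKOS=yes \
  -D Kokkos_ROOT=/path/to/kokkos-install ../cmake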

Gratefully,
Shubh

Yes, normally it works fine.

Thank you so much, and apologies for the delayed response. I tried a naive replacement with Kokkos 4.4.0 in LAMMPS (Update 4 for Stable release 2 August 2023) and got the following error:

LAMMPS (2 Aug 2023 - Update 4)
KOKKOS mode with Kokkos version 4.4.0 is enabled (src/KOKKOS/kokkos.cpp:108)
  will use up to 4 GPU(s) per node
  using 1 OpenMP thread(s) per MPI tasks
...
...
cxil_map: write error
MPICH ERROR [Rank 1] [job id 29930307.0] [Thu Aug 29 02:15:43 2024] [nid001005] - Abort(471444751) (rank 1 in comm 0): Fatal error in PMPI_Irecv: Other MPI error, error stack:
PMPI_Irecv(166)........: MPI_Irecv(buf=0x6a4f4be80, count=589824, MPI_DOUBLE, src=33, tag=0, MPI_COMM_WORLD, request=0x7ffc42ad33f4) failed
MPID_Irecv(529)........: 
MPIDI_irecv_unsafe(163): 
MPIDI_OFI_do_irecv(356): OFI tagged recv failed (ofi_recv.h:356:MPIDI_OFI_do_irecv:Invalid argument)

aborting job:
Fatal error in PMPI_Irecv: Other MPI error, error stack:

CMake build command:

cmake -D CMAKE_C_COMPILER=cc -D CMAKE_CXX_COMPILER=CC \
  -D CMAKE_BUILD_TYPE=Release \
  -D CMAKE_INSTALL_PREFIX=${PSCRATCH}/exaalt_device_compilers/LMP_TEST \
  -D LAMMPS_EXCEPTIONS=on \
  -D BUILD_SHARED_LIBS=yes \
  -D BUILD_MPI=yes \
  -D PKG_KOKKOS=yes -D Kokkos_ARCH_AMPERE80=ON -D Kokkos_ENABLE_CUDA=yes \
  -D PKG_MANYBODY=yes \
  -D PKG_REPLICA=yes \
  -D PKG_ML-SNAP=yes \
  -D PKG_EXTRA-FIX=yes \
  -D PKG_MPIIO=yes \
  -D LAMMPS_SIZES=BIGBIG \
  -D CMAKE_CXX_STANDARD=17 \
  ../cmake

Please let me know what you think about this.

Can you try disabling async malloc?

For CMake you need to configure with -D Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC:BOOL=OFF; for the traditional Makefile build, set KOKKOS_CUDA_OPTIONS = "enable_lambda,disable_malloc_async".
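
For example, with CMake the flag slots in next to the other Kokkos options (minimal sketch; the rest of your configure line above carries over unchanged):

cmake -D PKG_KOKKOS=yes -D Kokkos_ENABLE_CUDA=yes -D Kokkos_ARCH_AMPERE80=ON \
  -D Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC:BOOL=OFF ../cmake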

Otherwise, your MPI may not be CUDA-aware.
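
On a Cray system like Perlmutter, GPU-aware MPI is also opt-in at run time, so it is worth confirming it is switched on (a sketch; assumes cray-mpich and the GPU modules are loaded, and in.snap is a placeholder input):

export MPICH_GPU_SUPPORT_ENABLED=1
srun -n 4 ./lmp -k on g 4 -sf kk -in in.snap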

My experience is that you need to build your own OpenMPI configured for CUDA and your local cluster environment; otherwise you will get into all kinds of trouble.

This is how I did it:

wget https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.5.tar.gz
tar xzvf openmpi-5.0.5.tar.gz; cd openmpi-5.0.5

module load StdEnv/2023 gcc/12.3 cuda/12.2

./configure --prefix=$HOME/local/openmpi-5.0.5 --with-cuda=$CUDA_HOME \
--with-cuda-libdir=$CUDA_HOME/lib64/stubs --disable-io-romio \
--without-knem --with-io-romio-flags=--without-ze ; make -j 64 all; make install

StdEnv/2023 is the standard module for my local clusters; yours will be different.

OpenMPI configure options for your local cluster might also be different; consult the OpenMPI documentation:

https://docs.open-mpi.org/en/v5.0.x/installing-open-mpi/index.html

$HOME/local/modules/openmpi/5.0.5:

#%Module
# Tcl modulefile: $HOME is not a Tcl variable, so use $env(HOME);
# braces around the value would also suppress substitution
set prefix $env(HOME)/local/openmpi-5.0.5
set version 5.0.5
prepend-path CMAKE_PREFIX_PATH ${prefix}
prepend-path PATH ${prefix}/bin
prepend-path CPATH ${prefix}/include
prepend-path LIBRARY_PATH ${prefix}/lib
prepend-path LD_LIBRARY_PATH ${prefix}/lib
prepend-path MANPATH ${prefix}/share/man
prepend-path PKG_CONFIG_PATH ${prefix}/lib/pkgconfig
setenv MODULE_OPENMPI_PREFIX ${prefix}

and then you can use

module use $HOME/local/modules
module load StdEnv/2023 gcc/12.3 cuda/12.2 openmpi/5.0.5

before building and running lmp.


Thank you so much for the answer, I really appreciate your efforts. I will try this out, but I would like to mention that I am running with Cray-MPICH 8.1.28 on Perlmutter at NERSC, which uses gcc 12.3 for the host and nvcc 12.2 for the device. Also, I only see this error with Kokkos 4.4 and not with Kokkos 3.7.2. Would you still expect a difference with a custom OpenMPI build?

Thank you, I will test this out and report back. Also, LAMMPS (Update 4 for Stable release 2 August 2023) only builds with clang >= 19 for me. Shouldn't this be mentioned somewhere in the release docs (I am happy to help)? I had to try multiple versions starting from clang 16 to find one that works.

Please be careful with such statements. It is not correct that LAMMPS requires these Clang compiler versions. LAMMPS itself requires C++11 and thus supports a wide variety of compilers (back to GCC 4.8.x on CentOS 7, which is barely C++11-compatible).

This, however, is not true for Kokkos. LAMMPS 2 Aug 2023 has only been vetted with Kokkos 3.x, which requires a C++14 capable compiler. So by using a Kokkos 4.x version, you are taking a significant risk of incompatibilities; perhaps not in syntax (or else it would not compile), but in semantics.

The current LAMMPS development version (hopefully released later today as the new stable version) has been vetted with Kokkos 4.3.x, which requires C++17. I have been able to compile a LAMMPS executable from these sources, including GPU support for an AMD GPU, using the hipcc compiler, which is based on Clang 17.

You have to check the Kokkos release notes to determine which compilers are compatible with it. The requirements for your use mode (GPU support via Clang) may differ from those for Kokkos/OpenMP or Kokkos/HIP, for example.

Apologies for my naive conclusion; I will be more careful next time. I really appreciate the clarification. I am still learning, and this is incredibly helpful.

This works, thank you so much! Is it possible to elaborate on why this worked and why it is not the default?

The version of UCX used by some MPI implementations does not support cudaMallocAsync. For more info see these links:

You are not the first person to have this issue, and I'm at the point where I think we should change the default, at least until it gets fixed. I plan to submit a PR.


Thank you so much for the links, they help a lot.