Hello, can someone help with this please 🙏? I am running LAMMPS-SNAP (23 June 2022 release: NERSC / N10 Benchmarks / Materials by Design Workflow · GitLab), which uses Kokkos 3.6.1 with the CUDA 12.2 backend (Nvidia A100 on Perlmutter), as a test application, and I notice a ~20% difference in loop time when clang (LLVM 17, 18, 19, and 20 all behave similarly) is used as the device compiler instead of nvcc 12.2, keeping the same MPI implementation. Can someone guide me to where the issue might be or how to start debugging?
I am new to both compilers and LAMMPS, so please tolerate some naive questions.
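For context, the two build configurations I am comparing look roughly like the following; this is only a sketch from memory, so the package flag, architecture option, and paths are approximations rather than my exact command lines:

# clang as both host and CUDA device compiler
cmake ../cmake -D PKG_ML-SNAP=on -D PKG_KOKKOS=on -D Kokkos_ENABLE_CUDA=on \
      -D Kokkos_ARCH_AMPERE80=on -D CMAKE_CXX_COMPILER=clang++

# nvcc 12.2 as device compiler via Kokkos' nvcc_wrapper
cmake ../cmake -D PKG_ML-SNAP=on -D PKG_KOKKOS=on -D Kokkos_ENABLE_CUDA=on \
      -D Kokkos_ARCH_AMPERE80=on \
      -D CMAKE_CXX_COMPILER=$PWD/../lib/kokkos/bin/nvcc_wrapper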
Is there a reason you need to compile with clang?
Would it not be more appropriate to use the vendor's compiler for their own hardware, since Nvidia is more likely to have a compiler highly optimized for the A100 (sometimes using undocumented tricks)? My guess is that the Nvidia compiler knows how to produce optimized code for the latest A100 tensor cores while clang does not.
It is also time to update your LAMMPS to the latest version (the develop branch's Kokkos lib has been upgraded to 4.3.01).
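To get the current development version, something like this should work (adapt to your own workflow; the directory name is just an example):

git clone -b develop https://github.com/lammps/lammps.git lammps-develop
cd lammps-develop
# (optional) check the version of the bundled Kokkos sources in lib/kokkos
grep -i "Kokkos_VERSION" lib/kokkos/CMakeLists.txt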
Thank you so much for the response, I really appreciate it. The only reason for using clang on the device is that my organization is focusing on adopting open-source implementations instead of vendor-provided ones. The goal is to get the difference to less than 10%. I would also like to mention a few insights:
The ComputeZi SNAP kernel shows the biggest difference in loop time of all kernels during profiling.
The current LAMMPS stable version (commit 46265e3) has a flag that is set only when clang is the device compiler: KOKKOS_IMPL_CUDA_CLANG_WORKAROUND. I didn't find any documentation about why it exists or what it does.
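For anyone who wants to look at it, the macro can be located in the bundled Kokkos sources with a simple grep from the LAMMPS source tree:

grep -rn "KOKKOS_IMPL_CUDA_CLANG_WORKAROUND" lib/kokkos/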
Please note that the SNAP code has been heavily optimized with the help of software engineers from both Nvidia and AMD to give the best performance possible with the “native” tools on the top supercomputers.
If you don't use those tools (for whatever reason), you must expect degraded performance.
Kokkos is still a rapidly moving target. By using the stable branch you are significantly behind in development. Please note that good performance is due to both Kokkos and the LAMMPS code.
Any of these kinds of KOKKOS_IMPL flags are part of Kokkos, so you need to consult the Kokkos documentation and the Kokkos developers to find out more. https://kokkos.org/
Thank you so much for the quick response. I do understand your point, and it does make sense that spending time to optimize away the last 20% will not be worth it.
Also, regarding LAMMPS on the develop branch: it was mentioned somewhere in the build docs that the Kokkos versions that come bundled with LAMMPS are heavily tested before every release, so do you think a naive replacement with the latest Kokkos version will work?
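By "naive replacement" I mean simply swapping the bundled lib/kokkos directory for a newer Kokkos checkout, roughly like this (the exact tag name is a guess on my part):

cd lammps
mv lib/kokkos lib/kokkos.orig
git clone -b 4.4.00 --depth 1 https://github.com/kokkos/kokkos.git lib/kokkos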
Thank you so much, and apologies for the delay in responding. I tried a naive replacement with Kokkos 4.4.0 for LAMMPS (Update 4 for Stable release 2 August 2023) and got the following error:
LAMMPS (2 Aug 2023 - Update 4)
KOKKOS mode with Kokkos version 4.4.0 is enabled (src/KOKKOS/kokkos.cpp:108)
will use up to 4 GPU(s) per node
using 1 OpenMP thread(s) per MPI tasks
...
...
cxil_map: write error
MPICH ERROR [Rank 1] [job id 29930307.0] [Thu Aug 29 02:15:43 2024] [nid001005] - Abort(471444751) (rank 1 in comm 0): Fatal error in PMPI_Irecv: Other MPI error, error stack:
PMPI_Irecv(166)........: MPI_Irecv(buf=0x6a4f4be80, count=589824, MPI_DOUBLE, src=33, tag=0, MPI_COMM_WORLD, request=0x7ffc42ad33f4) failed
MPID_Irecv(529)........:
MPIDI_irecv_unsafe(163):
MPIDI_OFI_do_irecv(356): OFI tagged recv failed (ofi_recv.h:356:MPIDI_OFI_do_irecv:Invalid argument)
aborting job:
Fatal error in PMPI_Irecv: Other MPI error, error stack:
For CMake you need to configure with Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC:BOOL=OFF, or with KOKKOS_CUDA_OPTIONS = "enable_lambda,disable_malloc_async" for the traditional Makefile build.
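For example, something along these lines, depending on which build system you use (the rest of the configure line is whatever you already have):

# CMake build: add this option to your existing configure command
cmake ../cmake -D PKG_KOKKOS=on -D Kokkos_ENABLE_CUDA=on \
      -D Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC:BOOL=OFF   # ... plus your other options

# traditional make build: set this in your machine Makefile instead
# KOKKOS_CUDA_OPTIONS = "enable_lambda,disable_malloc_async"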
My experience is that you need to build your own Open MPI configured for CUDA and your local cluster environment; otherwise you will get into all kinds of trouble.
This is how I did it:
wget https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.5.tar.gz
tar xzvf openmpi-5.0.5.tar.gz; cd openmpi-5.0.5
module load StdEnv/2023 gcc/12.3 cuda/12.2
./configure --prefix=$HOME/local/openmpi-5.0.5 --with-cuda=$CUDA_HOME \
--with-cuda-libdir=$CUDA_HOME/lib64/stubs --disable-io-romio \
--without-knem --with-io-romio-flags=--without-ze ; make -j 64 all; make install
StdEnv/2023 is the standard module on my local clusters; yours will be different.
The Open MPI configure options for your local cluster might also be different; consult the Open MPI documentation.
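After installing, I make sure the new wrappers and libraries are picked up before building anything, e.g.:

export PATH=$HOME/local/openmpi-5.0.5/bin:$PATH
export LD_LIBRARY_PATH=$HOME/local/openmpi-5.0.5/lib:$LD_LIBRARY_PATH
which mpicc mpicxx   # sanity check: both should point into $HOME/local/openmpi-5.0.5/bin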
Thank you so much for the answer, I really appreciate your efforts. I will try this out, but I would like to mention that I am running with Cray-MPICH 8.1.28 on Perlmutter at NERSC, using gcc 12.3 for the host and nvcc 12.2 for the device. Also, I only see this error with Kokkos 4.4 and not with Kokkos 3.7.2. Would you still expect a difference with a custom Open MPI build?
Thank you, I will test this out and report back. Also, LAMMPS (Update 4 for Stable release 2 August 2023) only builds with clang >= 19; shouldn't this be mentioned somewhere in the release docs (I am happy to help)? I had to try multiple versions, starting from clang 16, to see if anything works.
Please be careful with such statements. It is not correct that LAMMPS requires these clang compiler versions. LAMMPS itself requires C++11 and thus supports a wide variety of compilers (back to GCC 4.8.x on CentOS 7 which is barely C++11 compatible).
This, however, is not true for Kokkos. LAMMPS 2 Aug 2023 has only been vetted with Kokkos 3.x, which requires a C++14 capable compiler. So by using a Kokkos 4.x version, you are taking a significant risk of incompatibilities; perhaps not in syntax (or else it would not compile), but in semantics.
The current LAMMPS development version (hopefully released later today as the new stable version) has been vetted with Kokkos 4.3.x, which requires C++17. I have been able to compile a LAMMPS executable from these sources, including GPU support for an AMD GPU, using the hipcc compiler, which is based on Clang 17.
You have to check the release notes of Kokkos to determine which compilers are compatible with it. The requirements for your use mode (GPU support via Clang) may be different from the requirements for, say, Kokkos/OpenMP or Kokkos/HIP.
Apologies for my naive conclusion; I will be more careful next time. I really appreciate the clarification. I am still learning, and this is incredibly helpful.
The version of UCX used by some MPI implementations does not support cudaMallocAsync. For more info see these links:
You are not the first person to have this issue, and I'm at the point where I think we should change the default, at least until it gets fixed. I will plan to submit a PR.