Support for multiple GPU architectures

Dear experts

Part of our compute infrastructure (for test runs mostly) is heterogeneous in terms of Nvidia GPU generations (P100, V100, A100) and we are looking for ways to have LAMMPS support this, even if it means a slight drop in performance.

We compile LAMMPS using Kokkos to get GPU support, and we also build the Python extension module and use that downstream. One must specify exactly one GPU architecture (CUDA compute capability major version) to compile for. We first assumed that a binary built for, say, a V100 GPU (cmake -D Kokkos_ARCH_VOLTA70=yes) would also run on all newer architectures, since the kokkos-cuda.cmake preset says

# preset that enables KOKKOS and selects CUDA compilation with OpenMP
# enabled as well. This preselects CC 5.0 as default GPU arch, since
# that is compatible with all higher CC, but not the default CC 3.5
[ ... ]
set(Kokkos_ARCH_PASCAL60 ON CACHE BOOL "" FORCE)

However, when running things built for V100 on an A100 GPU, we see the error

Kokkos::Cuda::initialize ERROR: likely mismatch of architecture

which is also what the LAMMPS + Kokkos docs say would happen (no support across compute capability major versions).

Currently we build separate LAMMPS extension modules for different architectures to handle that, and just want to make sure we’re doing things by the book, so my questions would be:

  1. The kokkos-cuda.cmake preset seems to suggest that support for multiple architectures exists, but the LAMMPS + Kokkos docs and our tests imply the opposite. Which one is correct? We want to make sure that the error really is due to missing cross-architecture support rather than a compile mistake on our end.

  2. As a solution, is it possible to build a “fat binary” by compiling for multiple selected architectures? We already found that something like

    cmake -D Kokkos_ARCH_VOLTA70=yes -D Kokkos_ARCH_AMPERE80=yes
    

    doesn’t work (cmake says “use exactly one architecture”).

  3. Besides using Kokkos to get GPU support, there is also LAMMPS’ own GPU package. Its docs say that it supports “all major GPU architectures supported by this [CUDA] toolkit. Thus the GPU_ARCH setting is merely an optimization, to have code for the preferred GPU architecture directly included rather than having to wait for the JIT compiler of the CUDA driver to translate it.” – That sounds exactly like the behavior we are looking for, so should we use that instead?

Thanks in advance.

Always the documentation.

Technically, yes, but it won’t work with KOKKOS. I have managed to compile such an executable, but the result is that LAMMPS appears to be stuck while the JIT compiler recompiles all GPU kernels, and then it stops with an error regardless. This is a design decision of the Kokkos developers and thus a limitation of the Kokkos library itself. The information I got when asking about it was that it may work across minor architecture differences, but it is not supposed to work across major architecture differences, because of different architecture-dependent code paths.

When compiling with CMake, the GPU package will always build “fat” binaries for CUDA and should thus create binaries that will run on different Nvidia GPU architectures. Also, you can compile it in OpenCL mode and then the same executable will even work on Intel or AMD GPUs in addition to Nvidia GPUs. It even has some internal heuristics to optimize the kernels for individual GPU architectures.
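For reference, a minimal configuration along those lines might look like the following (a sketch, not a complete build recipe; GPU_ARCH only selects the architecture whose machine code is embedded directly, everything else falls back to the driver's JIT compiler):

# CUDA backend of the GPU package; sm_70 is just the preferred architecture,
# other Nvidia architectures still work via the embedded PTX + JIT.
cmake ../cmake -D PKG_GPU=yes -D GPU_API=cuda -D GPU_ARCH=sm_70

# OpenCL backend; the same binary also runs on AMD or Intel GPUs.
cmake ../cmake -D PKG_GPU=yes -D GPU_API=opencl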

Whether KOKKOS or the GPU package is the better choice depends on your application. In some cases only KOKKOS provides GPU support, in others only the GPU package. There are also performance differences, and KOKKOS currently only supports full double precision.

A pragmatic solution would be to compile multiple binaries under different names and then create a wrapper shell script that detects which GPU a machine has (e.g. by parsing the output of lspci -mm | grep VGA), selects the executable name based on that information, and then just does an exec ${exename} "$@".
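Such a wrapper could look roughly like this (a sketch; the binary names and the detection via nvidia-smi are only assumptions for illustration):

#!/usr/bin/env bash
# Hypothetical wrapper: pick a per-architecture LAMMPS binary based on the
# GPU found on this node, then replace the wrapper process with it.
set -euo pipefail

gpu_name=$(nvidia-smi --query-gpu=name --format=csv,noheader | head -n1)

case "$gpu_name" in
    *P100*) exename=lmp_pascal60 ;;
    *V100*) exename=lmp_volta70  ;;
    *A100*) exename=lmp_ampere80 ;;
    *)      echo "no LAMMPS build for GPU: $gpu_name" >&2; exit 1 ;;
esac

# Pass all original arguments through to the selected binary.
exec "$exename" "$@"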

I talked to the Kokkos developers on their Slack channel and it appears that it should work, but the call is failing in CUDA, not Kokkos. They suggested that the necessary PTX may not be embedded in the binary, and that tweaking the compile flags may help.
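One way to check that on your side (a sketch, assuming cuobjdump from the CUDA toolkit is on the PATH; replace the placeholder path with wherever your liblammps.so ends up):

# List the embedded cubins (SASS) per architecture:
cuobjdump --list-elf /path/to/liblammps.so
# List the embedded PTX that the driver could JIT-compile for newer GPUs:
cuobjdump --list-ptx /path/to/liblammps.so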

@elcorto Can you please post your compiler and linker flags here?

For reference, here is the check in Kokkos:

  // Query what compute capability architecture a kernel executes:
  Impl::CudaInternal::m_cudaArch = Impl::cuda_kernel_arch(cuda_device_id);

  if (Impl::CudaInternal::m_cudaArch == 0) {
    Kokkos::abort(
        "Kokkos::Cuda::initialize ERROR: likely mismatch of architecture\n");
  }
  
  int compiled_major = Impl::CudaInternal::m_cudaArch / 100;
  int compiled_minor = (Impl::CudaInternal::m_cudaArch % 100) / 10;
  
  if ((compiled_major > cudaProp.major) ||
      ((compiled_major == cudaProp.major) &&
       (compiled_minor > cudaProp.minor))) {
    std::stringstream ss;
    ss << "Kokkos::Cuda::initialize ERROR: running kernels compiled for "
          "compute capability " 
       << compiled_major << "." << compiled_minor
       << " on device with compute capability " << cudaProp.major << "."
       << cudaProp.minor << " is not supported by CUDA!\n";
    std::string msg = ss.str();
    Kokkos::abort(msg.c_str()); 
  } 
  if (Kokkos::show_warnings() &&
      (compiled_major != cudaProp.major || compiled_minor != cudaProp.minor)) {
    std::cerr << "Kokkos::Cuda::initialize WARNING: running kernels compiled "
                 "for compute capability "
              << compiled_major << "." << compiled_minor
              << " on device with compute capability " << cudaProp.major << "."
              << cudaProp.minor
              << " , this will likely reduce potential performance."
              << std::endl;
  }

This issue also seems related: Allow compiling multiple CUDA architectures · Issue #7834 · kokkos/kokkos · GitHub; I put a note there.


OK, that is what is meant by the note in the LAMMPS + Kokkos docs.

That’s good to know, thanks. Unfortunately, the way our application (which is this) interfaces with LAMMPS requires GPU support to be built with Kokkos. I have to double-check with the other devs, though.

True, this is the first solution we had in mind, but we wanted to be sure that it was the best option, hence the question here.

Thank you very much for your efforts! Christian Robert Trott of the Kokkos team was so kind as to share his response via email, which I’ll post below:

I can gather and share compile output as suggested above if needed. In the meantime, here is how we build LAMMPS.

To provide more context, we actually do this as part of a docker image build, which we convert to an apptainer/singularity image and run on GPU nodes via apptainer exec --nv. I’m not sure whether that is known to cause problems in terms of forward compatibility.

cuda_version=12.4.1
lammps_version=patch_4Feb2025
lammps_gpu_arch=VOLTA70
lammps_cpu_arch=HSW
py_version=3.11
sys_install_path=/usr/local/lib/python${py_version}/dist-packages

spack install cuda@$cuda_version arch=x86_64

# The lammps linker wants libcuda.so.1 which is not exposed by spack's cuda
# package and also not by the equiv module cuda/12.4 on the cluster:
#   find /trinity/shared/pkg/devel/cuda/12.4/ -name "libcuda.so.1"
# There is only stubs/libcuda.so . So create a link here to make the linker
# happy. When running this image (after apptainer convert) with apptainer exec
# --nv, a libcuda.so.1 is mapped into the container.
spack load cuda@$cuda_version
cd $(spack location -i cuda@$cuda_version)/lib64
ln -s stubs/libcuda.so libcuda.so.1

cd /opt/soft/git
git clone --branch=$lammps_version --depth=1 https://github.com/lammps/lammps.git
export LD_LIBRARY_PATH=$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
cd /opt/soft/git/lammps
mkdir -pv build
cd build
rm -rf *
cmake ../cmake \
    -D PKG_KOKKOS=yes \
    -D BUILD_MPI=yes \
    -D PKG_ML-SNAP=yes \
    -D Kokkos_ENABLE_CUDA=yes \
    -D Kokkos_ARCH_${lammps_cpu_arch}=yes \
    -D Kokkos_ARCH_${lammps_gpu_arch}=yes \
    -D CMAKE_CXX_COMPILER=$(pwd)/../lib/kokkos/bin/nvcc_wrapper \
    -D BUILD_SHARED_LIBS=yes

cmake --build . --parallel=8

# Build Python extension
cd /opt/soft/git/lammps/python
python3 -m venv __tmp_env
. ./__tmp_env/bin/activate
python install.py -p lammps -l ../build/liblammps.so -v ../src/version.h
deactivate
cp -rv __tmp_env/lib/python${py_version}/site-packages/lammps* $sys_install_path/

Note that the CUDA we use during the build is not part of the final image. At runtime, we make the host CUDA available to the LAMMPS Python extension inside the container, so that the two CUDA shared libraries needed at runtime are present. So, roughly:

host$ module load cuda/12.4
host$ apptainer shell --nv --cleanenv --contain --bind /path/to/host/cuda/lib64 image.sif
Apptainer> export LD_LIBRARY_PATH=/path/to/host/cuda/lib64:$LD_LIBRARY_PATH
Apptainer> ldd /usr/local/lib/python3.11/dist-packages/lammps/liblammps.so | grep cud
        libcuda.so.1 => /.singularity.d/libs/libcuda.so.1 (0x00002aaab044f000)
        libcudart.so.12 => /path/to/host/cuda/lib64/libcudart.so.12 (0x00002aaab2200000)
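For completeness, a minimal way to exercise the Kokkos GPU initialization from the Python module (a sketch; the -k on g 1 -sf kk flags enable the KOKKOS package on one GPU and would trigger the same architecture check):

Apptainer> python3 -c "from lammps import lammps; lmp = lammps(cmdargs=['-k', 'on', 'g', '1', '-sf', 'kk']); print(lmp.version()); lmp.close()"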

Here are the build logs. I can’t upload things here, so I made them available elsewhere.

We made progress and (potentially) solved the issue.

Our machines have a fixed CUDA driver version (CUDA toolkit 12.1, libcuda.so.1 → libcuda.so.530.30.02). What we did wrong was to compile LAMMPS using a newer toolkit version (12.4). With that, things do run, but only on the GPU architecture we compiled for. On others we see the “mismatch of architecture” error.

If we compile LAMMPS with CUDA toolkit version 12.1 (the same as the host CUDA driver’s toolkit version), then we seem to have forward compatibility. For example, if we compile for PASCAL60, we can run on V100 (VOLTA70) or A100 (AMPERE80). Results are the same (up to numerical noise) compared to a “native” build (e.g. VOLTA70 on V100).
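As a side note, a quick way to compare the toolkit used for the build with what the host driver supports (a sketch; the exact output format varies between releases):

# CUDA toolkit used at build time
nvcc --version | grep release

# Highest CUDA version supported by the host driver (shown in the nvidia-smi banner)
nvidia-smi | head -n 4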

There is one remaining question: Running in what we think is the forward compatibility mode, when LAMMPS is called for the first time, we notice a wait of maybe 1 min before calculations continue. We assumed that this is where the host CUDA JIT is recompiling kernels for the current GPU architecture. However, this message from Kokkos (PASCAL60 on A100 example)

Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability 6.0 on device with compute capability 8.0 , this will likely reduce potential performance.

seems to suggest that this is not the case and that instead the compute capability 6.0 (PASCAL60) machine code is used. Is that correct?

This is backward compatibility. The newer architecture is backward compatible with executable code for a previous architecture.

That is rather short; I’ve seen much longer waits back when I was experimenting with this.
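As a side note (an assumption on my part, in case the wait really is JIT compilation): the CUDA driver caches JIT-compiled kernels, so the wait should only occur on the first run. The cache location and size can be controlled with environment variables, e.g.:

# Default cache location is ~/.nv/ComputeCache; point it somewhere persistent
# (e.g. a bind-mounted host path when running in a container) and enlarge it.
export CUDA_CACHE_PATH=$HOME/.cuda_jit_cache
export CUDA_CACHE_MAXSIZE=4294967296   # 4 GiB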

Yes.

Yes and no. An older compute capability is missing features that a newer architecture has. Thus at (LAMMPS) compile time, the code has to be compiled to be compatible with the lowest common denominator. That makes optimizations that can only work on newer architectures unavailable. So there is a potential slowdown compared to a kernel compiled for that specific architecture.

Thank you very much, that answers all our questions.

In terms of performance, we’re OK with that for the target use case, which is small-scale runs on whatever hardware happens to be available. For large-scale production we opt for homogeneous hardware and a LAMMPS build compiled for it.