LAMMPS with Kokkos using SYCL for Intel PVC GPU

I am trying to use the makefile.aurora_kokkos that y’all recommended.

The mpicxx that comes with my Intel compilers is actually a wrapper around GNU g++ and doesn’t understand SYCL flags. When I switched to the Intel compiler that does understand SYCL flags, I got undefined-reference errors at the link step.

make[1]: Entering directory '[...]/lammps/src/Obj_aurora_kokkos'
mpiicpc -cxx=icpx -g -O3 main.o -L. -llammps_aurora_kokkos -lkokkos -ldl -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -march=sapphirerapids -mtune=sapphirerapids -fsycl -fsycl-targets=spir64_gen -Xsycl-target-backend "-device 12.60.7" -L[...]/lammps/src/Obj_aurora_kokkos -o ../lmp_aurora_kokkos
bin/ld: [...]/libmkl_intel_thread.so: undefined reference to `__kmpc_atomic_fixed4_rd'
icpx: error: linker command failed with exit code 1 (use -v to see invocation)

I added -lpthread -liomp5 and got past that. I expect this trouble is because I asked for threaded MKL to support the KSPACE package.

You can try adding -fiopenmp to the link line to pull in the OpenMP bits.

Never mind, I now see the bottom part of your message where you added the libraries.

I was able to get the makefile.aurora_kokkos working with some modifications.

I used a version of the compiler that’s as close to what Aurora has as I could find. I have 2023.1.0, but the executable names are different for some reason; mpicxx is no longer the right wrapper.

# ---------------------------------------------------------------------
# compiler/linker settings
# specify flags and libraries needed for your compiler

CC =		mpiicpc -cxx=icpx

LINK =		mpiicpc -cxx=icpx
LINKFLAGS =	-g -O3 -fsycl-link-huge-device-code -fsycl-max-parallel-link-jobs=30


# ---------------------------------------------------------------------
# LAMMPS-specific settings, all OPTIONAL

# MPI library

MPI_LIB = -lpthread -liomp5	# Intel OpenMP runtime, needed by threaded MKL

# FFT library

FFT_INC = -DFFT_MKL -DFFT_MKL_THREADS
FFT_PATH =
FFT_LIB = -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lpthread

The executable works fine for a simple case, but when I try to run my Rhodopsin benchmark, which uses FFT, I get a segmentation fault. (The benchmark also runs fine in serial mode; the crash only happens on the GPU.)

[ac043:627935:0:627935] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xff0000001c800080)
==== backtrace (tid: 627935) ====
 0 0x0000000000619409 GTPin_Init()  ???:0
 1 0x00000000005783a9 zeKernelSetIndirectAccessTracing()  ???:0
 2 0x0000000000012cf0 __funlockfile()  :0
 3 0x0000000001ad2dc4 mkl_dft_avx512_mg_colbatch_plain_fwd_09_d()  ???:0
 4 0x0000000001a7943c compute_mg_row_fwd()  bkd_c2c_1d_mg_d.c:0
 5 0x0000000000ba45ea DftiComputeForward()  ???:0
 6 0x0000000001b50dd0 LAMMPS_NS::FFT3dKokkos<Kokkos::Experimental::SYCL>::fft_3d_kokkos()  [...]/fft3d_kokkos.cpp:227
 7 0x0000000001b50a3e LAMMPS_NS::FFT3dKokkos<Kokkos::Experimental::SYCL>::compute()  [...]/fft3d_kokkos.cpp:99
 8 0x0000000001b50a3e ~ViewTracker()  [...]impl/Kokkos_ViewTracker.hpp:39
 9 0x0000000001b50a3e ~View()  [...]Kokkos_View.hpp:1269
10 0x0000000001b50a3e LAMMPS_NS::FFT3dKokkos<Kokkos::Experimental::SYCL>::compute()  [...]/fft3d_kokkos.cpp:99

Intel’s link-line help suggests I need to link like this to enable GPU offloading:
-fiopenmp -fopenmp-targets=spir64 -fsycl -L${MKLROOT}/lib/intel64 -lmkl_sycl_undefined -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lsycl -lstdc++ -lpthread -lm -ldl

As Stan indicated earlier, adding MKL FFT support for PPPM is a work in progress for the SYCL backend (it should be straightforward, but we’ve been having fun with other priorities). You could drop the kspace style and switch to something like lj/charmm/coul/charmm if you want to get a sense of how everything else in in.rhodo runs on your local setup.

I’m not sure DftiComputeForward is supported on the GPU. We need to add support for oneapi::mkl::dft::compute_forward, which should work.

I’m not attached to MKL FFT. Is there any other FFT I can choose that will enable kspace on SYCL?

The Kokkos version of KISS FFT was supposed to be the portable fallback that works with any backend, but as you know it is ironically broken because SYCL does not support recursive functions on the device. There is also heFFTe, but it isn’t enabled for the KOKKOS package yet. You could run PPPM on the host CPU, but for FFTs on the GPU, porting MKL is pretty much required at this point.

I found oneapi/mkl/dfti.hpp on my system, which provides compute_forward and friends. Would it be sufficient to swap out the include statement and the function calls?

That would be most of the work, but we would also need to get the SYCL queue from Kokkos and pass that into the calls. I don’t think any of this is very hard, and there are some examples provided by Intel, but it just hasn’t made it to the top of our to-do list.
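Roughly, the plumbing would look something like this minimal sketch, assuming the Kokkos SYCL execution space exposes its underlying queue via a sycl_queue() accessor, and using the descriptor API from oneapi/mkl/dfti.hpp (illustrative names, not the actual fft3d_kokkos.cpp change):

#include <complex>
#include <cstdint>
#include <oneapi/mkl/dfti.hpp>
#include <Kokkos_Core.hpp>

namespace dft = oneapi::mkl::dft;

// In-place 1D complex-to-complex forward FFT submitted to the queue
// Kokkos already uses, so it is ordered with the surrounding kernels.
void forward_1d(const Kokkos::Experimental::SYCL &space,
                std::complex<double> *inout, std::int64_t n)
{
  sycl::queue q = space.sycl_queue();  // assumed accessor name

  dft::descriptor<dft::precision::DOUBLE, dft::domain::COMPLEX> desc(n);
  desc.commit(q);                            // bind the plan to the Kokkos queue
  dft::compute_forward(desc, inout).wait();  // USM pointer; returns a sycl::event
}

A real 3D implementation would keep the committed descriptors around between calls instead of rebuilding the plan for every transform.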

Does this file exist in your oneapi installation? /opt/intel/oneapi/mkl/latest/examples/examples_dpcpp.tgz
It has some examples for 1D FFTs with oneapi/mkl.

Yes, I have one of those.

OK, the dpcpp/dft/source/dp_complex_1d.cpp example is good. I’m asking on the Kokkos Slack about getting the SYCL queue through Kokkos, and will try to get this working quickly if possible.

I am running experiments with the build that I have, and I found a confusing behavior with MPI. Even though I am using Intel MPI (not MPICH), I get this warning:

WARNING: Detected MPICH. Disabling GPU-aware MPI (../kokkos.cpp:341)

And the performance is bad: increasing the number of threads and/or the number of GPUs causes performance to decrease.

If I try to force it with -pk kokkos gpu/aware on, I just get segmentation faults.

This tells me that my LAMMPS build doesn’t understand Intel’s GPU-aware MPI.

Intel’s website says:

“GPU aware features of Intel MPI Library can be used from within both SYCL and OpenMP-offload based applications.”

What do you think? Did I do something wrong, or is this another feature that’s not supported yet?

I can’t comment on the level of support in Intel MPI from the public SDKs. I can say that the top portion of kokkos.cpp will need to be updated to detect GPU awareness and assign devices to MPI ranks (or you can do the MPI-GPU binding yourself).

For GPU-aware testing, I would take a step back and try a simple code to make sure it works in your local setup.
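For example, something like this minimal smoke test (a hypothetical sketch, not from LAMMPS): rank 0 passes a device-resident USM pointer directly to MPI_Send, which only succeeds if the MPI library is actually GPU-aware:

#include <mpi.h>
#include <sycl/sycl.hpp>
#include <cstdio>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  sycl::queue q{sycl::gpu_selector_v};
  double *buf = sycl::malloc_device<double>(1, q);  // device-only allocation

  if (rank == 0) {
    double one = 1.0;
    q.memcpy(buf, &one, sizeof(double)).wait();
    MPI_Send(buf, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);  // device pointer!
  } else if (rank == 1) {
    MPI_Recv(buf, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    double out = 0.0;
    q.memcpy(&out, buf, sizeof(double)).wait();
    std::printf("rank 1 received %g\n", out);  // expect 1
  }

  sycl::free(buf, q);
  MPI_Finalize();
  return 0;
}

If this segfaults inside MPI_Send/MPI_Recv, the library is dereferencing the device pointer on the host and is not GPU-aware in your configuration.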

What input deck are you testing, and what are you comparing against? Multi-core CPU? A100?

Intel MPI is a derivative of MPICH; see Intel® MPI Library. If you are getting segmentation faults with -pk kokkos gpu/aware on, then your MPI really isn’t GPU-aware yet; it may just need an environment variable set, for example export I_MPI_OFFLOAD=1. See: GPU Support and Intel® MPI for GPU Clusters.

We don’t auto-detect I_MPI_OFFLOAD=1 in LAMMPS yet, but setting that plus -pk kokkos gpu/aware on could work.
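If we do add auto-detection, it could mirror the existing MPICH_GPU_SUPPORT_ENABLED check near the top of kokkos.cpp. A rough sketch, with a made-up helper name:

#include <cstdlib>

// Sketch: treat I_MPI_OFFLOAD=1 as a hint that Intel MPI is GPU-aware,
// analogous to checking MPICH_GPU_SUPPORT_ENABLED for Cray MPICH.
static bool intel_mpi_gpu_aware()
{
  const char *str = std::getenv("I_MPI_OFFLOAD");
  return str && std::atoi(str) > 0;
}

In the meantime you would still export I_MPI_OFFLOAD=1 in the environment and pass -pk kokkos gpu/aware on on the LAMMPS command line.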

Another note: export MPICH_GPU_SUPPORT_ENABLED=1 works on Aurora, Frontier, Polaris, and Perlmutter, but I’m not sure whether that is general or specific to those machines, which have Cray MPICH.

Howdy folks,

Have you all seen my recent short paper from PEARC '24?

Performance of Molecular Dynamics Acceleration Strategies on Composable Cyberinfrastructure

It basically says:

“Dear Intel, please support Kokkos, we really need it for LAMMPS.”

@rarensu thanks for the link, will check it out.

@rarensu the CMake and FFT issues with Kokkos SYCL on Intel GPUs should be fixed by: Intel GPU updates: kspace & cmake by cjknight · Pull Request #4313 · lammps/lammps · GitHub.