LAMMPS with Kokkos using SYCL for Intel PVC GPU

Great news, guys! I finally got this running. Here’s what I had to do.

  1. The oneAPI HPC Toolkit, from the container
    intel/oneapi-hpckit:2025.0.0-0-devel-ubuntu24.04

  2. LAMMPS commit 43fbdc2d9385715ac01f9218defc5beca0afc853
    (nearest tag: patch_19Nov2024-2-g43fbdc2d93)

  3. cmake -C ./my-kokkos-sycl-intel.cmake -DPKG_MOLECULE=on -DPKG_RIGID=on -DPKG_KSPACE=on -DPKG_ML-SNAP=on -DFFT=MKL -DFFT_KOKKOS=MKL_GPU -DFFT_SINGLE=on ../cmake

  4. I modified that preset, my-kokkos-sycl-intel.cmake, to include the settings that enable ahead-of-time (AOT) kernel compilation. I also turned off OpenMP because it’s nothing but trouble.

my-kokkos-sycl-intel.cmake (1.5 KB)
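The exact preset contents are in the attached file; for readers without it, here is a sketch of roughly equivalent flags passed directly on the configure line. The option names are real LAMMPS/Kokkos CMake variables, but treat the combination as an assumption about what the preset sets, not a copy of it:

```shell
# Sketch of a SYCL + AOT configure line for Intel PVC (assumed; not the actual preset).
# Kokkos_ARCH_INTEL_PVC makes Kokkos emit ahead-of-time (AOT) kernels for PVC,
# and OpenMP is disabled on both the LAMMPS and Kokkos sides.
cmake -DCMAKE_CXX_COMPILER=icpx \
      -DPKG_KOKKOS=on \
      -DKokkos_ENABLE_SYCL=ON \
      -DKokkos_ARCH_INTEL_PVC=ON \
      -DKokkos_ENABLE_OPENMP=OFF \
      -DBUILD_OMP=no \
      ../cmake
```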

I haven’t tested all the features yet, but I confirmed that the Rhodopsin benchmark runs on a single GPU.

2 Likes

@rarensu Good to hear!

It turns out that the cmake file I uploaded is no good. The hardcoded values for MPI are incorrect, so no parallel runs were possible. Annoyingly, instead of complaining about it, cmake silently fell back to the built-in MPI stubs. After I commented out the line setting the MPI executable and instead turned on BUILD_MPI so that cmake would find MPI on its own, cmake located and used the real library.

In addition to the variables @stamoor suggested (Jan 4), I also needed to unset ZE_AFFINITY_MASK. For some reason, the latest Intel MPI refuses to do GPU-aware communication (gpu/aware) when that variable is set.
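For concreteness, a minimal launch sketch. The ZE_AFFINITY_MASK unset and the gpu/aware switch are from this thread; I_MPI_OFFLOAD=1 is my assumption standing in for the variables suggested earlier (the exact list is in that post), and the GPU count, binary path, and input file are placeholders:

```shell
# Hypothetical multi-GPU launch (paths, counts, and input file are placeholders).
unset ZE_AFFINITY_MASK      # latest Intel MPI refuses GPU-aware comm if this is set
export I_MPI_OFFLOAD=1      # assumption: enables GPU support in Intel MPI
mpirun -np 4 ./lmp -k on g 4 -sf kk -pk kokkos gpu/aware on -in in.rhodo
```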

After rebuilding and setting all these annoying variables, I can now fully parallelize across multiple PVC GPUs. Even kspace for the Rhodopsin benchmark is working correctly. This is very exciting indeed.

I put my build in a container. If someone you know has PVCs and would like to try my build, I can share it.

1 Like

Howdy Friends,

I have done enough benchmarks to find the next major problem.

[Plot: timing, Rhodopsin, optimal atoms, single GPU, Kokkos]

The Modify time is way too high, and additional PVC GPUs don’t add much performance either. I suspect that one of the fix commands is running on the CPU or something like that.

Recall that Rhodopsin uses the npt and shake fixes.

Do you have any knowledge of the current state of Intel GPU support in the Kokkos variants of those two fixes?

There is no difference in style support in the KOKKOS package between backends, so any style that ran on an NVIDIA GPU also ran on the Intel GPU, not on the CPU. I think something else is going on that we don’t understand yet. Can you use kokkos-tools, e.g. the space-time-stack tool, to profile and get more info? See https://github.com/kokkos/kokkos-tools
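Using kokkos-tools means pointing Kokkos at the tool’s shared library through an environment variable before running; a sketch, with the library path as a placeholder for wherever kokkos-tools was built:

```shell
# Load the kokkos-tools space-time-stack profiler (library path is a placeholder).
export KOKKOS_TOOLS_LIBS=$HOME/kokkos-tools/profiling/space-time-stack/kp_space_time_stack.so
# Run normally; the space-time-stack report is printed when Kokkos finalizes.
./lmp -k on g 1 -sf kk -in in.rhodo
```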

You could also use iprof or VTune to profile LAMMPS on Intel GPUs. @Christopher_Knight may be able to give some pointers on this.
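Hedged sketches of both routes (exact command-line options may differ by tool version; the binary and input file are placeholders):

```shell
# VTune GPU hotspots collection (requires oneAPI VTune; flags may vary by version).
vtune -collect gpu-hotspots -result-dir vtune_rhodo -- ./lmp -k on g 1 -sf kk -in in.rhodo

# iprof (from THAPI) traces Level Zero/SYCL API calls and kernel execution times.
iprof ./lmp -k on g 1 -sf kk -in in.rhodo
```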

This is probably the atom-sort-on-host issue we’re trying to resolve.

1 Like

I got you some profiling results to look at.
job.323950 (17.5 KB)

Hmmmm, the profile on GPU looks fine. Maybe a little initialization overhead that we could get rid of, but otherwise very reasonable. So that means the overhead is on the host CPU, which kokkos-tools won’t pick up. So you need to profile the host CPU. I typically use gprof by compiling with -pg but maybe there is a better tool for Intel, not sure.
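A sketch of that gprof route, assuming the compiler accepts -pg (icpx is Clang-based, so it should, but check; binary and input file are placeholders):

```shell
# Rebuild with host-side profiling instrumentation.
cmake -DCMAKE_CXX_FLAGS="-pg" -DCMAKE_EXE_LINKER_FLAGS="-pg" ../cmake
cmake --build . -j
# Run normally; this writes gmon.out in the working directory.
./lmp -k on g 1 -sf kk -in in.rhodo
# Summarize host-CPU hotspots.
gprof ./lmp gmon.out | head -n 30
```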

Notes from Kokkos Slack about atom sorting.

Daniel Arndt Dec 18th, 2024 at 1:26 AM
I’m still not quite sure why the host fallback is used. This should only happen if the View is not contiguous for the SYCL backend. In any case, kokkos/kokkos PR #7502 (“oneDPL: Sort on device using Kokkos::RandomAccessIterator”) should make the behavior uniform with the CUDA and HIP backends, but it requires a oneDPL version that isn’t contained in the latest oneAPI release.

Daniel Arndt Jan 8th at 2:53 PM
The issue was fixed in uxlfoundation/oneDPL PR #1927 (“Make kernel names unique in radix sort”), which is merged into main and is part of the latest oneAPI patch release, 2025.0.1 (but it won’t get detected for that patch release because they missed updating the version numbers, so you really need main).

Notes from Kokkos Slack about atom masking.

Stan Moore Yesterday at 3:16 PM
This is the overhead due to parallel_reduce vs parallel_for

Stan Moore Yesterday at 3:28 PM
I added the perf regression fix here: lammps/lammps PR #4440 (“Collected small changes and fixes”).

With these fixes, the Modify time is now reasonable for Rhodopsin.

Loop time of 545.801 on 1 procs for 10000 steps with 512000 atoms

Performance: 3.166 ns/day, 7.581 hours/ns, 18.322 timesteps/s, 9.381 Matom-step/s
92.0% CPU use with 1 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 321.88     | 321.88     | 321.88     |   0.0 | 58.97
Bond    | 18.84      | 18.84      | 18.84      |   0.0 |  3.45
Kspace  | 94.211     | 94.211     | 94.211     |   0.0 | 17.26
Neigh   | 47.806     | 47.806     | 47.806     |   0.0 |  8.76
Comm    | 8.4474     | 8.4474     | 8.4474     |   0.0 |  1.55
Output  | 0.04998    | 0.04998    | 0.04998    |   0.0 |  0.01
Modify  | 38.777     | 38.777     | 38.777     |   0.0 |  7.10
Other   |            | 15.79      |            |       |  2.89
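As a quick arithmetic sanity check, the derived numbers in the log are consistent with the raw loop time (the 2 fs timestep is the standard Rhodopsin benchmark setting, assumed here):

```shell
# Recompute derived quantities from the raw loop time (values copied from the log above).
awk 'BEGIN {
  loop = 545.801; steps = 10000; dt_ns = 2.0e-6   # 2 fs timestep (assumed)
  printf "timesteps/s : %.3f\n", steps/loop                # log says 18.322
  printf "ns/day      : %.3f\n", steps/loop*dt_ns*86400    # log says 3.166
  printf "Pair %%total : %.2f\n", 321.88/loop*100          # log says 58.97
  printf "Modify %%tot : %.2f\n", 38.777/loop*100          # log says 7.10
}'
```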

For now, I think we should consider this thread resolved.

1 Like