Kspace GPU acceleration possible?

Hello,

I am trying to accelerate my LAMMPS simulations using a GPU, but I am observing that the KSpace portion dominates the runtime and appears to remain on the CPU, limiting performance gains.

System details:

  • LAMMPS version: 2 Apr 2025
  • ~104k atoms (SPC/E water confined between copper walls)
  • boundary p p f with slab correction
  • atom_style full
  • SHAKE constraints on water
  • pair_style lj/cut/coul/long 10.0

KSpace settings:
kspace_style pppm 1.0e-4
kspace_modify slab 3.0

Hardware:

  • NVIDIA GeForce RTX 5090 (28/31 GB, 2.7 GHz)
  • GPU package build (lmp_fortran_mpi_gpu)

Execution command:
export OMP_NUM_THREADS=8
mpirun -np 1 /sw/pub/apps/lammps_fortran_mpi_gpu/bin/lmp_fortran_mpi_gpu \
-sf gpu -pk gpu 1 -in test

Performance breakdown (MPI task timing):
Section | %total

Pair | 2.42
Bond | 1.27
Kspace | 89.51
Neigh | 4.98
Other | 1.66

Observations:

  • KSpace accounts for ~90% of the total runtime
  • Pair interactions (presumably GPU-accelerated) are a small fraction
  • Increasing MPI ranks changes total runtime but does not significantly reduce the KSpace fraction

Questions:

  1. Is PPPM (especially with slab correction) expected to run primarily on the CPU when using the GPU package?
  2. Is there a way to accelerate the KSpace calculations using the GPU (e.g., via KOKKOS or other packages)?
  3. Are there recommended strategies to reduce the KSpace bottleneck for elongated systems with slab correction?

Any guidance would be appreciated.

This is not entirely correct. What you see in the “Performance breakdown” is only the time the CPU spends on, or waiting for, each section. With the GPU package, the pair style runs on the GPU concurrently with Bond and KSpace on the CPU, and it seems to finish before KSpace is done, so very little time is charged to Pair. I suggest you make a run with lj/cut/coul/cut and without KSpace and compare.
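A minimal sketch of such a comparison input, assuming the script uses the single pair_style line listed in the question and that its pair_coeff lines carry over unchanged. It is meant only as a timing diagnostic, not as a physically equivalent setup:

```lammps
# Diagnostic run only: truncated Coulomb, no long-range solver.
pair_style    lj/cut/coul/cut 10.0
# (keep the existing pair_coeff lines here; assumed compatible)

kspace_style  none
# kspace_modify slab 3.0   # commented out: it needs an active kspace style
```

With KSpace gone, the Pair entry in the breakdown should reflect how long the CPU actually waits for the GPU pair computation.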

Also, rather than looking at percentages, you should look at the absolute time. So to determine how efficient your simulation setup is in general, you should just run without the gpu suffix and compare.

No surprise there. As explained above, you don’t “see” how much time is spent on computing the Pair style because it runs concurrently with KSpace.

Yes.

You can try KOKKOS, but it is difficult to predict which approach is more efficient. Your LAMMPS version only supports KOKKOS with full double precision; you would need to upgrade to a recent release to take advantage of mixed-precision math, which is crucial for consumer GPUs.
Also, KOKKOS requires one GPU per MPI rank, and good performance with multiple GPUs depends on a CUDA-aware MPI library installation.

Your command line leaves a significant potential performance gain on the table because it does not use OpenMP properly in combination with the GPU. For that you would need to do:

mpirun -np 1 lmp -sf hybrid gpu omp -pk gpu 1 -in test
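For reference, the same switches can also be written into the input file. This is a sketch, assuming the commands are placed before the simulation box is defined (before read_data or create_box) and that the thread count is still taken from OMP_NUM_THREADS:

```lammps
# In-input counterpart of "-sf hybrid gpu omp -pk gpu 1" (sketch).
package  gpu 1            # use one GPU
package  omp 0            # 0 = take the thread count from OMP_NUM_THREADS
suffix   hybrid gpu omp   # try the /gpu variant of a style first, then /omp
```

With the hybrid suffix, styles that have no GPU variant fall back to their OpenMP variant instead of the plain serial one.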

When using multiple MPI processes, it is often beneficial not to use the (partial) GPU support for pppm but to run it entirely on the CPU. You can improve the performance balance by shifting more work from the KSpace part to the GPU by carefully increasing the Coulomb cutoff.
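Both adjustments could look roughly like the following sketch. The Coulomb cutoff of 12.0 is an assumed illustrative value, not a recommendation, and should be tuned by comparing absolute loop times:

```lammps
# Illustrative values only; 12.0 is an assumed starting point.
pair_style    lj/cut/coul/long 10.0 12.0   # LJ cutoff 10.0, Coulomb cutoff 12.0
# (keep the existing pair_coeff lines here)

suffix        off                # the next style ignores the command-line suffix
kspace_style  pppm 1.0e-4        # plain CPU pppm instead of pppm/gpu
suffix        on                 # re-enable the suffix for later styles
kspace_modify slab 3.0
```

Because the requested accuracy stays the same, a longer real-space cutoff lets PPPM choose a coarser grid, which moves work from the CPU-side KSpace part to the GPU-side pair computation.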

Please note that for your system, the cost of KSpace is generally substantial. In PPPM the cost is not dominated by the number of atoms (or atoms per subdomain) but rather by the number of grid points (which depends on the volume, the interpolation order, and the reciprocal space cutoff or energy/force convergence). With the slab correction, you triple the volume and thus your computational cost goes up. Please also note that the FFTs scale as O(n log(n)) and the required grid transposes for the 3d FFT scale even worse.
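As a rough worked estimate of the FFT part alone, assume a fixed grid spacing, so that a slab factor of 3 triples the number of grid points $N_g = n_x n_y n_z$, and take $N_g = 10^6$ purely as an illustrative value (not the poster's actual grid):

\[
\frac{C_{\text{slab}}}{C_{\text{no slab}}}
= \frac{3 N_g \ln(3 N_g)}{N_g \ln N_g}
= 3\left(1 + \frac{\ln 3}{\ln N_g}\right)
\approx 3\left(1 + \frac{1.10}{13.8}\right)
\approx 3.2
\]

So the FFT cost grows somewhat faster than the volume, and this estimate does not yet include the grid transposes. The grid that PPPM actually selects is printed in the log during PPPM initialization.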