Kokkos Lammps on GPU machine - negative scaling from 1 to 2 nodes

Dear all,

I have been running some tests of Kokkos LAMMPS on a machine with the following characteristics:
Cores per node: 32
Sockets per node: 2
Hyperthreading: 4 per core
GPUs per node: 4

LAMMPS version: LAMMPS (15 Apr 2020)

The relevant slurm directives are the following:
#SBATCH --ntasks-per-node=4
#SBATCH --ntasks-per-socket=2
#SBATCH --cpus-per-task=32
#SBATCH --gres=gpu:4
#SBATCH --mem=50000

The number of OMP threads is 16 (export OMP_NUM_THREADS=16)
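For reference, the directives and environment above combine into a single batch script. A minimal sketch, assuming a bash shell and using only the settings quoted in this thread (the shebang line is the only addition):

```
#!/bin/bash
#SBATCH --ntasks-per-node=4
#SBATCH --ntasks-per-socket=2
#SBATCH --cpus-per-task=32
#SBATCH --gres=gpu:4
#SBATCH --mem=50000

# 16 OpenMP threads per MPI rank, as described above
export OMP_NUM_THREADS=16

# 1-node run: 4 MPI ranks (one per GPU), 16 Kokkos threads each
time mpirun -gpu -np 4 --map-by socket:PE=8 --rank-by core \
    lmp_kokkos_cuda_mpi_omp -k on t 16 g 4 -sf kk -in run.in
```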

I ran a short simulation of 20000 time steps on 1 and 2 nodes to check the performance.
For the calculation on 1 node, I used the following command line:

time mpirun -gpu -np 4 --map-by socket:PE=8 --rank-by core lmp_kokkos_cuda_mpi_omp -k on t 16 g 4 -sf kk -in run.in

This resulted in the following MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %CPU | %total

Please note that in both cases the time spent in Kspace is dominating. That is often an indication that your system is far too small for the number of MPI ranks in use. Kspace parallelization is tricky: it can only be decomposed in 2d, not in 3d like the domain decomposition, and then you need 6 FFTs with 4 transposes plus steps to pack/unpack the data from the domain-decomposition distribution to the "pencil" or "stick" distribution that the FFTs need. While the FFTs themselves scale quasi-linearly, the communication overhead grows steeply with the number of ranks.

This can be compensated to some degree by using a larger cutoff (which shifts work from Kspace to the pair computation), but for small systems it is then best to use only one MPI rank, or to try the GPU package instead of KOKKOS (and run Kspace on the CPU, perhaps using multi-threading or verlet/split time integration).

Axel.
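A sketch of what that suggestion could look like as a LAMMPS input fragment; the pair style, cutoff, and PPPM tolerance here are illustrative placeholders, not values from this thread:

```
# GPU package instead of KOKKOS: pair runs on the GPUs, Kspace stays on the CPU
package gpu 4                      # use the 4 GPUs per node
suffix gpu                         # append /gpu to styles that support it
pair_style lj/cut/coul/long 12.0   # larger cutoff moves work from Kspace to pair
kspace_style pppm 1.0e-4           # plain pppm: runs on the CPU
run_style verlet/split             # optional: dedicates a rank partition to Kspace
```

Note that run_style verlet/split also requires launching LAMMPS with the -partition command-line switch so that the ranks can be split between pair and Kspace.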


With Kokkos, you can run KSpace on the CPU with multi-threading as well if you use "kspace_style pppm/kk/host". See https://lammps.sandia.gov/doc/Speed_kokkos.html, section "Using OpenMP threading and CUDA together". It will also overlap the pair and kspace computation, but you need to compile with this flag in CCFLAGS: "--default-stream per-thread".
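Concretely, that suggestion amounts to two small changes; in this sketch the PPPM tolerance is a placeholder, and the "..." stands for whatever CCFLAGS your Makefile already contains:

```
# in the input script: force PPPM onto the host (OpenMP) backend,
# while -sf kk on the command line keeps the pair style on the GPU
kspace_style pppm/kk/host 1.0e-4

# in the machine Makefile, so pair (GPU) and kspace (host) can overlap:
CCFLAGS = ... --default-stream per-thread
```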

Stan