Device time info with KOKKOS-GPU and performance issue with KOKKOS

I am running a LAMMPS job using Kokkos with 4 K80 GPUs.

  1. I didn’t find any Device Time Info in the output with Kokkos, unlike what we get when using only the GPU package, like the following:


In my case, for a 10M-atom LJ job, I am getting a 4x speedup per node over an MPI-only run when using the GPU package. But when using Kokkos-GPU, I am getting a speedup of only a factor of 2 with respect to the MPI-only run.

This means I am losing performance when using KOKKOS. I was expecting a larger performance gain than with the GPU package!
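To put a number on that gap, here is a minimal sketch of the arithmetic. It assumes both speedups were measured against the same MPI-only baseline on the same node; the function and variable names are only illustrative.

```python
def relative_speed(speedup_a, speedup_b):
    """Speed of run A relative to run B, given each run's speedup
    over the same baseline (here: the MPI-only run)."""
    return speedup_a / speedup_b

# Speedups over the MPI-only run, as reported above.
gpu_package_speedup = 4.0
kokkos_speedup = 2.0

ratio = relative_speed(kokkos_speedup, gpu_package_speedup)
print(f"Kokkos runs at {ratio:.0%} of the GPU package speed")
print(f"Kokkos wall time is {1 / ratio:.1f}x the GPU package wall time")
```

So under that assumption the Kokkos run takes about twice as long as the GPU package run for the same job.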

Which of the two packages performs better depends on multiple factors: floating-point precision settings of the kernels, performance of the CPU subsystem, memory bandwidth, bus interface of the GPU, occupancy of the GPU, amount of data transfer between host and device, number of MPI ranks versus number of threads, processor affinity, and so on. All of these factors affect performance differently depending on the simulation and on which features are used.

Any suggestion why this happens?

How did you compile the GPU library? With the default setting of mixed precision, or using double precision?
To the best of my knowledge, KOKKOS currently builds using double precision only, and that may explain the difference in performance.