For the Kokkos package, the only suggestion I have is to use a binsize equal to the neighbor cutoff, i.e. 7.3 for your problem. You can set that from the command line: "-pk kokkos binsize 7.3". However, I doubt that will tip the balance. What you really need is more GPUs for the Kokkos package to work well. You are really only using 2 CPU cores x 2 GPUs. Except in very special cases, Kokkos uses 1 MPI rank x 1 OpenMP thread per GPU. Your "-t 24" option is NOT giving you 24 OpenMP threads, but a single OpenMP thread instead (I need to make that clearer in the docs).
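
To make that concrete, here is a rough sketch of what such a launch line might look like for 2 GPUs, with one MPI rank and one OpenMP thread per GPU. The executable name (lmp), the input file name (in.script) and the mpirun launcher are assumptions on my part, so adjust them to your build and cluster. The snippet only assembles and prints the command so you can check it before running it.

```shell
# Assumed names, not taken from your setup: change as needed.
LMP=lmp            # hypothetical LAMMPS executable name
INPUT=in.script    # hypothetical input script
NGPUS=2            # number of GPUs on the node
BINSIZE=7.3        # neighbor cutoff for this problem

# One MPI rank per GPU ("-np $NGPUS", "g $NGPUS") and one OpenMP thread ("t 1").
# "-sf kk" applies the kk suffix to supported styles; "-pk kokkos binsize" sets the bin size.
CMD="mpirun -np $NGPUS $LMP -k on g $NGPUS t 1 -sf kk -pk kokkos binsize $BINSIZE -in $INPUT"

# Print the assembled command for inspection.
echo "$CMD"
```

Once the printed line looks right, paste it into your job script or run it with eval "$CMD".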