LAMMPS GPU vs. kokkos/cuda performance

Hi all,

I recently managed to compile LAMMPS with the Kokkos package on my desktop, which has a single Nvidia GTX 970. I followed the instructions I found here: http://www.hpcadvisorycouncil.com/pdf/LAMMPS_KOKKOS_Best_Practices.pdf

However, the performance of Kokkos/CUDA seemed to be much worse than that of the GPU package for the standard melt example; the difference was a factor of 3 or so. I was wondering whether I did something suboptimal. I don't have access to my desktop at the moment, so I can't provide much more info right now, sorry.

> Hi all,
>
> I recently managed to compile LAMMPS with the Kokkos package on my desktop,
> which has a single Nvidia GTX 970. I followed the instructions I found here:
> http://www.hpcadvisorycouncil.com/pdf/LAMMPS_KOKKOS_Best_Practices.pdf

I would recommend not paying much attention to any output from that
source. Most whitepapers and benchmark comparisons I have seen from
them were commissioned by specific vendors or contributed by people
employed by such vendors, and showed a significant, sometimes even
extreme, bias toward specific configurations or setups from that
vendor. In some cases it was obvious that the authors of the
whitepaper must have spent quite a significant effort on identifying
benchmarks that make their preferred configuration look competitive
when, in the general case, it was not.

> However, the performance of Kokkos/CUDA seemed to be much worse than that
> of the GPU package for the standard melt example; the difference was a
> factor of 3 or so. I was wondering whether I did something suboptimal. I
> don't have access to my desktop at the moment, so I can't provide much
> more info right now, sorry.

The KOKKOS and GPU packages have somewhat different performance
characteristics, depending on how they are run (number of MPI tasks
vs. number of GPUs vs. number of threads) and the size of the
problem.
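
For illustration, single-GPU launch lines for the two packages look
roughly like this (only a sketch; "mpirun" and "lmp" stand in for
whatever MPI launcher and LAMMPS binary your build produced, and
in.melt is the bundled melt input):

  # GPU package: several MPI tasks can share the one GPU
  mpirun -np 4 lmp -sf gpu -pk gpu 1 -in in.melt

  # KOKKOS package: a single MPI task driving the one GPU
  mpirun -np 1 lmp -k on g 1 -sf kk -in in.melt

Varying the MPI task and thread counts is exactly where the two
packages start to behave differently.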

I have found recent versions of the KOKKOS package to perform quite
competitively, even in cases where the GPU package has an advantage.

axel.

Stefan

For Kokkos/CUDA in general, it is good to use a neighbor binsize equal to the neighbor skin plus the cutoff, which is 2.8 in LJ units for the standard melt example. That can be accomplished with the package command from the command line: "-pk kokkos binsize 2.8". The defaults of a recent version of the KOKKOS package already favor a GPU configuration (newton off, full neighbor list, threaded communication), so those shouldn't need to be changed for standard melt. I would also recommend using only a single MPI task per GPU for the standard melt example, since the computation will be running almost entirely on the GPU.
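
Put together, a single-GPU melt run would then be launched along these
lines (only a sketch; "lmp" stands for your LAMMPS binary and in.melt
for the bundled melt input):

  # one MPI task, one GPU, KOKKOS defaults plus the larger binsize
  mpirun -np 1 lmp -k on g 1 -sf kk -pk kokkos binsize 2.8 -in in.melt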

For more complicated many-body potentials running with Kokkos, we have found that a half neighbor list can sometimes be faster than a full one on the GPU: "-pk kokkos neigh half". Similarly, if you want to use Kokkos on CPUs with a few OpenMP threads, you will likely need to change the KOKKOS package options from their GPU-oriented defaults to get good performance (use newton on and a half neighbor list: "-pk kokkos neigh half newton on").
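
For example (again only a sketch; in.manybody is a hypothetical
placeholder for whatever many-body input you are running):

  # GPU run of a many-body potential with a half neighbor list
  mpirun -np 1 lmp -k on g 1 -sf kk -pk kokkos neigh half -in in.manybody

  # CPU-only KOKKOS run: 2 MPI tasks with 4 OpenMP threads each
  mpirun -np 2 lmp -k on t 4 -sf kk -pk kokkos neigh half newton on -in in.melt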

Stan

Hi Axel and Stan,

Thanks a lot for your detailed comments. I will give these a try when I have time. My desktop has a GTX 970 and an Intel Core i7-6700K; would you be interested in benchmark results?

> would you be interested in benchmark results?

Yes