[EXTERNAL] Re: Kokkos GPU performance for the data Stillinger-Weber(SW)

There is a reason for this timing breakdown. We are working on overlapping GPU and CPU computation (asynchronous forces) in Kokkos, so when using GPUs the timing breakdown may not be accurate: pair time can be reported as comm time. You can disable this overlap by setting the environment variable “CUDA_LAUNCH_BLOCKING=1”. That should give you an accurate timing breakdown, but the run will also be slower because the host then has to wait for each GPU kernel to finish before continuing. Sorry for the confusion. If you try it out, you should find the comm time much more reasonable when using CUDA_LAUNCH_BLOCKING=1.

Stan

I added a warning for this in the output to alert users.

Stan

Mike, I don’t think this is slower than expected if you don’t use USER-INTEL. It is better than what I get on 24 cores of Sandy Bridge with vanilla LAMMPS.

Rengan, I ran this Stillinger-Weber benchmark (in.intel.sw) with “CUDA_LAUNCH_BLOCKING=1” and “-v m 0.1”, so it runs for one-tenth of the original timesteps, on the 2 internal GPUs of a K80. The comm time is about 10%, which is reasonable (see below). I used “-pk kokkos neigh half newton on binsize 4.77” as the Kokkos package options.
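For anyone who wants to reproduce a run like this, here is a minimal Python sketch that assembles the command line and environment without launching anything. Only “CUDA_LAUNCH_BLOCKING=1”, “-v m 0.1”, the “-pk kokkos ...” options, and the input file name come from this thread. The launcher (mpirun), the executable name (lmp), one MPI rank per GPU, and the “-k on g 2 -sf kk” switches are assumptions about a typical Kokkos/CUDA build, and the helper name build_lammps_run is hypothetical.

```python
import os
import shlex


def build_lammps_run(n_gpus=2, scale=0.1, blocking=True):
    """Return (argv, env) for a Kokkos/CUDA run of in.intel.sw (nothing is executed)."""
    env = dict(os.environ)
    if blocking:
        # Makes CUDA kernel launches synchronous, so the pair/comm
        # timers reflect where the time is actually spent.
        env["CUDA_LAUNCH_BLOCKING"] = "1"
    else:
        env.pop("CUDA_LAUNCH_BLOCKING", None)

    argv = [
        "mpirun", "-np", str(n_gpus),   # assumption: one MPI rank per GPU
        "lmp",                           # assumption: executable name
        "-k", "on", "g", str(n_gpus),    # assumption: Kokkos enabled with n_gpus GPUs
        "-sf", "kk",                     # assumption: use the kk-suffixed styles
        # Kokkos package options quoted in this thread:
        "-pk", "kokkos", "neigh", "half", "newton", "on", "binsize", "4.77",
        "-v", "m", str(scale),           # run one-tenth of the original timesteps
        "-in", "in.intel.sw",
    ]
    return argv, env


if __name__ == "__main__":
    argv, env = build_lammps_run()
    # Print a copy-pasteable shell line instead of launching the job.
    print("CUDA_LAUNCH_BLOCKING=" + env["CUDA_LAUNCH_BLOCKING"], shlex.join(argv))
```

Calling build_lammps_run(blocking=False) gives the same command with the variable unset, which is the configuration where pair time can show up under comm. Adjust the launcher and executable name to match your own installation.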

Step       Temp      E_pair  E_mol      TotEng     Press
  10  551.81732  -2190612.6      0  -2154092.8  7812.163
 630  507.10736  -2187669.7      0  -2154108.8  5763.333

Loop time of 38.805 on 2 procs for 620 steps with 512000 atoms

Performance: 1.380 ns/day, 17.386 hours/ns, 15.977 timesteps/s

85.8% CPU use with 2 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:

Section | min time | avg time | max time |%varavg| %total