Kokkos GPU performance for Stillinger-Weber (SW) data

Dear Lammpsers,

I am evaluating the performance of the LAMMPS Kokkos GPU library with different datasets. I get pretty good performance with the classic LJ data, but the performance is poor with some other datasets.

The datasets I am using are in the folder lammps/src/USER-INTEL/TEST. Except for in.intel.lj, the performance of the other datasets is poor or not good enough. I am using lammps-14May16, which is the latest stable version; I also tried the newer version from GitHub but got the same poor performance. I am using CUDA 7.5, with MVAPICH2 2.2rc1 and OpenMPI 2.0.0 as MPI libraries. I use in.intel.sw as an example. The following is the timing breakdown for SW with the different versions.


Version   MPI       GPU       Kokkos    Kokkos
Pair      342.54    577.36    0.5129    0.5138
Neigh     5.4962    0.3411    47.295    47.299
Comm      30.924    13.097    309.84    311.65
Output    0.0009    0.0079    0.0015    0.0016
Modify    2.6189    28.768    3.8833    3.9189
Other     2.382     16.53     1.233     1.282

The MPI version used 24 processes. "GPU" is the timing with the GPU package. The Kokkos version used CUDA as the device. Both the GPU and Kokkos versions used 2 MPI processes and 2 GPUs.

The "Pair" and "Neigh" times between GPU and Kokkos differ a lot because I added the command "package kokkos neigh half newton on comm device" to the input script. Before adding this command, the default neigh was "full", and the "Pair" and "Neigh" times in Kokkos were 186.75 and 2.8761, respectively.

Now my question is why the communication cost is so high in Kokkos. Although I used "comm device", the performance is the same as with "comm host". I have already built with a CUDA-aware MPI, but it seems the communication did not benefit from peer-to-peer access. Has anyone observed a similar issue? Or do I need to change some commands in the input script?
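For what it's worth, this is how I double-checked that the MPI build itself is CUDA-aware (the ompi_info parameter name is from the OpenMPI FAQ; for MVAPICH2 the support is compiled in but must also be enabled at run time):

```shell
# OpenMPI: check whether this build was configured with CUDA support
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
# should report ...:value:true for a CUDA-aware build

# MVAPICH2: CUDA-aware transfers must also be switched on at run time
export MV2_USE_CUDA=1
```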

To make it easier, the in.intel.sw input script is posted as follows:

# bulk Si via Stillinger-Weber

package kokkos neigh half newton on comm device

variable w index 10     # Warmup Timesteps
variable t index 6200   # Main Run Timesteps
variable m index 1      # Main Run Timestep Multiplier
variable n index 0      # Use NUMA Mapping for Multi-Node
variable p index 0      # Use Power Measurement

variable x index 2
variable y index 2
variable z index 4

variable xx equal 20*$x
variable yy equal 20*$y
variable zz equal 10*$z
variable rr equal floor($t*$m)
variable root getenv LMP_ROOT

if "$n > 0" then "processors * * * grid numa"

units metal
atom_style atomic

lattice diamond 5.431
region box block 0 ${xx} 0 ${yy} 0 ${zz}
create_box 1 box
create_atoms 1 box

pair_style sw
pair_coeff * * ${root}/bench/POTENTIALS/Si.sw Si
mass 1 28.06

velocity all create 1000.0 376847 loop geom
neighbor 1.0 bin
neigh_modify delay 5 every 1

fix 1 all nve
thermo 1000
timestep 0.001

if "$p > 0" then "run_style verlet/power"
if "$w > 0" then "run $w"
run ${rr}
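In case it matters, this is roughly how I launch the run (the binary name is whatever your Kokkos CUDA build produced; since the package command is already in the script, no -pk option is needed on the command line):

```shell
# 2 MPI ranks, 2 GPUs, Kokkos suffix applied to all styles
mpirun -np 2 ./lmp_kokkos_cuda_mpi -k on g 2 -sf kk -in in.intel.sw
```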

Any comments or suggestions are appreciated. Thanks very much.

Best Regards,

Stan may have ideas (CCd).


Have you set MV2_USE_CUDA=1?

Yes, I have set the value of this environment variable as 1.

Has anyone further looked into this issue?

I ran the application using 2 MPI processes and 1 K80 (2 internal GPUs) and profiled it with nvprof. The profiling results are similar on both GPUs, so I will show the profiling information from one of the two:

======== Profiling result:
Time(%)  Time      Calls   Avg       Min       Max       Name
81.60%   427.922s  6202    68.997ms  19.940ms  70.289ms  _ZN6Kokkos4Impl36cuda_parallel_launch_constant_memoryINS0_11ParallelForIN9LAMMPS_NS12PairSWKokkosINS_4CudaEEENS_11RangePolicyIJS5_20TagPairSWComputeHalfILi2ELi0EEEEES5_EEEEvv
14.02%   73.5218s  212     346.80ms  341.73ms  347.51ms  _ZN6Kokkos4Impl36cuda_parallel_launch_constant_memoryINS0_11ParallelForIN9LAMMPS_NS26NeighborKokkosBuildFunctorINS_4CudaELi0ELi1EEENS_10TeamPolicyIJS5_EEES5_EEEEvv
 1.60%   8.40316s  109989  76.399us  1.2470us  12.104ms  [CUDA memcpy HtoD]
 1.51%   7.89488s  13000   607.30us  2.8470us  19.141ms  [CUDA memcpy DtoH]

======== API calls:
Time(%)  Time      Calls   Avg       Min       Max       Name
84.05%   444.743s  23773   18.708ms  3.8980us  88.517ms  cudaMemcpy
15.43%   81.6603s  99617   819.74us  1.3070us  486.71ms  cudaDeviceSynchronize
 0.14%   761.78ms  99210   7.6780us  4.6770us  28.793ms  cudaMemcpyToSymbol

The whole session took 535.366 s. The profiling result section indicates that one kernel spent the most time, but the API calls section indicates that data transfer was the most time-consuming part. The LAMMPS output from my previous emails also indicates that communication took the most time. Does anyone get similar performance results and know why? Thanks.
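A quick back-of-the-envelope check on the nvprof numbers above shows the two views are not necessarily contradictory: a blocking cudaMemcpy cannot return until previously launched (asynchronous) kernels have finished, so most of the pair-kernel time gets attributed to the cudaMemcpy API time as well.

```python
# Sanity check using the nvprof figures quoted above
session = 535.366      # total wall time of the profiled run (s)
memcpy_api = 444.743   # time spent inside cudaMemcpy API calls (s)
pair_kernel = 427.922  # GPU time of the dominant SW pair kernel (s)

print(f"cudaMemcpy share of session:  {memcpy_api / session:.1%}")
print(f"pair kernel share of session: {pair_kernel / session:.1%}")
# The two shares overlap almost entirely: kernel execution is hidden
# inside the blocking cudaMemcpy calls that implicitly synchronize.
```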


Version   MPI       GPU       Kokkos    Kokkos
Pair      342.54    577.36    0.5129    0.5138
Neigh     5.4962    0.3411    47.295    47.299
Comm      30.924    13.097    309.84    311.65
Output    0.0009    0.0079    0.0015    0.0016
Modify    2.6189    28.768    3.8833    3.9189
Other     2.382     16.53     1.233     1.282

This timing breakdown doesn’t make sense for Kokkos. My guess is that the timer breakdown in Kokkos CUDA has an issue (some of the pair time is bleeding into comm time). If you look at the total time, then Kokkos is faster than both 24 MPI and the GPU package in this case, right? I previously noticed a similar timing issue for the EAM benchmark. Not sure what is causing it and it will probably be a while before I can look into it in depth. For now I would suggest only comparing total simulation time from Kokkos CUDA to gauge performance.



Yes, I also think the timer breakdown has a problem in the Kokkos CUDA version. But if we compare the total wall-clock time, the times for 24 MPI, the GPU package, and Kokkos CUDA are 0:06:06, 0:10:38, and 0:06:05, respectively. So the GPU package is the slowest version, and the performance of Kokkos CUDA is the same as 24 MPI on the CPU. Based on the profiling result I sent in the previous email, the implementation of the SW pair kernel does not look efficient. I tried to debug the application, but unlike with the GPU package, I could not step into the source code of the SW pair kernel; in the Kokkos implementation the CUDA kernel is generated from C++ templates at compile time rather than written out explicitly. So is it possible for someone to check the pair kernel and neigh kernel for SW? Thanks.
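To put numbers on that comparison, here is a small conversion of the reported wall-clock totals to seconds (times are the ones quoted above):

```python
def hms_to_seconds(t):
    """Convert an h:mm:ss wall-clock string to seconds."""
    h, m, s = (int(x) for x in t.split(":"))
    return h * 3600 + m * 60 + s

totals = {
    "24 MPI (CPU)": "0:06:06",
    "GPU package":  "0:10:38",
    "Kokkos CUDA":  "0:06:05",
}
base = hms_to_seconds(totals["24 MPI (CPU)"])
for name, t in totals.items():
    s = hms_to_seconds(t)
    print(f"{name}: {s} s (speedup vs 24 MPI: {base / s:.2f}x)")
```

So the GPU package is about 0.57x the CPU run, while Kokkos CUDA is essentially a tie.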


It is actually your 24-MPI (CPU-only) runs that seem slow to me. If you aren't supplying different variable settings for the run length or system size, the time is about 45 s on recent Xeon CPUs with sw/intel, so an 8X slowdown seems pretty severe, even on older processors.


- Mike