Kokkos GPU performance for the Stillinger-Weber (SW) benchmark

Dear Lammpsers,

I am evaluating the performance of the LAMMPS Kokkos GPU package with different datasets. I get pretty good performance with the classic LJ benchmark, but the performance is poor with some of the other datasets.

The datasets I am using are in the folder lammps/src/USER-INTEL/TEST. Except for in.intel.lj, the performance of the other datasets is poor or not good enough. I am using lammps-14May16, which is the latest stable version; I also tried a newer version from GitHub but got the same poor performance. I am building against CUDA 7.5 with two MPI libraries, MVAPICH2 2.2rc1 and OpenMPI 2.0.0. Taking in.intel.sw as an example, the following is the timing breakdown (in seconds) for SW across the different builds.

MPI library  MVAPICH2   MVAPICH2   MVAPICH2   OpenMPI
Version      MPI        GPU        Kokkos     Kokkos

Pair         342.54     577.36     0.5129     0.5138
Neigh        5.4962     0.3411     47.295     47.299
Comm         30.924     13.097     309.84     311.65
Output       0.0009     0.0079     0.0015     0.0016
Modify       2.6189     28.768     3.8833     3.9189
Other        2.382      16.53      1.233      1.282

The MPI column is the pure CPU run with 24 MPI processes. The GPU column uses the GPU package, and the Kokkos column uses Kokkos with CUDA as the device. Both the GPU and Kokkos runs used 2 MPI processes and 2 GPUs.
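For reference, the Kokkos runs were launched with something like the following (the binary name lmp_kokkos_cuda is a placeholder for my actual build):

# 2 MPI ranks, 2 GPUs per node, Kokkos suffix applied to all styles
mpirun -np 2 ./lmp_kokkos_cuda -k on g 2 -sf kk -in in.intel.sw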

The "Pair" and "Neigh" timings differ so much between GPU and Kokkos because I added the command "package kokkos neigh half newton on comm device" to the input script. Before adding this command, the default neighbor mode was "full", and the "Pair" and "Neigh" timings under Kokkos were 186.75 and 2.8761, respectively.
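For clarity, the two neighbor configurations I compared correspond to the following package lines (the first just spells out the default explicitly):

package kokkos neigh full comm device             # default: full neighbor lists
package kokkos neigh half newton on comm device   # what I use now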

Now my question is why the communication cost is so high with Kokkos. Although I used "comm device", the performance is the same as with "comm host". I am already using CUDA-aware MPI builds, but it seems the communication does not benefit from peer-to-peer access. Has anyone observed a similar issue, or do I need to change some commands in the input script?
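In case it is a setup problem on my side, this is roughly how I check for CUDA awareness in the two MPI stacks (assuming standard builds; parameter and variable names may differ between versions):

# OpenMPI: verify the build has CUDA support compiled in
ompi_info --parsable --all | grep mpi_built_with_cuda

# MVAPICH2: CUDA support must also be enabled at run time
export MV2_USE_CUDA=1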

To make things easier, the in.intel.sw input script is posted below:

# bulk Si via Stillinger-Weber

package kokkos neigh half newton on comm device

variable w index 10    # Warmup Timesteps
variable t index 6200  # Main Run Timesteps
variable m index 1     # Main Run Timestep Multiplier
variable n index 0     # Use NUMA Mapping for Multi-Node
variable p index 0     # Use Power Measurement

variable x index 2
variable y index 2
variable z index 4

variable xx equal 20*$x
variable yy equal 20*$y
variable zz equal 10*$z
variable rr equal floor($t*$m)
variable root getenv LMP_ROOT

if "$n > 0" then "processors * * * grid numa"

units metal
atom_style atomic

lattice diamond 5.431
region box block 0 ${xx} 0 ${yy} 0 ${zz}
create_box 1 box
create_atoms 1 box

pair_style sw
pair_coeff * * ${root}/bench/POTENTIALS/Si.sw Si
mass 1 28.06

velocity all create 1000.0 376847 loop geom
neighbor 1.0 bin
neigh_modify delay 5 every 1

fix 1 all nve
thermo 1000
timestep 0.001

if "$p > 0" then "run_style verlet/power"
if "$w > 0" then "run $w"

run ${rr}

Any comments or suggestions are appreciated. Thanks very much.

Best Regards,
Rengan