Dear Lammpsers,
I am evaluating the performance of the LAMMPS KOKKOS package on GPUs with different datasets. I get pretty good performance with the classic LJ benchmark, but the performance is poor with some other datasets.
The datasets I am using are in the folder lammps/src/USER-INTEL/TEST. Except for in.intel.lj, the performance of the other datasets is poor or not good enough. I am using lammps-14May16, which is the latest stable version; I also tried the newer version from GitHub but got the same poor performance. I am building with CUDA 7.5 against MVAPICH2 2.2rc1 and OpenMPI 2.0.0. I will use in.intel.sw as an example. The following is the timing breakdown (all times in seconds) for the different builds on the sw benchmark:
MPI lib   MVAPICH2   MVAPICH2   MVAPICH2   OpenMPI
Build     MPI        GPU        Kokkos     Kokkos
Pair      342.54     577.36     0.5129     0.5138
Neigh     5.4962     0.3411     47.295     47.299
Comm      30.924     13.097     309.84     311.65
Output    0.0009     0.0079     0.0015     0.0016
Modify    2.6189     28.768     3.8833     3.9189
Other     2.382      16.53      1.233      1.282
The MPI column is the plain CPU build run with 24 MPI processes. The GPU column uses the GPU package, and the Kokkos columns use the KOKKOS package with CUDA as the device. Both the GPU and Kokkos runs used 2 MPI processes and 2 GPUs.
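For reference, here is a minimal sketch of the kind of launch commands I mean (the executable names are placeholders, not the exact binaries I built):

# GPU package run: 2 MPI ranks, 2 GPUs
mpirun -np 2 ./lmp_gpu -sf gpu -pk gpu 2 -in in.intel.sw

# KOKKOS package run: 2 MPI ranks, 2 GPUs per node, CUDA as the device
mpirun -np 2 ./lmp_kokkos -k on g 2 -sf kk -in in.intel.sw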
The Pair and Neigh times differ so much between GPU and Kokkos because I added the command "package kokkos neigh half newton on comm device" to the input script. Before adding this command the default neigh was "full", and the Pair and Neigh times in Kokkos were 186.75 and 2.8761, respectively.
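To spell that out, the two neighbor configurations compared above are (the first line is my understanding of what the default corresponds to, so treat it as an assumption):

package kokkos neigh full                        # default behavior (no explicit neigh setting)
package kokkos neigh half newton on comm device  # the command I added for the table above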
Now my question is why the communication cost is so high in Kokkos. Although I use "comm device", the performance is the same as with "comm host". I am already using CUDA-aware MPI builds, but the communication does not seem to benefit from peer-to-peer access. Has anyone observed a similar issue? Or do I need to change some commands in the input script?
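One thing I am not sure about (this is a guess on my part): with MVAPICH2, does the CUDA path also have to be enabled at run time before device buffers can be passed to MPI, e.g. something like

MV2_USE_CUDA=1 mpirun -np 2 ./lmp_kokkos -k on g 2 -sf kk -in in.intel.sw

(the executable name is again a placeholder)? Is something like this required for "comm device" to actually use GPU-direct communication?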
To make it easier, the in.intel.sw input script is posted as follows:
# bulk Si via Stillinger-Weber
package kokkos neigh half newton on comm device
variable w index 10 # Warmup Timesteps
variable t index 6200 # Main Run Timesteps
variable m index 1 # Main Run Timestep Multiplier
variable n index 0 # Use NUMA Mapping for Multi-Node
variable p index 0 # Use Power Measurement
variable x index 2
variable y index 2
variable z index 4
variable xx equal 20*$x
variable yy equal 20*$y
variable zz equal 10*$z
variable rr equal floor($t*$m)
variable root getenv LMP_ROOT
if "$n > 0" then "processors * * * grid numa"
units metal
atom_style atomic
lattice diamond 5.431
region box block 0 ${xx} 0 ${yy} 0 ${zz}
create_box 1 box
create_atoms 1 box
pair_style sw
pair_coeff * * ${root}/bench/POTENTIALS/Si.sw Si
mass 1 28.06
velocity all create 1000.0 376847 loop geom
neighbor 1.0 bin
neigh_modify delay 5 every 1
fix 1 all nve
thermo 1000
timestep 0.001
if "$p > 0" then "run_style verlet/power"
if "$w > 0" then "run $w"
run ${rr}
Any comments or suggestions are appreciated. Thanks very much.
Best Regards,
Rengan