Can't distribute threads among GPUs

Dear all,

I compiled LAMMPS with KOKKOS support using the following cmake options:

"cmake -D PKG_USER-REAXC=yes -D PKG_KOKKOS=yes -D KOKKOS_ARCH=“SNB;Kepler37” -D KOKKOS_ENABLE_CUDA=yes -D KOKKOS_ENABLE_OPENMP=yes -D CMAKE_CXX_COMPILER=/home/zhengyh/software/latest_lammps/lib/kokkos/bin/nvcc_wrapper …/cmake/ "

And this is how I run it:

"mpirun -np 4 ~/bin/lmp_kokkos_cuda -k on g 4 -sf kk -pk kokkos newton on neigh half< in.testMPI
"

On a GPU node that has 4 K80s, all four MPI ranks end up on the same GPU:

"
Sun Jan 5 16:43:16 2020
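(That assignment can be rechecked with nvidia-smi while the job runs; the query fields below are taken from --help-query-compute-apps and may differ slightly between driver versions:)

    # one line per compute process, together with the UUID of the GPU it runs on
    nvidia-smi --query-compute-apps=gpu_uuid,pid,process_name --format=csv
    # or simply watch the live table
    watch -n 1 nvidia-smi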

And this is the output of the execution:

Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
For unit testing set OMP_PROC_BIND=false
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
For unit testing set OMP_PROC_BIND=false
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
For unit testing set OMP_PROC_BIND=false
LAMMPS (7 Aug 2019)
KOKKOS mode is enabled (src/KOKKOS/kokkos.cpp:85)
will use up to 4 GPU(s) per node
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
For unit testing set OMP_PROC_BIND=false
WARNING: Detected MPICH. Disabling CUDA-aware MPI (src/KOKKOS/kokkos.cpp:252)
using 1 OpenMP thread(s) per MPI task
Reading data file ...
orthogonal box = (0 0 0) to (51.5041 49.2688 107.205)
1 by 1 by 4 MPI processor grid
reading atoms ...
32000 atoms
read_data CPU = 0.130222 secs
16000 atoms in group H
8000 atoms in group C
8000 atoms in group O
Neighbor list info ...
update every 10 steps, delay 0 steps, check no
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 12
ghost atom cutoff = 12
binsize = 12, bins = 5 5 9
2 neighbor lists, perpetual/occasional/extra = 2 0 0
(1) pair reax/c/kk, perpetual
attributes: half, newton off, ghost, kokkos_device
pair build: half/bin/ghost/kk/device
stencil: half/ghost/bin/3d/newtoff
bin: kk/device
(2) fix qeq/reax/kk, perpetual, copy from (1)
attributes: half, newton off, ghost, kokkos_device
pair build: copy/kk/device
stencil: none
bin: none
Setting up Verlet run ...
Unit style : real
Current step : 0
Time step : 0.1
WARNING: Fixes cannot yet send data in Kokkos communication, switching to classic communication (src/KOKKOS/comm_kokkos.cpp:493)
Per MPI rank memory allocation (min/avg/max) = 108.7 | 108.9 | 109.2 Mbytes
Step Temp E_pair E_mol TotEng Press
0 0 -2986858.8 0 -2986858.8 -10322.44
50 103.92834 -2996580.3 0 -2986667.3 15444.296
Loop time of 45.6401 on 4 procs for 50 steps with 32000 atoms

Performance: 0.009 ns/day, 2535.559 hours/ns, 1.096 timesteps/s
65.4% CPU use with 4 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total

What kind of MPI are you using and what version of LAMMPS? Sometimes this can happen when we can’t determine what type of MPI you are using. We recently added a check for this, see https://github.com/lammps/lammps/blob/master/src/KOKKOS/kokkos.cpp#L144.
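A quick way to gather that information, and to see which local-rank variables your launcher actually exports (the grep pattern below is only a rough filter; which variables appear depends on the MPI installation):

    # which MPI does mpirun belong to?
    mpirun --version
    # which per-rank environment variables does it pass to each process?
    mpirun -np 2 sh -c 'env | grep -iE "rank|local"'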