Compiling Kokkos OMP

All,

I’ve compiled LAMMPS with KOKKOS OMP successfully (got a binary upon completion). However, I suspect LAMMPS is not talking with OpenMP when I execute the program, based on very slow simulation times (7x slower on the in.lj accelerate example than USER-OMP, which appears to be working) and the lack of OpenMP threads reported at the end of the log file:
“99.8% CPU use with 2 MPI tasks x no OpenMP threads”

This contradicts the beginning of the same log file, which states:

“LAMMPS (5 Aug 2016)
KOKKOS mode is enabled (../kokkos.cpp:38)
using 0 GPU(s)
using 8 OpenMP thread(s) per MPI task”

LAMMPS (5 Aug 2016)
NASA Pleiades cluster (SUSE Linux)
gcc 4.9.3
SGI MPT MPI library

I’ve modified the LAMMPS-supplied Makefile.mpi to read:

CC = mpicxx
CCFLAGS = -g -O3 -fopenmp -lmpi -lmpi++
SHFLAGS = -fPIC
DEPFLAGS = -M

LINK = mpicxx
LINKFLAGS = -g -O -fopenmp -lmpi -lmpi++
LIB =
SIZE = size

and I am compiling using the command:
python Make.py -j 8 -p none asphere molecule kspace rigid kokkos orig -kokkos omp -o kokkos_omp -a file clean exe -m mpi -v
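As a quick sanity check on the resulting binary (assuming gcc’s libgomp is the OpenMP runtime), ldd should list libgomp among the shared libraries; if nothing prints, -fopenmp did not make it into the link:

ldd ./lmp_kokkos_omp_20160804 | grep -i gomp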

My execute command is:
mpiexec -ppn 2 lmp_kokkos_omp_20160804 -k on t 8 -sf kk -in in.lj -v t 10 -echo both

echo $OMP_NUM_THREADS in the submit script correctly shows 8.
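As a runtime double-check (assuming standard Linux top), running top in threads mode while the job is active shows per-thread CPU usage; with working OpenMP each LAMMPS process should show several busy threads rather than one:

top -H -u $USER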

If this is not a LAMMPS issue I’ll get hold of Pleiades IT; I just want to do my due diligence before I go that route.

Thanks.

-Ben

All,

So what I’m really getting at is: should I believe the beginning of the log file, where it says:
“KOKKOS mode is enabled (../kokkos.cpp:38)
using 0 GPU(s)
using 8 OpenMP thread(s) per MPI task”

or the end of the log file where it says:
“99.8% CPU use with 2 MPI tasks x no OpenMP threads”

How many OpenMP threads per MPI task were used, 0 or 8?

Thanks.

-Ben


not entirely straightforward to say. you can disregard the "no OpenMP threads", as that (currently) refers only to the USER-OMP package, and that part of the code has not yet been updated to account for threading via KOKKOS as well.

the message at the beginning only confirms that you have told the OpenMP infrastructure to use 8 threads.

however, if those 8 threads were actually in active use, the %CPU number would be larger than 100 (e.g. 8 busy threads per MPI task should report close to 800%).

so that would mean either
a) you are using an input that does not make use of any KOKKOS-enabled styles (i.e. your threads are enabled but idle), or
b) your MPI library is set up to apply single-core processor affinity to your MPI tasks (your threads are confined to the same CPU core); a quick check for the latter is sketched below.
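a minimal check for case b), assuming linux (this is generic, not LAMMPS specific): each rank prints the list of cpu cores it is allowed to run on, and a single core per rank means your threads are pinned on top of each other:

mpiexec -ppn 2 grep Cpus_allowed_list /proc/self/status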

thus best take a simple input, e.g. bench/in.lj, which is definitely KOKKOS-enabled, and run it with:
- KOKKOS disabled and 1 MPI
- KOKKOS disabled and 2 MPI
- KOKKOS disabled and 4 MPI
- KOKKOS disabled and 8 MPI
- KOKKOS enabled with 1 thread and 1 MPI
- KOKKOS enabled with 2 threads and 1 MPI
- KOKKOS enabled with 4 threads and 1 MPI
- KOKKOS enabled with 8 threads and 1 MPI
- KOKKOS enabled with 1 thread and 2 MPI
- KOKKOS enabled with 2 threads and 2 MPI
- KOKKOS enabled with 4 threads and 2 MPI
- KOKKOS enabled with 1 thread and 4 MPI
- KOKKOS enabled with 2 threads and 4 MPI

and then compare Loop time and %CPU output.
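for convenience, a possible shell sketch of that sweep (assuming your mpiexec accepts -np; binary name and input path taken from your commands above, the log file naming is just a suggestion):

#!/bin/sh
BIN=./lmp_kokkos_omp_20160804
# KOKKOS disabled: pure MPI runs
for n in 1 2 4 8; do
  mpiexec -np $n $BIN -in bench/in.lj -log log.mpi$n
done
# KOKKOS enabled: threads x MPI tasks, staying within 8 cores
for n in 1 2 4; do
  for t in 1 2 4 8; do
    [ $((n*t)) -le 8 ] || continue
    mpiexec -np $n $BIN -k on t $t -sf kk -in bench/in.lj -log log.kk.mpi$n.t$t
  done
done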

axel.

Thanks for the troubleshooting leads, Axel. The threads were being assigned to the same CPU core. I’ve revised my mpiexec command to use the SGI MPT option omplace to handle thread placement (omplace is SGI’s wrapper script for dplace). My execute command now reads:
mpiexec omplace lmp_omp20160804 -sf opt -pk omp 8 -in in.lj -echo both

and the log file shows:
781.8% CPU use with 2 MPI tasks x no OpenMP threads

and the simulation is running ~60x faster than without the omplace command.

Thanks.

-Ben


this doesn't make too much sense compared to what you initially asked about. none of that has any relation to KOKKOS. -sf opt will select /opt styles, which do *not* have threading included and are not related to KOKKOS. the -pk omp 8 will trigger having an 8-thread pool, but those threads won't do any computing work. also, your output quote suggests that you didn't compile/configure the USER-OMP package in the first place (you should get an error in that case, so i am surprised that this works at all for you).

the 781.8% CPU figure hints at the thread pool doing busy looping, but not really using the threads for useful work. so what you were seeing before was a lot of contention from the busy thread pool competing with the serial execution of your computation.

axel.

Thanks again, Axel. That was both a copy-paste error on my part and the “opt” typo you pointed out. I am comparing KOKKOS OMP to USER-OMP to try to narrow down my problems. My KOKKOS command is:
mpiexec omplace lmp_kokkos_omp_20160804 -k on t 8 -sf kk -in in.lj -echo both

and my (fixed) USER-OMP command is now:
mpiexec omplace lmp_omp20160804 -sf omp -pk omp 8 -in in.lj -echo both

which results in the following runtimes for the in.lj example:
KOKKOS OMP:
Loop time of 0.451828 on 16 procs for 100 steps with 32000 atoms
778.2% CPU use with 2 MPI tasks x no OpenMP threads

USER-OMP:
Loop time of 0.326399 on 16 procs for 100 steps with 32000 atoms
788.0% CPU use with 2 MPI tasks x 8 OpenMP threads

Thanks.

-Ben