KOKKOS package crash on GPU

Naga,

I fixed the crash here: https://github.com/lammps/lammps/pull/1415. Can you try it out? FYI, you are paying a performance penalty because “temp/com” isn’t ported to Kokkos, which causes a lot of host <-> device memory transfers. It wouldn’t be hard to port this; it just hasn’t been done yet.
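Something along these lines in an input script is what forces data back to the host whenever the thermostat needs the temperature (the IDs, group name, and temp/press values below are placeholders, not taken from your actual input):

compute      comTemp all temp/com                                # un-ported compute, runs on the host
fix          1 all npt temp 1500.0 1500.0 0.1 iso 0.0 0.0 1.0    # placeholder thermostat/barostat settings
fix_modify   1 temp comTemp                                      # npt now pulls data off the device each time it asks for T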

Thanks,

Stan

Thanks. Will try it and get back.

Thanks,

It works now. However, performance is much worse than with the GPU package. I believe this is because temp/com hasn’t been ported, as you mentioned? Is there a plan to do this sometime soon? Also, you mentioned some performance improvements in KOKKOS; is that version available for download on GitHub?

thanks,
Naga

A quick question (this may be dumb): does the missing temp/com port not impact the LJ benchmark? Asking because I see a performance improvement there with KOKKOS versus GPU.

regards,
Naga

Hi Stan,
There are a whole lot of H2D and D2H copies happening, as you mentioned (>60% of GPU time is spent on these). Is there a plan to port temp/com to KOKKOS?
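(For anyone wanting to reproduce the measurement, one way to get a per-rank memcpy breakdown is nvprof’s default summary, roughly as below; the [CUDA memcpy HtoD] and [CUDA memcpy DtoH] rows then give the copy percentages.)

mpirun -np 8 nvprof --log-file nvprof.%p.log \
    lmp_mpikokkoscuda -k on g 1 -sf kk -in Coexistence_input.small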

regards,
Naga

Yes, porting compute temp/com is fairly trivial to do. I started but have been working on other things.

In the meantime, one thing you can try is CUDA MPS with multiple MPI ranks per GPU, together with running the comm pack/unpack on the host (-pk kokkos comm no).
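For example (the binary and input names below are placeholders for whatever you are running):

nvidia-cuda-mps-control -d                      # start the MPS control daemon once per node
mpirun -np 8 lmp_kokkos_cuda -k on g 1 -sf kk \
    -pk kokkos comm no -in in.your_script       # 8 ranks share 1 GPU; pack/unpack stays on the host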

The new code optimized for small systems is here: https://github.com/lammps/lammps/pull/1422. I am still testing this PR.

Stan

The LJ benchmark doesn’t use compute temp/com, so it doesn’t affect the speed.

Stan

You could also try running fix npt on the host to avoid the data “ping-ponging” between host and device. See http://lammps.sandia.gov/doc/Speed_kokkos.html for all the details, but the short version is that you can use fix npt/kk/host in the input script.
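Concretely, the only change is the suffix on the fix style; the fix ID, group, and thermostat/barostat values here are placeholders:

fix 1 all npt/kk/host temp 1500.0 1500.0 0.1 iso 0.0 0.0 1.0   # this fix runs on the host, the rest of the run keeps the kk suffix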

Stan

Hi Stan,

I tried out both the changes you suggested and yes, it is ~1.6x faster. Thanks.
This is still ~4x slower than the GPU package.
Also, sometimes I see the runs crashing with a segmentation fault. If I switch off GPU-direct, the segmentation fault does not occur. But I do have GPU-direct support in both my MPI library and the cards.
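For reference, switching GPU-direct off just means adding the Kokkos package option for it on the command line, e.g.:

mpirun -np 8 lmp_mpikokkoscuda -k on g 1 -sf kk -pk kokkos gpu/direct off -in Coexistence_input.small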

regards,
Naga

Can you send in a reproducer of the segmentation fault?

Are you using the code here: https://github.com/lammps/lammps/pull/1422? I’m still working through some issues with that; it isn’t quite ready.

Stan

No, I am using the Dec 2018 stable release with only the fix that you provided (pull/1415). Will send you the dump shortly.

This is the dump I get. The previous input script (Coexistence_input) is modified for npt/kk/host, and MPS is enabled. By the way, if I run the communication pack/unpack via the host (-pk kokkos comm no), there is no segmentation fault.

[nvydyanathan@…8385… LAMMPS]$ mpirun -np 8 lmp_mpikokkoscuda -k on g 1 -sf kk -in Coexistence_input.small
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
For unit testing set OMP_PROC_BIND=false
[the same Kokkos::OpenMP::initialize warning is printed by each of the remaining MPI ranks]
LAMMPS (12 Dec 2018)
KOKKOS mode is enabled (src/KOKKOS/kokkos.cpp:77)
will use up to 1 GPU(s) per node
WARNING: Kokkos with CUDA assumes GPU-direct is available, but cannot determine if this is the case
try ‘-pk kokkos gpu/direct off’ when getting segmentation faults (src/KOKKOS/kokkos.cpp:157)
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
For unit testing set OMP_PROC_BIND=false
using 1 OpenMP thread(s) per MPI task
Lattice spacing in x,y,z = 3.52 3.52 3.52
Created orthogonal box = (0 0 0) to (14.08 253.44 506.88)
1 by 2 by 4 MPI processor grid
Created 165888 atoms
Time spent = 0.0184968 secs
Reading potential file Ni_u3.eam with DATE: 2007-06-11
82944 atoms in group liquid
82944 atoms in group solid
WARNING: More than one compute coord/atom (src/compute_coord_atom.cpp:151)
WARNING: More than one compute coord/atom (src/compute_coord_atom.cpp:151)
Neighbor list info …
update every 20 steps, delay 0 steps, check no
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 6.8
ghost atom cutoff = 6.8
binsize = 3.4, bins = 5 75 150
3 neighbor lists, perpetual/occasional/extra = 1 2 0
(1) pair eam/kk, perpetual
attributes: full, newton off, kokkos_device
pair build: full/bin/kk/device
stencil: full/bin/3d
bin: kk/device
(2) compute coord/atom, occasional
attributes: full, newton off
pair build: full/bin/atomonly
stencil: full/bin/3d
bin: standard
(3) compute coord/atom, occasional
attributes: full, newton off
pair build: full/bin/atomonly
stencil: full/bin/3d
bin: standard
Setting up Verlet run …
Unit style : metal
Current step : 0
Time step : 0.001
Per MPI rank memory allocation (min/avg/max) = 40.24 | 40.24 | 40.24 Mbytes
Step Temp PotEng KinEng TotEng Press Volume Enthalpy
0 1500 -738201.6 32163.867 -706037.73 18993.352 1808768.4 -684595.29
500 924.68769 -717340.98 19827.688 -697513.29 3776.1543 1883830.5 -693073.31
Loop time of 9.71797 on 8 procs for 500 steps with 165888 atoms

Performance: 4.445 ns/day, 5.399 hours/ns, 51.451 timesteps/s
71.9% CPU use with 8 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total

This looks like a bug. I’ll see if I can reproduce it. This is for the same input you sent previously, correct? As far as performance goes, are you using multiple MPI ranks per GPU with MPS?

Yes, it is the same input file.
Regarding performance, yes, I mapped 8 MPI ranks to 1 GPU and tried different combinations too; this one gave the best time.

regards,
Naga

OK, I’ll take another look, probably next week.

Stan

Hi Stan,

Is there a KOKKOS port of temp/com available that I can try?

thanks,
Naga