KOKKOS package crash on GPU

Naga,

I fixed the crash here: https://github.com/lammps/lammps/pull/1415. Can you try it out? FYI, you are paying a performance penalty because “temp/com” isn’t ported to Kokkos, which causes a lot of host <-> device memory transfers. It wouldn’t be hard to port this; it just hasn’t been done yet.
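Something along these lines in an input script is what forces data back to the host whenever the thermostat needs the temperature (the IDs, group name, and temp/press values below are placeholders, not taken from your actual input):

compute      comTemp all temp/com                                # un-ported compute, runs on the host
fix          1 all npt temp 1500.0 1500.0 0.1 iso 0.0 0.0 1.0    # placeholder thermostat/barostat settings
fix_modify   1 temp comTemp                                      # npt now pulls data off the device each time it asks for T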

Thanks,

Stan

Thanks. Will try it and get back.

Thanks,

It works now. However, performance is much worse than with the GPU package. I believe this is because temp/com hasn’t been ported, as you mentioned? Is there a plan to do this sometime soon? Also, you mentioned some performance improvements in KOKKOS; is that version available for download on GitHub?

thanks,
Naga

A quick question (this may be dumb): does the missing temp/com port not impact the LJ benchmark? Asking because I see a performance improvement there with KOKKOS versus GPU.

regards,
Naga

Hi Stan,
There are a whole lot of H2D and D2H copies happening, as you mentioned (>60% of GPU time is spent on these). Is there a plan to port temp/com to KOKKOS?
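(For anyone wanting to reproduce the measurement, one way to get a per-rank memcpy breakdown is nvprof’s default summary, roughly as below; the [CUDA memcpy HtoD] and [CUDA memcpy DtoH] rows then give the copy percentages.)

mpirun -np 8 nvprof --log-file nvprof.%p.log \
    lmp_mpikokkoscuda -k on g 1 -sf kk -in Coexistence_input.small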

regards,
Naga

Yes, porting compute temp/com is fairly trivial to do. I started but have been working on other things.

In the meantime, one thing you can try is CUDA MPS with multiple MPI ranks per GPU, together with running the comm pack/unpack on the host (-pk kokkos comm no).
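For example (the binary and input names below are placeholders for whatever you are running):

nvidia-cuda-mps-control -d                      # start the MPS control daemon once per node
mpirun -np 8 lmp_kokkos_cuda -k on g 1 -sf kk \
    -pk kokkos comm no -in in.your_script       # 8 ranks share 1 GPU; pack/unpack stays on the host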

The new code optimized for small systems is here: https://github.com/lammps/lammps/pull/1422. I am still testing this PR.

Stan

The LJ benchmark doesn’t use compute temp/com, so it doesn’t affect the speed.

Stan

You could also try running fix npt on the host to avoid the data “ping-ponging” between host and device. See http://lammps.sandia.gov/doc/Speed_kokkos.html for all the details, but the short version is that you can use fix npt/kk/host in the input script.
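Concretely, the only change is the suffix on the fix style; the fix ID, group, and thermostat/barostat values here are placeholders:

fix 1 all npt/kk/host temp 1500.0 1500.0 0.1 iso 0.0 0.0 1.0   # this fix runs on the host, the rest of the run keeps the kk suffix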

Stan

Hi Stan,

I tried out both the changes you suggested and yes, it is ~1.6x faster. Thanks.
This is still ~4x slower than the GPU package.
Also, sometimes I see the runs crashing with a segmentation fault. If I switch off GPU-direct, the segmentation fault does not occur. But I do have GPU-direct support in both my MPI library and the cards.
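For reference, switching GPU-direct off just means adding the Kokkos package option for it on the command line, e.g.:

mpirun -np 8 lmp_mpikokkoscuda -k on g 1 -sf kk -pk kokkos gpu/direct off -in Coexistence_input.small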

regards,
Naga

Can you send in a reproducer of the segmentation fault?

Are you using the code here: https://github.com/lammps/lammps/pull/1422? I’m still working through some issues with that; it isn’t quite ready.

Stan

No, I am using the Dec 2018 stable release with only the fix that you provided (pull/1415). Will send you the dump shortly.

This is the dump I get. The previous input script (Coexistence_input) is modified for npt/kk/host, and MPS is enabled. By the way, if I run the communication pack/unpack via the host (-pk kokkos comm no), there is no segmentation fault.

[nvydyanathan@…8385… LAMMPS]$ mpirun -np 8 lmp_mpikokkoscuda -k on g 1 -sf kk -in Coexistence_input.small
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
For unit testing set OMP_PROC_BIND=false
[the same Kokkos::OpenMP::initialize warning is printed by each of the remaining MPI ranks]
LAMMPS (12 Dec 2018)
KOKKOS mode is enabled (src/KOKKOS/kokkos.cpp:77)
will use up to 1 GPU(s) per node
WARNING: Kokkos with CUDA assumes GPU-direct is available, but cannot determine if this is the case
try ‘-pk kokkos gpu/direct off’ when getting segmentation faults (src/KOKKOS/kokkos.cpp:157)
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
For unit testing set OMP_PROC_BIND=false
using 1 OpenMP thread(s) per MPI task
Lattice spacing in x,y,z = 3.52 3.52 3.52
Created orthogonal box = (0 0 0) to (14.08 253.44 506.88)
1 by 2 by 4 MPI processor grid
Created 165888 atoms
Time spent = 0.0184968 secs
Reading potential file Ni_u3.eam with DATE: 2007-06-11
82944 atoms in group liquid
82944 atoms in group solid
WARNING: More than one compute coord/atom (src/compute_coord_atom.cpp:151)
WARNING: More than one compute coord/atom (src/compute_coord_atom.cpp:151)
Neighbor list info …
update every 20 steps, delay 0 steps, check no
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 6.8
ghost atom cutoff = 6.8
binsize = 3.4, bins = 5 75 150
3 neighbor lists, perpetual/occasional/extra = 1 2 0
(1) pair eam/kk, perpetual
attributes: full, newton off, kokkos_device
pair build: full/bin/kk/device
stencil: full/bin/3d
bin: kk/device
(2) compute coord/atom, occasional
attributes: full, newton off
pair build: full/bin/atomonly
stencil: full/bin/3d
bin: standard
(3) compute coord/atom, occasional
attributes: full, newton off
pair build: full/bin/atomonly
stencil: full/bin/3d
bin: standard
Setting up Verlet run …
Unit style : metal
Current step : 0
Time step : 0.001
Per MPI rank memory allocation (min/avg/max) = 40.24 | 40.24 | 40.24 Mbytes
Step Temp PotEng KinEng TotEng Press Volume Enthalpy
0 1500 -738201.6 32163.867 -706037.73 18993.352 1808768.4 -684595.29
500 924.68769 -717340.98 19827.688 -697513.29 3776.1543 1883830.5 -693073.31
Loop time of 9.71797 on 8 procs for 500 steps with 165888 atoms

Performance: 4.445 ns/day, 5.399 hours/ns, 51.451 timesteps/s
71.9% CPU use with 8 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total

This looks like a bug. I’ll see if I can reproduce it. This is for the same input you sent previously, correct? As far as performance goes, are you using multiple MPI ranks per GPU with MPS?

Yes, it is the same input file.
Regarding performance, yes, I mapped 8 MPI ranks to 1 GPU and tried different combinations too; this one gave the best time.

regards,
Naga

OK, I’ll take another look, probably next week.

Stan

Hi Stan,

Is there a KOKKOS port of temp/com available that I can try?

thanks,
Naga