Hi,
Is it possible to use LAMMPS with CUDA MPS enabled? When I try it, I get the following error message:
LAMMPS (12 Dec 2018)
ERROR: Accelerator sharing is not currently supported on system (src/GPU/gpu_extra.h:47)
Last command: package gpu 1
The command I use is
$ mpirun -np 32 lmp_mpigpu -sf gpu -pk gpu 1 -in lammps-12Dec18/bench/in.lj
Could you please let me know how to use LAMMPS with MPS enabled?
thanks,
Naga
Doesn't it have to be -pk gpu 0?
please try compiling the GPU library with the define -DCUDA_PROXY added to the CUDR_CPP variable in Makefile.linux (or equivalent).
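for illustration only (the compiler and existing flags in your lib/gpu/Makefile.linux will be different; just append the define to whatever is already there), the edited line could look roughly like:
CUDR_CPP = mpicxx -DMPICH_IGNORE_CXX_SEEK -fPIC -DCUDA_PROXY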
axel.
I modified Makefile.linux in lib/gpu/ as suggested and built the LAMMPS executable using CMake with the following commands:
$ cmake -D CMAKE_BUILD_TYPE=Release -DBUILD_MPI=yes -DLAMMPS_MACHINE=mpigpu -DPKG_GPU=yes -DGPU_API=cuda -DGPU_ARCH=sm_70 -DPKG_MANYBODY=yes ../cmake
$ make
$ make install
I am not sure if this is correct, as I still get the same error with MPS enabled.
regards,
Naga
the step i suggested applies to the conventional way of building LAMMPS. CMake doesn't expose this define to users thus far and ignores all makefiles in lib/gpu. you might get lucky with adding -DCMAKE_CXX_FLAGS=-DCUDA_PROXY to the cmake command line.
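for example, something along these lines (just your earlier cmake command with the extra flag appended, untested):
$ cmake -D CMAKE_BUILD_TYPE=Release -DBUILD_MPI=yes -DLAMMPS_MACHINE=mpigpu -DPKG_GPU=yes -DGPU_API=cuda -DGPU_ARCH=sm_70 -DPKG_MANYBODY=yes -DCMAKE_CXX_FLAGS=-DCUDA_PROXY ../cmake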
axel.
Thanks Axel, adding -DCUDA_PROXY to CMAKE_CXX_FLAGS helped and I am able to get LAMMPS running with CUDA MPS.
However, I still see the scaling issue as I increase the number of GPUs within a node. As mentioned in my earlier post, the memcopy times have increased significantly: the number of calls remains the same, but the average time per call is higher.
On the other hand, if I limit myself to 1 GPU per node and use more nodes to get more GPUs, I see better scaling.
For example, with the scaled LJ benchmark:
8 M atoms on 1 node, with 32 MPI processes mapped to 1 GPU -> 15.9 timesteps/sec
8 M atoms on 2 nodes, 16 MPI processes per node, using 1 GPU per node -> 28.7 timesteps/sec (the number of MPI processes is set to whatever gives the best performance for this problem size)
regards,
Naga
I believe this is because PCIe is the bottleneck, since the GPU package moves only part of the computation to the GPU and involves back-and-forth data transfer between host and device?
Maybe I should try the KOKKOS CUDA package?
since you are looking for strong scaling results, the GPU package is actually much better suited than KOKKOS. KOKKOS is designed for having one MPI rank attached to each accelerator, which usually gives it an advantage over the GPU package when you have a very large number of atoms and are looking at weak scaling. please note that with KOKKOS you also have to transfer data, because you need to exchange data between MPI ranks at every step.
what you may run into are general memory throughput bottlenecks. i would organize your benchmarks differently:
first test with 1 MPI rank per GPU and then see how well you can scale to multiple GPUs. a DGX-1 has 8 GPUs, so after running 8 MPI processes across 8 GPUs, repeat this with 2 MPI ranks per GPU, and so on. there should be an optimum number of ranks for each number of GPUs, and if you go over it, performance will drop.
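to illustrate the idea (using the executable name and LJ benchmark input from earlier in this thread; adjust paths and rank counts as needed):
$ mpirun -np 1 lmp_mpigpu -sf gpu -pk gpu 1 -in in.lj    # 1 rank on 1 GPU
$ mpirun -np 8 lmp_mpigpu -sf gpu -pk gpu 8 -in in.lj    # 1 rank per GPU across all 8 GPUs
$ mpirun -np 16 lmp_mpigpu -sf gpu -pk gpu 8 -in in.lj   # 2 ranks per GPU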
axel.
i forgot to mention:
what can help on a machine like a DGX-1 when using KOKKOS is a GPU-direct enabled MPI library. that can improve data transfer performance.
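as a quick check (assuming Open MPI; other MPI libraries report this differently), you can ask the library whether it was built with CUDA support:
$ ompi_info --parsable --all | grep mpi_built_with_cuda_support:value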
axel.
Ok, will test it out. Yes, we have CUDA-aware MPI on the DGX-1. Does LAMMPS need to be compiled differently to pass device pointers in MPI calls? Does the GPU package leverage CUDA-aware MPI, or only KOKKOS?
Thanks,
Naga
GPU-direct only makes sense for KOKKOS.
The KOKKOS + CUDA package with GPU-direct MPI and CUDA MPS enabled shows ~20% better scaling efficiency across GPUs within a node. Thanks Axel for the help :).
However, when I run the attached input script, LAMMPS aborts with the following error message:
ERROR: Lost atoms: original 165888 current 0 (src/thermo.cpp:441)
I am running using the following command:
mpirun -np 1 lmp_mpikokkoscuda -k on g 1 -sf kk -in Coexistence_input
and the dump is :
LAMMPS (12 Dec 2018)
KOKKOS mode is enabled (src/KOKKOS/kokkos.cpp:77)
will use up to 1 GPU(s) per node
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
For unit testing set OMP_PROC_BIND=false
using 1 OpenMP thread(s) per MPI task
Lattice spacing in x,y,z = 3.52 3.52 3.52
Created orthogonal box = (0 0 0) to (14.08 253.44 506.88)
1 by 1 by 1 MPI processor grid
Created 165888 atoms
Time spent = 0.131174 secs
Reading potential file Ni_u3.eam with DATE: 2007-06-11
82944 atoms in group liquid
82944 atoms in group solid
WARNING: More than one compute coord/atom (src/compute_coord_atom.cpp:151)
WARNING: More than one compute coord/atom (src/compute_coord_atom.cpp:151)
Neighbor list info ...
update every 20 steps, delay 0 steps, check no
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 6.8
ghost atom cutoff = 6.8
binsize = 3.4, bins = 5 75 150
3 neighbor lists, perpetual/occasional/extra = 1 2 0
(1) pair eam/kk, perpetual
attributes: full, newton off, kokkos_device
pair build: full/bin/kk/device
stencil: full/bin/3d
bin: kk/device
(2) compute coord/atom, occasional
attributes: full, newton off
pair build: full/bin/atomonly
stencil: full/bin/3d
bin: standard
(3) compute coord/atom, occasional
attributes: full, newton off
pair build: full/bin/atomonly
stencil: full/bin/3d
bin: standard
Setting up Verlet run ...
Unit style : metal
Current step : 0
Time step : 0.001
Per MPI rank memory allocation (min/avg/max) = 299.8 | 299.8 | 299.8 Mbytes
Step Temp PotEng KinEng TotEng Press Volume Enthalpy
0 1500 -738201.6 32163.867 -706037.73 18993.352 1808768.4 -684595.29
ERROR: Lost atoms: original 165888 current 0 (src/thermo.cpp:441)
Last command: run 200000
Kokkos::Cuda ERROR: Failed to call Kokkos::Cuda::finalize()
Coexistence_input (2.27 KB)
MELT.lammps (3.44 KB)
for new questions (like this one), please start a new thread with a new subject line. thanks, axel.
Naga
ERROR: Accelerator sharing is not currently supported on system (src/GPU/gpu_extra.h:47)
This error is actually due to an issue with the system configuration. The GPUs are configured in "exclusive" compute mode, but need to be switched to "default", i.e. shared, mode. You can check this on your system using "nvidia-smi". Most systems (e.g. ORNL Summit) have a way in the job allocation command to switch back and forth between these settings.
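For example (assuming you have the necessary permissions on the node; otherwise ask your administrator or use the job scheduler option):
$ nvidia-smi --query-gpu=compute_mode --format=csv   # show the current compute mode of each GPU
$ sudo nvidia-smi -c DEFAULT                         # switch all GPUs to shared ("Default") mode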
Stan
FYI we have recently been optimizing the KOKKOS package in LAMMPS for small systems. This can help with strong scaling. These improvements should be released soon.
Stan