LAMMPS + Kokkos running on a single GPU only

While checking the GPU usage more carefully, I realized that Kokkos is only using one GPU card and not both. Given below is the output of the nvidia-smi command; as we can see, it is using card number 0 for all 4 MPI ranks and not card 1.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01      Driver Version: 440.33.01      CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:5E:00.0 Off |                    0 |
| N/A   38C    P0    65W / 250W |   2420MiB / 16160MiB |     79%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:86:00.0 Off |                    0 |
| N/A   30C    P0    24W / 250W |     12MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      3646      C   .../lammps-29Oct20/src/lmp_kokkos_cuda_mpi   601MiB |
|    0      3647      C   .../lammps-29Oct20/src/lmp_kokkos_cuda_mpi   603MiB |
|    0      3648      C   .../lammps-29Oct20/src/lmp_kokkos_cuda_mpi   603MiB |
|    0      3649      C   .../lammps-29Oct20/src/lmp_kokkos_cuda_mpi   601MiB |
+-----------------------------------------------------------------------------+
It also gives the same execution time even when I change the number of GPUs each time. I can't figure out the actual problem, please help.

Thanks and regards,
Ranjit

Ranjit,

In order to be able to help, you have to provide us with all the necessary information to reproduce what you are doing, and that includes the exact command line that you are using to run LAMMPS. Most likely you are not following what the LAMMPS documentation says you need to do. It is highly unlikely that a bug of this kind would go unnoticed, since we have people regularly benchmarking LAMMPS (with and without Kokkos, and with and without GPUs), and they notice performance differences of a few percent, while using or not using a GPU would make a far larger difference.

Please also note that the LAMMPS documentation for the KOKKOS package strongly discourages oversubscribing GPUs: there should be only one MPI rank per GPU. This is different from how the GPU package works.
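For example, on a node with two GPUs, a launch following that advice would look something like this (a sketch only; the executable name and input file are placeholders, not a prescription):

# 2 MPI ranks on the node, one per GPU; "-k on g 2" tells Kokkos to use 2 GPUs
mpirun -np 2 ./lmp_kokkos_cuda_mpi -k on g 2 -sf kk -in in.lj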

Axel.

My job script is as follows:

#!/bin/sh

#SBATCH -N 1
#SBATCH --ntasks-per-node=4
#SBATCH --time=02:50:20
#SBATCH --job-name=lammps_cuda_kokkos
#SBATCH --error=job.%J.err
#SBATCH --output=job.%J.out
#SBATCH --partition=gpu
#SBATCH --gres=gpu:2
###SBATCH --constraint=gpu

module load compiler/intel/2018.2.199
module load compiler/intel-mpi/mpi-2018.2.199
module load compiler/gcc/8.3.0

export PATH=/home/manjunath/kokkos_Lammps/cuda_gpu_lammps/lammps-29Oct20/src:$PATH

export I_MPI_FALLBACK=disable
export I_MPI_HYDRA_PMI_CONNECT=alltoall
export I_MPI_DEBUG=9

export OMP_PROC_BIND=spread
export OMP_PLACES=threads

For two GPU cards:

time mpiexec.hydra -n $SLURM_NTASKS /home/manjunath/kokkos_Lammps/cuda_gpu_lammps/lammps-29Oct20/src/lmp_kokkos_cuda_mpi -k on g 2 -sf kk -in in.lj

I built lmp_kokkos_cuda_mpi with the GPU arch VOLTA70 and also set the corresponding gencode, i.e. sm_70, in nvcc_wrapper. I am using CUDA version 10.2, and the system I am using has 2 Tesla V100 GPU cards per node.
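For reference, a traditional make build configured that way typically looks something like the following (a sketch; the exact Makefile name and variable values here are assumptions, not necessarily the ones used above):

# in src/MAKE/OPTIONS/Makefile.kokkos_cuda_mpi (or a copy of it):
KOKKOS_DEVICES = Cuda
KOKKOS_ARCH = Volta70       # Tesla V100 = compute capability 7.0 (sm_70)
# then, from the src directory:
make kokkos_cuda_mpi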
Please help me understand the actual problem.


This is not yet conclusive. Can you provide us with the output of that run?

Output with 1 GPU

LAMMPS (29 Oct 2020)
KOKKOS mode is enabled (…/kokkos.cpp:90)
will use up to 1 GPU(s) per node
WARNING: Detected MPICH. Disabling CUDA-aware MPI (…/kokkos.cpp:272)
Lattice spacing in x,y,z = 1.6795962 1.6795962 1.6795962
Created orthogonal box = (0.0000000 0.0000000 0.0000000) to (107.49416 107.49416 107.49416)
2 by 2 by 2 MPI processor grid
Created 1048576 atoms
create_atoms CPU = 0.036 seconds
Neighbor list info …
update every 20 steps, delay 0 steps, check no
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 2.8
ghost atom cutoff = 2.8
binsize = 2.8, bins = 39 39 39
1 neighbor lists, perpetual/occasional/extra = 1 0 0
(1) pair lj/cut/kk, perpetual
attributes: full, newton off, kokkos_device
pair build: full/bin/kk/device
stencil: full/bin/3d
bin: kk/device
Setting up Verlet run …
Unit style : lj
Current step : 0
Time step : 0.005
Per MPI rank memory allocation (min/avg/max) = 24.45 | 24.45 | 24.45 Mbytes
Step Temp E_pair E_mol TotEng Press
0 1.44 -6.7733681 0 -4.6133701 -5.0196704
100000 0.69306427 -5.6721158 0 -4.6325204 0.71741445
Loop time of 1195.89 on 8 procs for 100000 steps with 1048576 atoms

Performance: 36123.863 tau/day, 83.620 timesteps/s
68.6% CPU use with 8 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total

Pair | 230.94 | 250.65 | 286.46 | 120.3 | 20.96
Neigh | 27.562 | 47.007 | 62.883 | 165.7 | 3.93
Comm | 506.94 | 534.24 | 555.57 | 59.9 | 44.67
Output | 0.0014031 | 0.0043901 | 0.013426 | 6.7 | 0.00
Modify | 316.82 | 356.9 | 373.42 | 89.3 | 29.84
Other | | 7.075 | | | 0.59

Nlocal: 131072.0 ave 131128 max 130962 min
Histogram: 1 0 0 1 1 0 0 1 2 2
Nghost: 45405.1 ave 45449 max 45372 min
Histogram: 1 3 1 0 0 0 0 1 0 2
Neighs: 0.00000 ave 0 max 0 min
Histogram: 8 0 0 0 0 0 0 0 0 0
FullNghs: 9.83080e+06 ave 9.84135e+06 max 9.81286e+06 min
Histogram: 1 0 0 1 1 0 0 2 1 2

Total # of neighbors = 78646384
Ave neighs/atom = 75.003036
Neighbor list builds = 5000
Dangerous builds not checked
Total wall time: 0:19:59


Output with 2 GPUs

LAMMPS (29 Oct 2020)
KOKKOS mode is enabled (…/kokkos.cpp:90)
will use up to 2 GPU(s) per node
WARNING: Detected MPICH. Disabling CUDA-aware MPI (…/kokkos.cpp:272)
Lattice spacing in x,y,z = 1.6795962 1.6795962 1.6795962
Created orthogonal box = (0.0000000 0.0000000 0.0000000) to (107.49416 107.49416 107.49416)
2 by 2 by 2 MPI processor grid
Created 1048576 atoms
create_atoms CPU = 0.036 seconds
Neighbor list info …
update every 20 steps, delay 0 steps, check no
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 2.8
ghost atom cutoff = 2.8
binsize = 2.8, bins = 39 39 39
1 neighbor lists, perpetual/occasional/extra = 1 0 0
(1) pair lj/cut/kk, perpetual
attributes: full, newton off, kokkos_device
pair build: full/bin/kk/device
stencil: full/bin/3d
bin: kk/device
Setting up Verlet run …
Unit style : lj
Current step : 0
Time step : 0.005
Per MPI rank memory allocation (min/avg/max) = 24.45 | 24.45 | 24.45 Mbytes
Step Temp E_pair E_mol TotEng Press
0 1.44 -6.7733681 0 -4.6133701 -5.0196704
100000 0.69293473 -5.6719934 0 -4.6325923 0.71721185
Loop time of 1193.63 on 8 procs for 100000 steps with 1048576 atoms

Performance: 36192.000 tau/day, 83.778 timesteps/s
68.5% CPU use with 8 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total

Pair | 214.59 | 253.82 | 285.6 | 129.8 | 21.26
Neigh | 43.685 | 49.943 | 55.428 | 50.5 | 4.18
Comm | 487.38 | 529.66 | 568.44 | 110.0 | 44.37
Output | 0.001313 | 0.0047092 | 0.011212 | 5.5 | 0.00
Modify | 326.58 | 353.03 | 364.6 | 60.6 | 29.58
Other | | 7.179 | | | 0.60

Nlocal: 131072.0 ave 131218 max 130943 min
Histogram: 1 0 1 1 2 1 1 0 0 1
Nghost: 45387.2 ave 45502 max 45234 min
Histogram: 1 0 1 1 0 0 1 3 0 1
Neighs: 0.00000 ave 0 max 0 min
Histogram: 8 0 0 0 0 0 0 0 0 0
FullNghs: 9.82991e+06 ave 9.84944e+06 max 9.80783e+06 min
Histogram: 1 0 0 1 2 1 0 2 0 1

Total # of neighbors = 78639318
Ave neighs/atom = 74.996298
Neighbor list builds = 5000
Dangerous builds not checked
Total wall time: 0:19:58


Ok, thanks. This all looks reasonable. So there are no obvious mistakes. It will take a while to debug this further since I will first have to build a suitable executable to be run on a multi-GPU node (my development machine has only one GPU).

Yes, sure. Please let me know if you find the solution to the above issue.

Thanks
Ranjit

Ok. I have finally gotten hold of a dual GPU machine (2 Tesla P100).
I could not run the stable release, probably because the CUDA version is too new.
However, there was no problem running the latest patch release 8 April 2021.
And it would use both GPUs. Mind you, I was running with 2 MPI ranks only.

So I cannot tell what is causing your troubles. :man_shrugging:
It may be caused by some issues with the local setup.

Thanks for the reply. Could you please provide the build instructions that you followed to set up Kokkos with CUDA for LAMMPS? It will help me find the issue, if any, in my setup.

Regards,
Ranjit

I just used a standard CMake setup as described in the manual. To customize it, I made a copy of cmake/presets/kokkos-cuda.cmake, edited it for my GPU hardware (changing Kokkos_ARCH_MAXWELL50 to Kokkos_ARCH_PASCAL60), and then ran:

cmake -S cmake -B build -C cmake/presets/minimal.cmake -C kokkos-cuda.cmake
cmake --build build

to configure and compile.
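For the V100 nodes described earlier in this thread, an analogous setup might look something like this (a sketch; the copied file name and the exact option line are assumptions based on the stock kokkos-cuda.cmake preset, adjust to your hardware):

# copy the Kokkos CUDA preset and adapt it for Volta (Tesla V100)
cp cmake/presets/kokkos-cuda.cmake volta-kokkos-cuda.cmake
# in the copy, switch the architecture option, e.g.:
#   set(Kokkos_ARCH_VOLTA70 ON CACHE BOOL "" FORCE)
cmake -S cmake -B build -C cmake/presets/minimal.cmake -C volta-kokkos-cuda.cmake
cmake --build build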

Axel.

If it's possible, could you please share it with me?

Share what?

A reference for that CMake file, or any other references you have for setting up LAMMPS with Kokkos and CUDA support. Actually, I work in the HPC domain; I am benchmarking LAMMPS on an HPC cluster with more than 200k CPUs and thousands of GPUs, so it would be helpful if you have any documentation that will help me understand how LAMMPS works with Kokkos.

Please read the LAMMPS manual. It is very detailed about how to compile, configure, and use LAMMPS as well as how to use the various accelerator features. It is all there. That is what you should have done anyway.

Let's close this topic now. There is nothing here pointing toward a failure of the KOKKOS package in LAMMPS, and I already explained that such a failure would have been noticed. So my conclusion is that there is something not quite right with your machine setup, or how you manage GPU access in batch, or something else that I cannot know from remote.

Please also note that you should be running with one MPI rank per GPU unless you are using CUDA MPS. I will only answer additional questions if they are: 1) posted as a new topic, 2) in reference to a more specific issue, and 3) showing that you have a good understanding of the relevant information in the LAMMPS manual.
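If you do want to keep more than one MPI rank per GPU, CUDA MPS is typically enabled along these lines (a sketch; how MPS is actually managed on a given cluster, often through the batch system, is an assumption here):

# start the CUDA Multi-Process Service daemon before the oversubscribed run
nvidia-cuda-mps-control -d
# ... run the job ...
# stop the daemon afterwards
echo quit | nvidia-cuda-mps-control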

OK, I will figure it out. Thanks for your valuable response.