Performance of LAMMPS with the KOKKOS and GPU packages on multiple GPUs

Greetings,

I would like to ask for your recommendations on achieving a good performance gain with LAMMPS on NVIDIA A100 GPUs. Currently, I am limited to only 4 GPUs per node. I have tried both the latest LAMMPS release (LAMMPS/02Aug2023) and a pre-installed older version (LAMMPS/23Jun2022 with CUDA-11.4.1) on a cluster I have access to. I observe no significant performance gain with 2 or 4 GPUs over using just one. I am benchmarking the 3d LJ melt from the LAMMPS examples folder (file name: in.melt) with box sizes ranging from a few thousand atoms to 63 million atoms. Could you tell me whether the commands I use for running LAMMPS with the GPU/KOKKOS packages are correct, or whether they are the reason I am not getting better performance with more GPUs?

srun lmp -in in.melt -sf gpu -pk gpu 4 neigh no newton off split -1.0 # for GPU package

mpirun -np 4 --oversubscribe --use-hwthread-cpus --map-by hwthread lmp -in in.melt -k on g 4 -sf kk -pk kokkos neigh full newton off gpu/aware on # for KOKKOS package

Thank you in advance!

Best regards,
M

The GPU package requires at least one MPI process per GPU. So when running in serial - as you do in this command line - there is no benefit in telling LAMMPS that you have 4 GPUs; it will still use only 1 GPU. Thus you need at least 4 MPI processes to use 4 GPUs on a single node. You can also attach 8 MPI processes to the 4 GPUs (two per GPU), since this gives you MPI parallelization for the non-accelerated parts of the code and increases GPU utilization. Why do you use "neigh no"? For pair style lj/cut, "neigh yes" should be faster, and the CPU/GPU load balancing from "split" rarely leads to an improvement. There is a point, however, where adding more MPI processes no longer helps.
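A minimal sketch of what that could look like for the GPU package (assuming a single node with 4 GPUs; if your cluster uses srun, request 8 tasks in the batch script instead of passing -np):

mpirun -np 8 lmp -in in.melt -sf gpu -pk gpu 4 neigh yes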

You should definitely get a speedup with Kokkos and multiple A100 GPUs when using 63 million atoms. Can you post log files for 1 vs. 4 GPUs? I would also bind to core rather than hwthread, something like
mpiexec -np 4 --bind-to core
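Spelled out against your input, that would look something like this (a sketch; the exact binding options depend on your MPI library):

mpiexec -np 4 --bind-to core lmp -in in.melt -k on g 4 -sf kk -pk kokkos neigh full newton off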

Dear Axel,
Thank you very much for your suggestions. I re-ran the simulations following your latest comments and got a two- to three-fold performance gain compared to what I was getting before. Also, I am finally observing better performance with 2 GPUs than with 1.

Sure, please take a look at the following lines.
(Unfortunately, I am not able to upload the files, since I am a new user.)

Below is the log file for the 1-GPU run:

LAMMPS (2 Aug 2023)
KOKKOS mode is enabled (src/KOKKOS/kokkos.cpp:107)
  will use up to 1 GPU(s) per node
  using 1 OpenMP thread(s) per MPI task
package kokkos
package kokkos neigh full newton off
variable        x index 1
variable        y index 1
variable        z index 1
#variable        t index 100

variable        xx equal 250*$x
variable        xx equal 250*1
variable        yy equal 250*$y
variable        yy equal 250*1
variable        zz equal 250*$z
variable        zz equal 250*1

units           lj
atom_style      atomic

neighbor        0.3 bin
neigh_modify    delay 0 every 20 check no

read_data       inputs/data.eq
Reading data file ...
  orthogonal box = (0 0 0) to (419.89905 419.89905 419.89905)
  1 by 1 by 1 MPI processor grid
  reading atoms ...
  62500000 atoms
  reading velocities ...
  62500000 velocities
  read_data CPU = 319.081 seconds

mass            1 1.0

velocity        all create 1.44 87287 loop geom

pair_style      lj/cut 2.5
pair_coeff      1 1 1.0 1.0 2.5

fix             1 all nve

thermo          1000

run             10000
Generated 0 of 0 mixed pair_coeff terms from geometric mixing rule
Neighbor list info ...
  update: every = 20 steps, delay = 0 steps, check = no
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 2.8
  ghost atom cutoff = 2.8
  binsize = 2.8, bins = 150 150 150
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair lj/cut/kk, perpetual
      attributes: full, newton off, kokkos_device
      pair build: full/bin/kk/device
      stencil: full/bin/3d
      bin: kk/device
Per MPI rank memory allocation (min/avg/max) = 8212 | 8212 | 8212 Mbytes
   Step          Temp          E_pair         E_mol          TotEng         Press     
         0   1.44          -5.6721713      0             -3.5121713      1.3489248    
      1000   1.1320194     -5.2097067      0             -3.5116775      3.3018178    
      2000   1.1313312     -5.2101724      0             -3.5131757      3.2988243    
      3000   1.1307966     -5.2108814      0             -3.5146865      3.2952496    
      4000   1.1301463     -5.2114007      0             -3.5161813      3.2921701    
      5000   1.1295993     -5.2120664      0             -3.5176675      3.288898     
      6000   1.1289812     -5.2126359      0             -3.5191641      3.2857308    
      7000   1.1282893     -5.2130764      0             -3.5206424      3.2831564    
      8000   1.1277573     -5.2137658      0             -3.5221299      3.2793754    
      9000   1.1271463     -5.2143212      0             -3.5236018      3.2760678    
     10000   1.1263523     -5.2146041      0             -3.5250757      3.2739909    
Loop time of 1077.07 on 1 procs for 10000 steps with 62500000 atoms

Performance: 4010.880 tau/day, 9.284 timesteps/s, 580.278 Matom-step/s

99.7% CPU use with 1 MPI tasks x 1 OpenMP threads
MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 54.355     | 54.355     | 54.355     |   0.0 |  5.05
Neigh   | 134.36     | 134.36     | 134.36     |   0.0 | 12.47
Comm    | 25.36      | 25.36      | 25.36      |   0.0 |  2.35
Output  | 0.018137   | 0.018137   | 0.018137   |   0.0 |  0.00
Modify  | 858.12     | 858.12     | 858.12     |   0.0 | 79.67
Other   |            | 4.859      |            |       |  0.45

Nlocal:       6.25e+07 ave    6.25e+07 max    6.25e+07 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Nghost:    2.53415e+06 ave 2.53415e+06 max 2.53415e+06 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Neighs:              0 ave           0 max           0 min
Histogram: 1 0 0 0 0 0 0 0 0 0
FullNghs:  4.23087e+08 ave 4.23087e+08 max 4.23087e+08 min
Histogram: 1 0 0 0 0 0 0 0 0 0

Total # of neighbors = 4.2308691e+08
Ave neighs/atom = 6.7693905
Neighbor list builds = 500
Dangerous builds not checked
Total wall time: 0:23:44

Below is the log file for the 4-GPU run:

LAMMPS (2 Aug 2023)
KOKKOS mode is enabled (src/KOKKOS/kokkos.cpp:107)
  will use up to 4 GPU(s) per node
  using 1 OpenMP thread(s) per MPI task
package kokkos
package kokkos neigh full newton off
variable        x index 1
variable        y index 1
variable        z index 1
#variable        t index 100

variable        xx equal 250*$x
variable        xx equal 250*1
variable        yy equal 250*$y
variable        yy equal 250*1
variable        zz equal 250*$z
variable        zz equal 250*1

units           lj
atom_style      atomic

neighbor        0.3 bin
neigh_modify    delay 0 every 20 check no

read_data       inputs/data.eq
Reading data file ...
  orthogonal box = (0 0 0) to (419.89905 419.89905 419.89905)
  1 by 1 by 1 MPI processor grid
  reading atoms ...
  62500000 atoms
  reading velocities ...
  62500000 velocities
  read_data CPU = 315.561 seconds

mass            1 1.0

velocity        all create 1.44 87287 loop geom

pair_style      lj/cut 2.5
pair_coeff      1 1 1.0 1.0 2.5

fix             1 all nve

thermo          1000

run             10000
Generated 0 of 0 mixed pair_coeff terms from geometric mixing rule
Neighbor list info ...
  update: every = 20 steps, delay = 0 steps, check = no
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 2.8
  ghost atom cutoff = 2.8
  binsize = 2.8, bins = 150 150 150
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair lj/cut/kk, perpetual
      attributes: full, newton off, kokkos_device
      pair build: full/bin/kk/device
      stencil: full/bin/3d
      bin: kk/device
Per MPI rank memory allocation (min/avg/max) = 8212 | 8212 | 8212 Mbytes
   Step          Temp          E_pair         E_mol          TotEng         Press     
         0   1.44          -5.6721713      0             -3.5121713      1.3489248    
      1000   1.1320227     -5.2097114      0             -3.5116774      3.3017968    
      2000   1.1313084     -5.2101461      0             -3.5131835      3.2993054    
      3000   1.130643      -5.2106472      0             -3.5146826      3.2961689    
      4000   1.1300122     -5.2111973      0             -3.5161791      3.293421     
      5000   1.1295501     -5.2119935      0             -3.5176684      3.2888873    
      6000   1.1288471     -5.2124251      0             -3.5191544      3.2865217    
      7000   1.1283809     -5.2132169      0             -3.5206456      3.282346     
      8000   1.1278019     -5.2138256      0             -3.5221228      3.2789989    
      9000   1.1272235     -5.2144405      0             -3.5236052      3.2752399    
     10000   1.1265071     -5.2148391      0             -3.5250784      3.2730331    
Loop time of 1067.6 on 1 procs for 10000 steps with 62500000 atoms

Performance: 4046.457 tau/day, 9.367 timesteps/s, 585.425 Matom-step/s
99.6% CPU use with 1 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 54.377     | 54.377     | 54.377     |   0.0 |  5.09
Neigh   | 134.36     | 134.36     | 134.36     |   0.0 | 12.59
Comm    | 25.615     | 25.615     | 25.615     |   0.0 |  2.40
Output  | 0.018136   | 0.018136   | 0.018136   |   0.0 |  0.00
Modify  | 848.39     | 848.39     | 848.39     |   0.0 | 79.47
Other   |            | 4.843      |            |       |  0.45

Nlocal:       6.25e+07 ave    6.25e+07 max    6.25e+07 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Nghost:    2.53394e+06 ave 2.53394e+06 max 2.53394e+06 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Neighs:              0 ave           0 max           0 min
Histogram: 1 0 0 0 0 0 0 0 0 0
FullNghs:  4.23079e+08 ave 4.23079e+08 max 4.23079e+08 min
Histogram: 1 0 0 0 0 0 0 0 0 0

Total # of neighbors = 4.2307924e+08
Ave neighs/atom = 6.7692678
Neighbor list builds = 500
Dangerous builds not checked
Total wall time: 0:23:31

In both cases you are only running on a single MPI rank and therefore only using 1 GPU, even though you requested 4 GPUs. You can see that from these lines:

  1 by 1 by 1 MPI processor grid

Loop time of 1067.6 on 1 procs for 10000 steps with 62500000 atoms

What is your mpirun command?

I think you are right. It seems I was only using 1 GPU.
My command was:

mpirun -np 1 lmp -in in.melt -k on g 4 -sf kk -pk kokkos neigh full newton off

Ah yes, you need 4 MPI ranks (1 rank per GPU), something like:

mpirun -np 4 lmp -in in.melt -k on g 4 -sf kk
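Carrying over the package options from your earlier command line, the full invocation would look something like this (the gpu/aware setting assumes your MPI library is CUDA-aware):

mpirun -np 4 lmp -in in.melt -k on g 4 -sf kk -pk kokkos neigh full newton off gpu/aware on

With 4 ranks, the log should then report a processor grid other than "1 by 1 by 1" and "4 procs" in the loop-time line.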