Performance of LAMMPS with the KOKKOS and GPU packages on multiple GPUs

Greetings,

I would like to ask for your recommendations on achieving a good performance gain with LAMMPS on NVIDIA A100 GPUs. Currently, I am limited to only 4 GPUs per node. I have tried both the latest LAMMPS release (LAMMPS/02Aug2023) and a pre-installed older version (LAMMPS/23Jun2022 with CUDA-11.4.1) on a cluster I have access to. I observe no significant performance gain with 2 or 4 GPUs over using just one. I am benchmarking the 3d LJ melt from the LAMMPS examples folder (file name: in.melt) with box sizes ranging from a few thousand atoms to 63 million atoms. Could you tell me whether the commands I use for running LAMMPS with the GPU/KOKKOS packages are correct, or whether they are the reason I am not getting better performance with more GPUs?

srun lmp -in in.melt -sf gpu -pk gpu 4 neigh no newton off split -1.0 # for GPU package

mpirun -np 4 --oversubscribe --use-hwthread-cpus --map-by hwthread lmp -in in.melt -k on g 4 -sf kk -pk kokkos neigh full newton off gpu/aware on # for KOKKOS package

Thank you in advance!

Best regards,
M

The GPU package requires at least one MPI process per GPU. So when running in serial - as you do in this command line - there is no benefit in telling LAMMPS that you have 4 GPUs; it will still use only 1 GPU. Thus you need at least 4 MPI processes to use 4 GPUs on a single node. You can also attach 8 MPI processes to the 4 GPUs (two per GPU), since this gives you MPI parallelization for the non-accelerated parts of the code and increases GPU utilization. Why do you use "neigh no"? For pair style lj/cut, "neigh yes" should be faster, and the CPU/GPU load balancing from "split" rarely leads to an improvement. There is a point, however, where adding more MPI processes no longer helps.
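A minimal sketch of what that could look like for the GPU package (assuming a single node with 4 GPUs; if your cluster uses srun, request 8 tasks in the batch script instead of passing -np):

mpirun -np 8 lmp -in in.melt -sf gpu -pk gpu 4 neigh yes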

You should definitely get a speedup with Kokkos and multiple A100 GPUs when using 63 million atoms. Can you post log files for 1 vs. 4 GPUs? I would also bind to core rather than hwthread, something like
mpiexec -np 4 --bind-to core
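Spelled out against your input, that would look something like this (a sketch; the exact binding options depend on your MPI library):

mpiexec -np 4 --bind-to core lmp -in in.melt -k on g 4 -sf kk -pk kokkos neigh full newton off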

Dear Axel,
Thank you very much for your suggestions. I re-ran the simulations following your latest comments and got a two- to three-fold performance gain compared to what I was getting before. Also, I am finally observing better performance with 2 GPUs than with 1.

Sure, please take a look at the following lines.
(Unfortunately, I am not able to upload the files, since I am a new user.)

Below is the log file for the 1-GPU run:

LAMMPS (2 Aug 2023)
KOKKOS mode is enabled (src/KOKKOS/kokkos.cpp:107)
  will use up to 1 GPU(s) per node
  using 1 OpenMP thread(s) per MPI task
package kokkos
package kokkos neigh full newton off
variable        x index 1
variable        y index 1
variable        z index 1
#variable        t index 100

variable        xx equal 250*$x
variable        xx equal 250*1
variable        yy equal 250*$y
variable        yy equal 250*1
variable        zz equal 250*$z
variable        zz equal 250*1

units           lj
atom_style      atomic

neighbor        0.3 bin
neigh_modify    delay 0 every 20 check no

read_data       inputs/data.eq
Reading data file ...
  orthogonal box = (0 0 0) to (419.89905 419.89905 419.89905)
  1 by 1 by 1 MPI processor grid
  reading atoms ...
  62500000 atoms
  reading velocities ...
  62500000 velocities
  read_data CPU = 319.081 seconds

mass            1 1.0

velocity        all create 1.44 87287 loop geom

pair_style      lj/cut 2.5
pair_coeff      1 1 1.0 1.0 2.5

fix             1 all nve

thermo          1000

run             10000
Generated 0 of 0 mixed pair_coeff terms from geometric mixing rule
Neighbor list info ...
  update: every = 20 steps, delay = 0 steps, check = no
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 2.8
  ghost atom cutoff = 2.8
  binsize = 2.8, bins = 150 150 150
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair lj/cut/kk, perpetual
      attributes: full, newton off, kokkos_device
      pair build: full/bin/kk/device
      stencil: full/bin/3d
      bin: kk/device
Per MPI rank memory allocation (min/avg/max) = 8212 | 8212 | 8212 Mbytes
   Step          Temp          E_pair         E_mol          TotEng         Press     
         0   1.44          -5.6721713      0             -3.5121713      1.3489248    
      1000   1.1320194     -5.2097067      0             -3.5116775      3.3018178    
      2000   1.1313312     -5.2101724      0             -3.5131757      3.2988243    
      3000   1.1307966     -5.2108814      0             -3.5146865      3.2952496    
      4000   1.1301463     -5.2114007      0             -3.5161813      3.2921701    
      5000   1.1295993     -5.2120664      0             -3.5176675      3.288898     
      6000   1.1289812     -5.2126359      0             -3.5191641      3.2857308    
      7000   1.1282893     -5.2130764      0             -3.5206424      3.2831564    
      8000   1.1277573     -5.2137658      0             -3.5221299      3.2793754    
      9000   1.1271463     -5.2143212      0             -3.5236018      3.2760678    
     10000   1.1263523     -5.2146041      0             -3.5250757      3.2739909    
Loop time of 1077.07 on 1 procs for 10000 steps with 62500000 atoms

Performance: 4010.880 tau/day, 9.284 timesteps/s, 580.278 Matom-step/s

99.7% CPU use with 1 MPI tasks x 1 OpenMP threads
MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 54.355     | 54.355     | 54.355     |   0.0 |  5.05
Neigh   | 134.36     | 134.36     | 134.36     |   0.0 | 12.47
Comm    | 25.36      | 25.36      | 25.36      |   0.0 |  2.35
Output  | 0.018137   | 0.018137   | 0.018137   |   0.0 |  0.00
Modify  | 858.12     | 858.12     | 858.12     |   0.0 | 79.67
Other   |            | 4.859      |            |       |  0.45

Nlocal:       6.25e+07 ave    6.25e+07 max    6.25e+07 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Nghost:    2.53415e+06 ave 2.53415e+06 max 2.53415e+06 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Neighs:              0 ave           0 max           0 min
Histogram: 1 0 0 0 0 0 0 0 0 0
FullNghs:  4.23087e+08 ave 4.23087e+08 max 4.23087e+08 min
Histogram: 1 0 0 0 0 0 0 0 0 0

Total # of neighbors = 4.2308691e+08
Ave neighs/atom = 6.7693905
Neighbor list builds = 500
Dangerous builds not checked
Total wall time: 0:23:44

Below is the log file for the 4-GPU run:

LAMMPS (2 Aug 2023)
KOKKOS mode is enabled (src/KOKKOS/kokkos.cpp:107)
  will use up to 4 GPU(s) per node
  using 1 OpenMP thread(s) per MPI task
package kokkos
package kokkos neigh full newton off
variable        x index 1
variable        y index 1
variable        z index 1
#variable        t index 100

variable        xx equal 250*$x
variable        xx equal 250*1
variable        yy equal 250*$y
variable        yy equal 250*1
variable        zz equal 250*$z
variable        zz equal 250*1

units           lj
atom_style      atomic

neighbor        0.3 bin
neigh_modify    delay 0 every 20 check no

read_data       inputs/data.eq
Reading data file ...
  orthogonal box = (0 0 0) to (419.89905 419.89905 419.89905)
  1 by 1 by 1 MPI processor grid
  reading atoms ...
  62500000 atoms
  reading velocities ...
  62500000 velocities
  read_data CPU = 315.561 seconds

mass            1 1.0

velocity        all create 1.44 87287 loop geom

pair_style      lj/cut 2.5
pair_coeff      1 1 1.0 1.0 2.5

fix             1 all nve

thermo          1000

run             10000
Generated 0 of 0 mixed pair_coeff terms from geometric mixing rule
Neighbor list info ...
  update: every = 20 steps, delay = 0 steps, check = no
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 2.8
  ghost atom cutoff = 2.8
  binsize = 2.8, bins = 150 150 150
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair lj/cut/kk, perpetual
      attributes: full, newton off, kokkos_device
      pair build: full/bin/kk/device
      stencil: full/bin/3d
      bin: kk/device
Per MPI rank memory allocation (min/avg/max) = 8212 | 8212 | 8212 Mbytes
   Step          Temp          E_pair         E_mol          TotEng         Press     
         0   1.44          -5.6721713      0             -3.5121713      1.3489248    
      1000   1.1320227     -5.2097114      0             -3.5116774      3.3017968    
      2000   1.1313084     -5.2101461      0             -3.5131835      3.2993054    
      3000   1.130643      -5.2106472      0             -3.5146826      3.2961689    
      4000   1.1300122     -5.2111973      0             -3.5161791      3.293421     
      5000   1.1295501     -5.2119935      0             -3.5176684      3.2888873    
      6000   1.1288471     -5.2124251      0             -3.5191544      3.2865217    
      7000   1.1283809     -5.2132169      0             -3.5206456      3.282346     
      8000   1.1278019     -5.2138256      0             -3.5221228      3.2789989    
      9000   1.1272235     -5.2144405      0             -3.5236052      3.2752399    
     10000   1.1265071     -5.2148391      0             -3.5250784      3.2730331    
Loop time of 1067.6 on 1 procs for 10000 steps with 62500000 atoms

Performance: 4046.457 tau/day, 9.367 timesteps/s, 585.425 Matom-step/s
99.6% CPU use with 1 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 54.377     | 54.377     | 54.377     |   0.0 |  5.09
Neigh   | 134.36     | 134.36     | 134.36     |   0.0 | 12.59
Comm    | 25.615     | 25.615     | 25.615     |   0.0 |  2.40
Output  | 0.018136   | 0.018136   | 0.018136   |   0.0 |  0.00
Modify  | 848.39     | 848.39     | 848.39     |   0.0 | 79.47
Other   |            | 4.843      |            |       |  0.45

Nlocal:       6.25e+07 ave    6.25e+07 max    6.25e+07 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Nghost:    2.53394e+06 ave 2.53394e+06 max 2.53394e+06 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Neighs:              0 ave           0 max           0 min
Histogram: 1 0 0 0 0 0 0 0 0 0
FullNghs:  4.23079e+08 ave 4.23079e+08 max 4.23079e+08 min
Histogram: 1 0 0 0 0 0 0 0 0 0

Total # of neighbors = 4.2307924e+08
Ave neighs/atom = 6.7692678
Neighbor list builds = 500
Dangerous builds not checked
Total wall time: 0:23:31

In both cases you are only running on a single MPI rank and therefore only using 1 GPU, even though you requested 4 GPUs. You can see that from these lines:

  1 by 1 by 1 MPI processor grid

Loop time of 1067.6 on 1 procs for 10000 steps with 62500000 atoms

What is your mpirun command?

I think you are right. It seems I was only using 1 GPU.
My command was:

mpirun -np 1 lmp -in in.melt -k on g 4 -sf kk -pk kokkos neigh full newton off

Ah yes, you need 4 MPI ranks (1 rank per GPU), something like:

mpirun -np 4 lmp -in in.melt -k on g 4 -sf kk
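Carrying over the package options from your earlier command line, the full invocation would look something like this (the gpu/aware setting assumes your MPI library is CUDA-aware):

mpirun -np 4 lmp -in in.melt -k on g 4 -sf kk -pk kokkos neigh full newton off gpu/aware on

With 4 ranks, the log should then report a processor grid other than "1 by 1 by 1" and "4 procs" in the loop-time line.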