Optimizing LAMMPS for my purposes

Hello there! This is my first time posting here, so fingers crossed I'm in the right place. I'm reasonably new to LAMMPS, and molecular dynamics is not my area of expertise, so I appreciate your patience.

I am working on simulating a large number of argon atoms as a 2D fluid at very low density (1e-6 in LJ units) and high temperature (100 in LJ units). The particles interact via the Lennard-Jones potential. Currently, my script can simulate 1 million atoms, but the simulation takes around 10 minutes to run 1e4 timesteps of size 5e-3. I believe there is room for optimization in my code that could improve these numbers.

LAMMPS is running on my personal machine with an NVIDIA GeForce RTX 3070 Ti, an Intel Core i7-12800H (14 cores, 20 logical processors), and 32 GB of RAM. I built LAMMPS with CMake using the most.cmake preset. I have attached my script to this post and would appreciate any tips to improve performance.
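
For context, the build went roughly along these lines (approximate, not an exact record; sm_86 is the architecture flag I believe matches the 3070 Ti):

# approximate reconstruction of the build, run from inside the LAMMPS source tree
mkdir build && cd build
cmake -C ../cmake/presets/most.cmake -D PKG_GPU=on -D GPU_API=cuda -D GPU_ARCH=sm_86 ../cmake
cmake --build . -j 8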

P.S.: New users can't upload files, so I've posted my script on Pastebin instead.

Can you share a log file? Since you’ve posted something, you should be able to add files now.

One thing that comes to mind without seeing the log: you should run the simulation on at most 6 cores, because the remaining ones will slow the calculation down, either because they are efficiency cores (optimized for energy use, not speed) or because they are hyper-threaded siblings of the performance cores.
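
For example, something along these lines should keep all ranks on the performance cores (untested sketch; it assumes an OpenMPI-style launcher, that in.lj stands in for your input file, and that the six P-cores are enumerated first, which you can check with lscpu):

# bind 6 MPI ranks to the first 6 physical cores (assumed here to be the P-cores)
mpirun -np 6 --map-by core --bind-to core lmp -in in.lj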

Nope, still can’t post.
https://pastebin.com/uXpAuPny
Sidenote, Pastebin flagged it as potentially harmful…
Edit: I could only post it as private, I’m linking here instead.

LAMMPS (2 Aug 2023 - Development - patch_2Aug2023-427-g75682ffbca)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
  using 1 OpenMP thread(s) per MPI task
package gpu 0
echo both

package gpu 1 omp 2 device_type nvidiagpu

variable	temperature equal 100
variable	tempDamp equal 2.5e-6
variable	density equal 1e-6
variable	particlessqrt equal 1000 # sqrt(number of particles)
variable	seed equal 74581
variable	thermo_out_freq equal 1000
variable	dump_out_freq equal 500
variable	time_step equal 5e-3
variable	thermo_time equal 15e3
variable	run_time	equal 1e4

timer 		timeout 11:55:00 every 1000
New timer settings: style=normal  mode=nosync  timeout=11:55:00
dimension	2
units		lj
atom_style	atomic
# Create square lattice that becomes a gas upon equilibration at high temperature
lattice		sq ${density}
lattice		sq 1e-06
Lattice spacing in x,y,z = 1000 1000 1000
region		box block 0 ${particlessqrt} 0 ${particlessqrt} -0.1 0.1
region		box block 0 1000 0 ${particlessqrt} -0.1 0.1
region		box block 0 1000 0 1000 -0.1 0.1
create_box	1 box
Created orthogonal box = (0 0 -100) to (1000000 1000000 100)
  5 by 2 by 1 MPI processor grid
create_atoms	1 box
Created 1000000 atoms
  using lattice units in orthogonal box = (0 0 -100) to (1000000 1000000 100)
  create_atoms CPU = 0.029 seconds
pair_style	lj/cut/gpu 2.5
pair_coeff	* * 1.0 1.0 2.5
mass		1 1.0
thermo		${thermo_out_freq}
thermo		1000
timestep	${time_step}
timestep	0.005
# default time step is 5e-3
neighbor	500.0		bin

#########################
# Equilibrate
#########################
fix 		1 all nvt/gpu temp  ${temperature} ${temperature} $(285.0*dt)
fix 		1 all nvt/gpu temp  100 ${temperature} $(285.0*dt)
fix 		1 all nvt/gpu temp  100 100 $(285.0*dt)
fix 		1 all nvt/gpu temp  100 100 1.4250000000000000444
#fix		2 all momentum 10000 linear 1 1 1 angular
fix 		3 all enforce2d
velocity        all create ${temperature} ${seed} dist gaussian
velocity        all create 100 ${seed} dist gaussian
velocity        all create 100 74581 dist gaussian
variable fx atom sin(2*PI*y*${density})*sqrt(${temperature})/100/(${thermo_time}*${time_step})
variable fx atom sin(2*PI*y*1e-06)*sqrt(${temperature})/100/(${thermo_time}*${time_step})
variable fx atom sin(2*PI*y*1e-06)*sqrt(100)/100/(${thermo_time}*${time_step})
variable fx atom sin(2*PI*y*1e-06)*sqrt(100)/100/(15000*${time_step})
variable fx atom sin(2*PI*y*1e-06)*sqrt(100)/100/(15000*0.005)
fix 4 all addforce v_fx 0. 0.
run 		${thermo_time}
run 		15000

Generated 0 of 0 mixed pair_coeff terms from geometric mixing rule
Per MPI rank memory allocation (min/avg/max) = 28.16 | 28.16 | 28.16 Mbytes
   Step          Temp          E_pair         E_mol          TotEng         Press     
         0   100            0              0              99.9999        9.99999e-05  
      1000   100.00008      0              0              99.999982      9.9999982e-05
      2000   100.00038      0              0              100.00028      0.00010000028
      3000   99.997519      0              0              99.997419      9.9997419e-05
      4000   100.01986      0              0              100.01976      0.00010001976
      5000   99.989808      3.3399417e-06  0              99.989711      9.9989747e-05
      6000   99.843516     -1.044226e-06   0              99.843415      9.9843416e-05
      7000   100.0019      -4.3974058e-07  0              100.0018       0.0001000018 
      8000   100.25663      3.203424e-05   0              100.25657      0.00010025677
      9000   100.0668      -1.4849451e-06  0              100.0667       0.0001000667 
     10000   99.944044     -1.0559888e-06  0              99.943943      9.9943942e-05
     11000   99.964048      3.5635778e-06  0              99.963952      9.9963991e-05
     12000   99.988759     -1.9086447e-06  0              99.988657      9.9988654e-05
     13000   100.10312      6.0832521e-08  0              100.10302      0.00010010304
     14000   100.0813       1.838889e-05   0              100.08121      0.00010008135
     15000   100.10201     -1.4708056e-06  0              100.10191      0.00010010191
Loop time of 422.865 on 20 procs for 15000 steps with 1000000 atoms

Performance: 15324.036 tau/day, 35.472 timesteps/s, 35.472 Matom-step/s
197.5% CPU use with 10 MPI tasks x 2 OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 92.415     | 98.635     | 103.16     |  37.7 | 23.33
Neigh   | 0.060491   | 0.07019    | 0.079606   |   1.9 |  0.02
Comm    | 21.621     | 22.472     | 23.17      |  11.6 |  5.31
Output  | 0.032442   | 0.034114   | 0.035846   |   0.5 |  0.01
Modify  | 255.22     | 264.43     | 272.32     |  33.7 | 62.53
Other   |            | 37.22      |            |       |  8.80

Nlocal:         100000 ave      100034 max       99976 min
Histogram: 1 1 1 4 0 0 2 0 0 1
Nghost:          701.6 ave         727 max         673 min
Histogram: 2 1 1 0 0 0 2 1 1 2
Neighs:              0 ave           0 max           0 min
Histogram: 10 0 0 0 0 0 0 0 0 0

Total # of neighbors = 0
Ave neighs/atom = 0
Neighbor list builds = 67
Dangerous builds = 0

reset_timestep	0
#########################
# Production
#########################
unfix		1
#unfix		2
unfix 		4
fix 		5 all print 100 "$(step) $(temp) $(pe) $(ke) $(press) $(density)" file thermo.txt title "# step temp pe ke press density" screen no
fix 		6 all nve/gpu
#dump            myDump all atom ${dump_out_freq} pablo_lj.pos
#dump 			myDump2 all custom ${dump_out_freq} pablo_lj.vel id vx vy
#run 		1000000
dump            myDump all atom ${dump_out_freq} pablo_lj_short.pos
dump            myDump all atom 500 pablo_lj_short.pos
#dump 			myDump2 all custom ${dump_out_freq} pablo_lj_short.vel id vx vy
dump			myDump3 all custom ${dump_out_freq} particles.data id x y vx vy
dump			myDump3 all custom 500 particles.data id x y vx vy
run 		${run_time}
run 		10000
Generated 0 of 0 mixed pair_coeff terms from geometric mixing rule
Per MPI rank memory allocation (min/avg/max) = 24.22 | 24.23 | 24.23 Mbytes
   Step          Temp          E_pair         E_mol          TotEng         Press     
         0   100.10201     -1.4708056e-06  0              100.10191      0.00010010191
      1000   103.79139     -1.3592123e-06  0              103.79128      0.00010379128
      2000   103.84707     -3.7718733e-07  0              103.84696      0.00010384696
      3000   103.85145      1.0619734e-05  0              103.85136      0.00010385145
      4000   103.86236      1.8307881e-06  0              103.86226      0.0001038623 
      5000   103.88417      1.3530946e-05  0              103.88408      0.00010388419
      6000   103.90217     -1.0495417e-06  0              103.90207      0.00010390207
      7000   104.03153     -1.2831594e-06  0              104.03143      0.00010403143
      8000   104.06044      1.0778544e-05  0              104.06035      0.00010406043
      9000   104.08473     -1.9532786e-06  0              104.08462      0.00010408463
     10000   104.08981     -3.01135e-06    0              104.0897       0.00010408971
Loop time of 135.218 on 20 procs for 10000 steps with 1000000 atoms

Performance: 31948.480 tau/day, 73.955 timesteps/s, 73.955 Matom-step/s
195.2% CPU use with 10 MPI tasks x 2 OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 45.86      | 47.607     | 50.611     |  23.3 | 35.21
Neigh   | 0.37639    | 0.39791    | 0.42449    |   2.6 |  0.29
Comm    | 16.843     | 17.826     | 18.493     |  12.0 | 13.18
Output  | 5.7644     | 6.3837     | 6.9683     |  15.4 |  4.72
Modify  | 26.222     | 28.519     | 30.203     |  24.6 | 21.09
Other   |            | 34.48      |            |       | 25.50

Nlocal:         100000 ave      100022 max       99969 min
Histogram: 1 0 1 1 1 0 1 2 1 2
Nghost:          707.7 ave         738 max         664 min
Histogram: 1 0 0 1 2 0 3 0 2 1
Neighs:              0 ave           0 max           0 min
Histogram: 10 0 0 0 0 0 0 0 0 0

Total # of neighbors = 0
Ave neighs/atom = 0
Neighbor list builds = 379
Dangerous builds = 0

write_data	lj.lammps-data
System init for write_data ...
Generated 0 of 0 mixed pair_coeff terms from geometric mixing rule
Total wall time: 0:09:25

I am not sure how much more performance you expect from your machine. 74 million atom-steps per second (the Matom-step/s figure at the end of the log) is well within the range of previous (albeit pretty old) LAMMPS benchmarks on supercomputers.

The MPI/OpenMP breakdown might be tweakable for some further performance gain (either 1 MPI proc of 20 OpenMP threads, 2 procs of 10 threads, 4 of 5, 5 of 4, 10 of 2 – which is where you’re at – or 20 MPI procs and no OpenMP parallelism), but I’d be (pleasantly!) surprised if you got much out of that.
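
If you want to scan those combinations quickly, here is a rough, untested sketch, assuming an OpenMPI-style launcher and with in.lj standing in for your input file:

# try each MPI x OpenMP split that totals 20 threads; adjust or drop the
# "package gpu 1 omp 2" line in the input so it does not override the thread count set here
for np in 1 2 4 5 10 20; do
    omp=$((20 / np))
    mpirun -np $np -x OMP_NUM_THREADS=$omp lmp -in in.lj -log log.${np}x${omp}
done

The Loop time and Matom-step/s lines at the end of each log then give a direct comparison.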

Getting efficient GPU acceleration for pair style lj/cut is about the hardest case in MD, since there is so little computational work in the 12-6 Lennard-Jones potential (that is why it was chosen for CPUs in the first place, over the more accurate but much more time-consuming Morse potential, or even a 9-6 Lennard-Jones).

Having a 2-dimensional system makes this even harder: GPUs need many work units, and in a 2D simulation the number of neighbors within the cutoff grows only as O(r_{cut}^2) instead of O(r_{cut}^3).
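
To put a rough number on how sparse this system is: at reduced density 1e-6 and r_{cut} = 2.5, the expected neighbor count per atom in 2D is about rho * pi * r_{cut}^2 = 1e-6 * 3.14 * 6.25 ≈ 2e-5, i.e. virtually no atom ever has a neighbor inside the cutoff, which is consistent with the essentially zero pair energy in your log. The GPU therefore has almost no pair work over which to amortize its overhead.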

10 MPI ranks for just one GPU is probably overkill. I would stop at 4, since your GPU is a good consumer GPU but not a high-end data center GPU. Then again, only some systematic benchmarks can tell. There is very little left to improve, though: the neighbor list and pair computation already run on the GPU, time integration (which is only O(N) in complexity) is already parallelized with both MPI and OpenMP, and there is no gain to be had from Kokkos (it requires full double precision and is only effective with one MPI rank per GPU unless you use the NVIDIA persistence daemon).

In summary, from my experience I agree with @srtee's comment: your performance is already pretty decent for your hardware and choices, and there is very little hope for significant further gains.

Here’s one other thing to think about. If you are running this on a machine with sufficient cooling, there might be a bit more performance available with some planning.

Notice how Communication takes up a significant portion of the calculation time. This suggests that using fewer processors may lead to more efficient calculations. In turn, if you plan your runs so you can do two at a time instead of just one, you should get an overall speedup.

As a concrete example, right now you get about 32,000 tau/day on 20 procs. Let's imagine you can get 20,000 tau/day on 9* procs. If you then run two simulations at once instead of one, you will accomplish 40,000 tau/day of simulation in total.

*I suggest 9 procs because you are running a 2D simulation, and a 3 by 3 decomposition of the simulation domain may be optimal (assuming the domain has roughly equal x and y extents).
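
If you try this, the launch could look something like the sketch below (untested; in.run1, in.run2 and the log names are placeholders, both copies would still share your single GPU, and depending on how your MPI counts the 14 cores you may need its oversubscribe option). Adding "processors 3 3 1" to each input forces the 3 by 3 decomposition explicitly:

# run two independent 9-rank copies side by side and wait for both to finish
mpirun -np 9 lmp -in in.run1 -log log.run1 &
mpirun -np 9 lmp -in in.run2 -log log.run2 &
wait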