Parallel Computing Issue with LAMMPS Simulation

Dear LAMMPS expert,

I’ve been working on a LAMMPS simulation to model the interaction between N2 molecules and a Si surface in parallel. The simulation is run with LAMMPS version lammps-20220107 on a cluster optimized for parallel computing. Each node of this cluster is equipped with two Intel(R) Xeon(R) Platinum 8360Y CPUs, each providing 36 cores (72 cores per node).

To achieve this, I’ve employed a hybrid pair_style approach combining Tersoff and Lennard-Jones potentials as follows:

```
pair_style hybrid tersoff lj/cut 10.0
pair_coeff * * tersoff Si.tersoff Si NULL      # type 1 (Si) handled by Tersoff
pair_coeff 2 2 lj/cut 0.0416 3.31  8.275       # type 2 (N): N-N interactions via Lennard-Jones
pair_coeff 1 2 lj/cut 0.0428 3.025 7.5625      # Si-N cross interactions via Lennard-Jones
neigh_modify every 10 delay 0 check yes
```

After several benchmarking runs, I’ve encountered an issue when running the code with a hybrid of MPI and OpenMP: the configuration with 144 MPI tasks, 1 core per task, and 1 OpenMP thread yields better performance than any other setup (e.g., 4 MPI tasks with 36 cores per task). This suggests that the workload may not benefit as much from the hybrid parallelization strategy I initially set up here.
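For reference, the two layouts were launched roughly along the following lines (simplified sketch; the executable and input file names are placeholders, see the attached lammps_job200.sh for the actual commands):

```
# Sketched launch lines (exact mpirun/srun options depend on the MPI library):
#   144 MPI tasks x 1 thread  :  mpirun -np 144 lmp -in in.N2_Si
#   4 MPI tasks x 36 threads  :  mpirun -np 4 lmp -sf omp -pk omp 36 -in in.N2_Si
# Equivalent input-script form of the -sf/-pk flags for the threaded case:
package omp 36      # 36 OpenMP threads per MPI task (requires the OPENMP package)
suffix omp          # use the /omp variants of pair/fix styles where available
```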

Your insights and suggestions regarding this matter would be greatly appreciated.

The input file and the job script can be found in the attachments.

Best regards,
Bahman Daneshian
input file (5.7 KB)
lammps_job200.sh (1.6 KB)

You are not providing any information about what your “issue” is and how to reproduce it. Thus it is impossible to give any advice beyond pointing out that you are not following the forum guidelines and are not properly quoting your input file excerpts in triple backquotes (```).

Hi. You are right. I have updated my inquiry.

As documented in the LAMMPS manual, for almost all kinds of systems, especially dense systems, the MPI parallelization in LAMMPS is more efficient than the OpenMP threading. In addition, the OpenMP implementation in LAMMPS is tuned to be particularly efficient for small numbers of threads (usually no more than 4, sometimes 8, rarely more, as on the (now defunct) IBM BlueGene supercomputers).
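For illustration, here is a minimal sketch (an assumption on my part, not a benchmarked recommendation) of what a small-thread-count hybrid layout could look like on your 72-core nodes:

```
# e.g. 36 MPI tasks x 2 threads, or 18 MPI tasks x 4 threads, per 72-core node,
# instead of 4 MPI tasks x 36 threads:
package omp 2       # keep the per-task OpenMP thread count small
suffix omp
```

In most cases, though, a pure MPI layout like the one you already benchmarked will remain the fastest.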

Please see the responses in the thread “How to reduce kspace timing%” for more discussion of how to determine and optimize performance in LAMMPS, in addition to the manual section “7. Accelerate performance — LAMMPS documentation”.


Thank you for your feedback.

Best regards,
Bahman Daneshian

Hi again,

After monitoring the MPI task timing breakdown, I added two “fix … all balance <Nfreq> 1.1 rcb” commands to the script (fix 33 and fix 34 in the excerpt below): (1) during the energy minimization with Nfreq = 200, and (2) from the application of the load to the end of the simulation with Nfreq = 1000. Here you can see how I modified the original script with the balance fixes (the original script is attached to my first post):

```
#--------------Part 7: Equilibration-------------------------
thermo          1000
thermo_style    custom step v_avg_disp_y v_avg_sumPE    #c_sumPE
timestep        0.005

fix 1  fixed  setforce 0.0 0.0 0.0       # zero forces on the "fixed" group
fix 11 wall   setforce 0.0 0.0 0.0       # zero forces on the "wall" group
fix 2  mobile nvt temp 300 300 0.1
fix 33 all    balance 200 1.1 rcb        # rebalance every 200 steps during minimization/equilibration
                                         # (rcb requires comm_style tiled, set elsewhere in the input)

dump 1 all custom 1000 dump.${lx}.noN2 id type x y z
min_style cg
minimize 1e-4 1e-6 100 1000
run 3000
unfix 2
unfix 33
fix 4  mobile nve
fix 34 all    balance 1000 1.1 rcb       # rebalance every 1000 steps during the loading stage
```

The purpose was to balance the number of particles, and thus the computational cost (load), evenly across the processors. I used 520 cores (the hardware specification was described above). But the MPI task timing breakdown is still not good:

```
Section | min time | avg time | max time |%varavg| %total
-----------------------------------------------------------
Pair    | 403.84   | 440.21   | 538.12   | 115.6 | 73.28
Neigh   | 10.38    | 14.556   | 24.231   | 122.4 |  2.42
Comm    | 4.8041   | 50.204   | 147.01   | 506.6 |  8.36
Output  | 0.29548  | 0.99973  | 1.6901   |  40.4 |  0.17
Modify  | 22.408   | 91.953   | 149.16   | 396.8 | 15.31
Other   |          | 2.762    |          |       |  0.46
```

Should I change the Nfreq to find a more suitable value?
Any help is highly appreciated.
Best regards,
Bahman

There is not enough information here to make good recommendations. Changing some random settings is not going to help unless you understand where the performance issues are coming from. Possible problems:

  • the rcb balancing requires the tiled communication pattern which has more overhead than the default.
  • you use a large number of CPUs. You may already be beyond the strong scaling limit.
  • there is no information about the system geometry. For most adsorption problems a large part of imbalance issues can be addressed by using a “processors * * 1” command. No balance command needed.
  • the load balancing algorithm by default uses atom counts as weight, but many other options to measure imbalance exist.

Changing the frequency of the rebalancing is a minor concern in this context.
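To make the last two bullets concrete, here is a minimal sketch; the fix ID and the numerical values are placeholders, and it assumes the surface normal (slab/vacuum direction) of your system is along z:

```
# No domain splitting along z (must come before the simulation box is created):
processors * * 1

# If explicit balancing is still needed, a shift-style balancer works with the
# default brick communication and can weight by measured time instead of atom count:
fix bal all balance 1000 1.1 shift xy 20 1.1 weight time 0.8
```

The shift balancer avoids the tiled-communication overhead mentioned in the first bullet.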

  • the rcb balancing requires the tiled communication pattern which has more overhead than the default.
    Actually, it seems that the rcb balancing reduces the computation time here by a factor of about 3.

  • you use a large number of CPUs. You may already be beyond the strong scaling limit.
    I have to use 9000 MPI tasks (125 nodes), arranged as 90 × 100 × 1, to run the model within 4 hours.

  • there is no information about the system geometry. For most adsorption problems a large part of imbalance issues can be addressed by using a “processors * * 1” command. No balance command needed.

I have checked that again. Without balancing, the computation time is about 3 times higher, as mentioned.

  • the load balancing algorithm by default uses atom counts as weight, but many other options to measure imbalance exist.

Changing the frequency of the rebalancing is a minor concern in this context.
I agree.

How about using another pair_style here? It seems that the hybrid Tersoff/LJ combination could also be modified. Right?

This is rather useless information. Without having a proper strong scaling plot and some estimate of your parallel efficiency, those numbers have no meaning (except for you alone). They could be indicative of a good performance or a bad performance. Without the context nobody else can tell.

I am confused. The choice of pair style and potential should be determined by the science, not by the calculation speed. Furthermore, I don’t understand why you are running such a huge system when you haven’t even figured out the setup and the reason for the load imbalance that seems to be limiting your performance. This can be investigated much more efficiently with a much smaller system. For certain, none of the LAMMPS developers will take a closer look if they have to run such large systems, and none of us has the time to rebuild your input for better testing.

Not to mention that your input file is an example of how variables can be “abused”. While that may be convenient for you, it makes debugging a major pain, since one always has to look up, or try to figure out, which value is used where if one wants to assess such an input file without running it and getting a log file with all the expansions.
As far as I can see, you have already wasted a substantial amount of CPU resources in a rather unproductive fashion, and from where I stand there is very little I can do at this point beyond the suggestions I have already made.