Run time improvement

Dear LAMMPS experts,

I am writing to ask how I can improve the speed of my simulation.

I am using 13 nodes (each node with 48 cores) and 4 OpenMP threads to model my MEAM pair interactions. I have also checked my neighbor list updating as well as the CPU load balancing, as shown below:

neighbor 2.0 bin # 2.0 is the default skin distance for metal units
neigh_modify every 100 delay 0 check yes

fix 5 all balance 0 1.1 rcb

However, according to the time breakdown (below), it seems that communication between CPUs takes a long time. Any comments are highly appreciated.

[attached image: MPI task timing breakdown (did not display)]

Yours sincerely,

Bahman

Rather than this vague description, you should give the exact command line that you are using and also a more specific description of the node hardware. Please note that the MEAM package has no threading support.

There is no info shown, so it is impossible to comment any further.
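For reference, a plain MPI-only launch across your 13 x 48 cores would look something along these lines (the executable and file names here are only placeholders for whatever you actually use):

mpirun -np 624 lmp -in in.meam_run -log log.meam_run    # MPI only; no OpenMP or suffix flags needed for MEAM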

Dear Axel,
Thanks for noting that the MEAM package has no threading support. I have removed the OpenMP threads.

neighbor 2.0 bin
neigh_modify every 100 delay 0 check yes
fix 5 all balance 100 1.1 rcb

About the node specifications:
Each node has two sockets, each with a 24-core 2.1 GHz Intel Xeon Scalable Platinum 8160 processor.

The resulting time breakdown is as follows:

Performance: 0.987 ns/day, 24.310 hours/ns, 11.426 timesteps/s
98.3% CPU use with 624 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section | min time | avg time | max time |%varavg | %total
------------------------------------------------------------------------------------------
Pair | 77.161 | 81.363 | 84.638 | 14.2 | 92.97
Neigh | 0.049149 | 0.062124 | 0.07072 | 1.5 | 0.07
Comm | 2.7023 | 5.9559 | 10.14 | 52.4 | 6.81
Output | 0.0013012 | 0.001353 | 0.0015603 | 0.0 | 0.00
Modify | 0.082773 | 0.10993 | 0.14681 | 4.1 | 0.13
Other | | 0.02507 | | | 0.03

Best regards,
Bahman

I see 93% of the time spent in “Pair” and 7% in Comm. That looks pretty good to me.
The MEAM pair style needs to do a forward and a reverse communication for custom data in every time step in addition to the forward and reverse communication for atoms required by the time integration.

Thanks, Axel. Now I should increase the number of nodes, since I need to run 100 million steps (of 1 fs each!).

Best regards,
Bahman

How many atoms does your simulation contain?
There is also such a thing as too many processors. The only true way to know the most efficient setup for your problem is performing scaling benchmarks on your hardware with your input.
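A minimal sketch of such a benchmark (the executable and file names are again only placeholders) is to run a few thousand steps of the same input on a few node counts and compare the Performance: lines of the resulting logs:

mpirun -np 192 lmp -in in.meam_run -log log.04nodes    # 4 nodes x 48 cores
mpirun -np 336 lmp -in in.meam_run -log log.07nodes    # 7 nodes x 48 cores
mpirun -np 624 lmp -in in.meam_run -log log.13nodes    # 13 nodes x 48 cores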

I have a changeset for OpenMP support sitting on github, but I’d only recommend that if MPI is not (easily) possible, i.e. in Python library use. It also gives slightly different results, likely due to floating point operation order… “needs more work”, as they say. MPI in LAMMPS scales better anyway, up to a point.

Depending on your problem, you might also get away with reducing the cutoff radius (rc) parameter and saving a few interactions, but absolutely test whether this still gives satisfactory results (especially if you use 2NN MEAM).
The same goes for increasing the time step; for some applications even 2-5 fs is still fine, which would cut your simulation time by that factor.
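As an illustrative sketch (the values are not recommendations): in metal units the timestep is given in ps, and for MEAM the cutoff is set in the MEAM parameter file rather than in the input script, so the two changes would look roughly like this:

timestep 0.002    # 2 fs instead of the 0.001 ps (1 fs) default in metal units
# rc is changed in the MEAM parameter file, e.g. a line like "rc = 4.5" (value illustrative)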

Best,
Sebastian

Dear Sebastian,
Your comments were really helpful. I increased the dt from 1 fs to 5 fs and also minimized the system energy with the available commands, but I am afraid to change the rc. By the way, the speed result is very promising: it is now more than 3 times faster!
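For reference, the relevant commands were along these lines (the minimization tolerances shown are only indicative, not the exact values used):

min_style cg
minimize 1.0e-8 1.0e-10 10000 100000    # etol ftol maxiter maxeval (indicative values)
timestep 0.005    # 5 fs in metal units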

Performance: 3.333 ns/day, 7.201 hours/ns, 7.715 timesteps/s
98.3% CPU use with 432 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section | min time | avg time | max time |%varavg | %total
------------------------------------------------------------------------------------------
Pair | 2323.8 | 2447.4 | 2514.9 | 79.6 | 94.41
Neigh | 1.4433 | 1.7795 | 1.9976 | 7.6 | 0.07
Comm | 72.702 | 140.33 | 263.36 | 331.3 | 5.41
Output | 0.04639 | 0.047537 | 0.052279 | 0.4 | 0.00
Modify | 1.5387 | 2.3533 | 3.7024 | 25.0 | 0.09
Other | | 0.424 | | | 0.02

Yours sincerely,
Bahman