Rather than this vague description, you should give the exact command line that you are using and also a more specific description of the node hardware. Please note that the MEAM package has no threading support.
There is no info shown, so it is impossible to comment any further.
I see 93% of the time spent in “Pair” and 7% in Comm. That looks pretty good to me.
The MEAM pair style needs to do a forward and a reverse communication for custom data in every time step in addition to the forward and reverse communication for atoms required by the time integration.
How many atoms does your simulation contain?
There is also such a thing as too many processors. The only true way to find the most efficient setup for your problem is to run scaling benchmarks on your hardware with your input.
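A scaling benchmark can be as simple as rerunning the same input at several MPI task counts (e.g. something like `mpirun -np N lmp -in in.script -log log.N`; the exact executable name and input file are placeholders here) and comparing the reported performance. Assuming the standard `Performance:` line LAMMPS prints at the end of a run, a small sketch for collecting those numbers from the resulting logs:

```shell
# Hypothetical helper: for each log.N file (one per MPI task count N),
# print the task count and the reported timesteps/s so you can see
# where the scaling curve flattens out.
for log in log.*; do
  [ -e "$log" ] || continue          # skip if no log files are present
  n=${log#log.}                      # task count taken from the file name
  perf=$(grep -o '[0-9.]* timesteps/s' "$log" | head -1)
  echo "$n MPI tasks: $perf"
done
```

Pick the task count where doubling the processors stops giving a near-doubling of timesteps/s.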
I have a changeset for OpenMP support sitting on github, but I’d only recommend that if MPI is not (easily) possible, e.g. when using LAMMPS as a Python library. It also gives slightly different results, likely due to floating point operation order… “needs more work”, as they say. MPI in LAMMPS scales better anyway, up to a point.
Depending on your problem, you might also get away with reducing the cutoff radius (rc) parameter and save a few interactions, but absolutely test if this still gives satisfactory results (especially if you use 2NN MEAM).
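For reference, the MEAM cutoff is not set in the input script but via the `rc` keyword in the MEAM parameter file (the second file named in the `pair_coeff` command); check the pair_style meam documentation for your LAMMPS version. A hypothetical fragment of such a parameter file, with a made-up value:

```
rc = 4.5
```

Again: rerun a known reference case after any change to rc and confirm energies and forces are still acceptable before trusting production results.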
The same goes for increasing the time step: for some applications even 2-5 fs is still fine, which would shorten your simulation time proportionally.
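With metal units, the LAMMPS `timestep` command takes picoseconds, so a 2 fs step (assuming your system is stable at that size; verify with a short test run) would look like:

```
timestep 0.002   # metal units: 0.002 ps = 2 fs (default is 0.001 = 1 fs)
```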
Your comments were really helpful. I increased the dt from 1 fs to 5 fs and also minimised the system energy with the available commands, but I am afraid to change the rc. By the way, the speed result is very promising: it is now more than 3 times faster!
Performance: 3.333 ns/day, 7.201 hours/ns, 7.715 timesteps/s
98.3% CPU use with 432 MPI tasks x 1 OpenMP threads
MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total