Low CPU Utilization (50%) in LAMMPS Simulation – Optimization Tips?

The top program does not measure performance or efficiency; it merely monitors occupancy (i.e., how busy each core is) and is thus not a suitable tool to assess performance.

These are not relevant here at all.

To optimize computational efficiency, you first have to measure it. Before that, you have to evaluate the capability of your hardware; and for people from the outside to give you advice, you need a point of reference.

For that we need to know exactly what hardware you have and what OS you use. For example, I have an AMD Ryzen 7 7840HS (with 8 cores and 16 threads) and I am running Fedora Linux 42.
This is a single-processor desktop machine, not a cluster. If you run on a cluster, we also need to know how many nodes you are using for your test and what kind of interconnect those compute nodes have.
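On Linux, this information can be collected with standard tools, for example:

$ lscpu | grep -E 'Model name|Socket|Core|Thread'   # CPU model, sockets, cores, threads
$ cat /etc/os-release | head -n 2                   # OS name and version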

To establish a point of reference, you should first run one of the examples in the “bench” folder of the LAMMPS distribution. In your case the in.rhodo.scaled input seems appropriate. With that you can establish a parallel scaling performance profile by running a sequence of parallel runs with a varying number of processors (e.g. in my case with 1, 2, 4, 8, 16) to get parallel efficiency numbers, just like it was demonstrated for in.lj in this post.
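As a minimal sketch (assuming an MPI-enabled LAMMPS binary called lmp, run from the “bench” folder; the log file names are my choice), such a sequence could look like this:

$ for np in 1 2 4 8 16; do
>     mpirun -np ${np} lmp -in in.rhodo.scaled -log log.rhodo.${np}procs
> done
$ grep 'Performance:' log.rhodo.*procs

The parallel efficiency on N processors is then t(1) / (N * t(N)), where t(N) is the loop time of the run on N processors.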

The ‘lj’ and ‘rhodo’ examples are “dense” systems and have an optimal shape for the domain decomposition parallelization of LAMMPS, so load balancing issues won’t happen. In the referenced post, I also increase the system size by replication to make certain there is no shortage of work units.
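With the *.scaled bench inputs, this replication is controlled from the command line through the x, y, and z index variables. As a sketch, a 2x2x2 replication turns the 32,000-atom rhodopsin system into the 256,000-atom system used below:

$ mpirun -np 8 lmp -v x 2 -v y 2 -v z 2 -in in.rhodo.scaled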

The Matom-step/s numbers are a somewhat system-size-independent measure of performance, but they are affected by the details of the force field and the hardware in use. The ‘rhodo’ benchmark uses a slightly different pair style with a shorter cutoff and a rather coarse PPPM convergence. So to get a better comparison, you should adjust those settings to match the choices in your input.
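For illustration, those settings are controlled by two lines in the benchmark input; the values below are placeholders that you would replace with whatever your production input uses:

# placeholder values, adjust to match your own input:
pair_style    lj/charmm/coul/long 10.0 12.0
kspace_style  pppm 1.0e-5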

I just did this, and due to using shorter cutoffs and a less accurate PPPM, the benchmark is about twice as fast (but also less accurate) than a run using your settings (~1.4 Matom-step/s vs. ~700 katom-step/s):

$ grep -A3 Loop  rhodo.modified rhodo.scaled 
rhodo.modified:Loop time of 36.6204 on 8 procs for 100 steps with 256000 atoms
rhodo.modified-
rhodo.modified-Performance: 0.472 ns/day, 50.862 hours/ns, 2.731 timesteps/s, 699.065 katom-step/s
rhodo.modified-98.7% CPU use with 8 MPI tasks x 1 OpenMP threads
--
rhodo.scaled:Loop time of 17.7354 on 8 procs for 100 steps with 256000 atoms
rhodo.scaled-
rhodo.scaled-Performance: 0.974 ns/day, 24.632 hours/ns, 5.638 timesteps/s, 1.443 Matom-step/s
rhodo.scaled-99.4% CPU use with 8 MPI tasks x 1 OpenMP threads

For your convenience, I am adding the modified files here:
data.rhodo-modified (6.0 MB)
in.rhodo.modified (721 Bytes)

IMPORTANT NOTE: In both of these cases the occupancy as reported by top or htop is exactly the same.

Another important step is to compare the MPI task timing breakdown. Mine is:

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 12.886     | 12.959     | 13.026     |   1.6 | 73.07
Bond    | 0.45034    | 0.47828    | 0.50014    |   2.7 |  2.70
Kspace  | 0.95031    | 0.99497    | 1.0463     |   3.9 |  5.61
Neigh   | 2.4159     | 2.4172     | 2.4189     |   0.1 | 13.63
Comm    | 0.22548    | 0.22841    | 0.23087    |   0.4 |  1.29
Output  | 0.00027302 | 0.00028145 | 0.00033284 |   0.0 |  0.00
Modify  | 0.6035     | 0.60615    | 0.60797    |   0.2 |  3.42
Other   |            | 0.05123    |            |       |  0.29

This looks quite normal. There is no significant load imbalance, but the Kspace time is very small. There is possibly some overlap of computation and communication, which can be avoided by adding the timer sync command at the beginning of the input. This reduces performance, but gives a more accurate timing breakdown.
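That is a single line in the LAMMPS input, e.g.:

# synchronize MPI tasks before each timed section of a timestep
timer sync

And indeed this is the case: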

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 16.367     | 16.521     | 16.663     |   2.6 | 45.36
Bond    | 0.38328    | 0.41299    | 0.4264     |   2.0 |  1.13
Kspace  | 12.707     | 12.707     | 12.707     |   0.0 | 34.89
Neigh   | 3.6562     | 3.6562     | 3.6562     |   0.0 | 10.04
Comm    | 0.25294    | 0.25586    | 0.26045    |   0.4 |  0.70
Output  | 0.00023918 | 0.00024607 | 0.00029164 |   0.0 |  0.00
Modify  | 2.3969     | 2.418      | 2.444      |   1.1 |  6.64
Sync    | 0.20918    | 0.37697    | 0.54099    |  18.5 |  1.03
Other   |            | 0.07507    |            |       |  0.21

This looks more like the expected distribution.

Since I have hyper-threading enabled on my machine, I can try out whether using 16 MPI tasks or 8 MPI tasks with 2 OpenMP threads each is faster.
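A sketch of the two corresponding runs (assuming LAMMPS was compiled with the OPENMP package; -sf omp switches to the multi-threaded styles and -pk omp 2 sets two threads per MPI task):

$ mpirun -np 16 lmp -in in.rhodo.modified -log log.rhodo.16mpi
$ mpirun -np 8 lmp -sf omp -pk omp 2 -in in.rhodo.modified -log log.rhodo.omp

The performance data for those two runs are: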

rhodo.modified:Loop time of 33.4125 on 16 procs for 100 steps with 256000 atoms
rhodo.modified-
rhodo.modified-Performance: 0.517 ns/day, 46.406 hours/ns, 2.993 timesteps/s, 766.180 katom-step/s
rhodo.modified-97.8% CPU use with 16 MPI tasks x 1 OpenMP threads
--
rhodo.omp:Loop time of 29.2457 on 16 procs for 100 steps with 256000 atoms
rhodo.omp-
rhodo.omp-Performance: 0.591 ns/day, 40.619 hours/ns, 3.419 timesteps/s, 875.343 katom-step/s
rhodo.omp-158.5% CPU use with 8 MPI tasks x 2 OpenMP threads

As you can see, there is a small improvement (about 10% with 16 MPI tasks and about 25% with 8 MPI tasks x 2 OpenMP threads), so the gain is larger with MPI+OpenMP.

At this point you can start measuring and documenting the performance of your input deck on your hardware and see whether your performance numbers compare well enough with the reference. Only with that information (properly summarized and presented) may we be able to provide some meaningful suggestions.
