Understanding performance metrics

I’m running a 3 Nov 2022 build on an AMD 8-core (16-thread) machine at home, as well as on the SDSC Expanse supercomputer with more cores available. At SDSC, I’ve run with 16, 32 and 64 cores. The efficiency in each run is about the same, but so is the total running time, so I cannot tell whether adding more cores really makes any difference.

Here is the log summary for 32 cores:

Loop time of 35.5828 on 32 procs for 10000 steps with 17291 atoms

Performance: 121406.855 tau/day, 281.034 timesteps/s
98.7% CPU use with 32 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 0.015826   | 0.11495    | 0.25516    |  22.3 |  0.32
Bond    | 0.012184   | 0.15111    | 0.38878    |  31.8 |  0.42
Neigh   | 28.074     | 28.231     | 28.397     |   2.0 | 79.34
Comm    | 4.7558     | 5.7337     | 6.4104     |  21.4 | 16.11
Output  | 0.16269    | 0.16417    | 0.16804    |   0.3 |  0.46
Modify  | 0.03264    | 0.56841    | 1.4784     |  63.3 |  1.60
Other   |            | 0.6192     |            |       |  1.74

Nlocal:        540.344 ave        1422 max          21 min
Histogram: 8 0 8 8 0 0 0 2 2 4
Nghost:        8376.25 ave       12557 max        4886 min
Histogram: 8 0 0 8 8 0 0 0 0 8
Neighs:              0 ave           0 max           0 min
Histogram: 32 0 0 0 0 0 0 0 0 0

The CPU use is about 98% in all three cases, and the “total wall times” are within a few percent of each other.

I note that the simulation involves interactions on the scale of 1 LJ unit, except for a particular set of pairs whose interaction scale is around 15 LJ units. The Neigh category seems to use the most time, but if that were the bottleneck I would expect the CPU use figure to go down as cores are added.

So I don’t understand how 16, 32, or 64 cores can each be fully utilized, yet take the same total time to run.

I’m missing something.

Please provide the corresponding outputs from all affected runs and also the hardware specs of the machine this was run on.

Please note that the %CPU figure only measures how much the process was able to use the CPU. It is meant to detect "parasitic" processes on the compute node that impact your access to the CPU, and to quantify multi-threading efficiency (if running with 2 threads it should show more than 100%, ideally close to 200%, provided all major code paths have been properly multi-threaded).

I'll use the runs from the San Diego Supercomputer Center (SDSC), since their CPU use numbers were all above 98%.

output of “cat /etc/os-release”

NAME="Rocky Linux"
VERSION="8.7 (Green Obsidian)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="8.7"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Rocky Linux 8.7 (Green Obsidian)"
ANSI_COLOR="0;32"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:rocky:rocky:8:GA"
HOME_URL="https://rockylinux.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
ROCKY_SUPPORT_PRODUCT="Rocky-Linux-8"
ROCKY_SUPPORT_PRODUCT_VERSION="8.7"
REDHAT_SUPPORT_PRODUCT="Rocky Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.7"

output of “head /proc/cpuinfo”:

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 23
model           : 49
model name      : AMD EPYC 7742 64-Core Processor
stepping        : 0
microcode       : 0x8301055
cpu MHz         : 3379.228
cache size      : 512 KB
physical id     : 0
siblings        : 64
core id         : 0
cpu cores       : 64
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 16
wp              : yes

The log summary for 32 cores was reported earlier. Log summary for 16 cores:

98.6% CPU use with 16 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 0.11912    | 0.21338    | 0.2946     |  11.5 |  0.47
Bond    | 0.090161   | 0.27434    | 0.50137    |  31.3 |  0.61
Neigh   | 37.477     | 37.549     | 37.607     |   0.6 | 83.03
Comm    | 4.3613     | 5.3335     | 6.1905     |  30.4 | 11.79
Output  | 0.12688    | 0.12819    | 0.13137    |   0.3 |  0.28
Modify  | 0.36059    | 1.075      | 1.9496     |  61.5 |  2.38
Other   |            | 0.6486     |            |       |  1.43

Nlocal:        1080.69 ave        1987 max         450 min
Histogram: 8 0 0 0 0 0 1 3 3 1
Nghost:        9966.25 ave       12205 max        7475 min
Histogram: 4 4 0 0 0 0 0 0 0 8
Neighs:              0 ave           0 max           0 min
Histogram: 16 0 0 0 0 0 0 0 0 0

log output for 64 cores:

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 0.0083681  | 0.063756   | 0.24284    |  23.5 |  0.21
Bond    | 0.0034937  | 0.081577   | 0.28525    |  28.7 |  0.26
Neigh   | 22.744     | 23.178     | 23.484     |   4.5 | 75.27
Comm    | 5.7893     | 6.4061     | 6.8501     |  10.3 | 20.80
Output  | 0.17179    | 0.17392    | 0.17988    |   0.4 |  0.56
Modify  | 0.017217   | 0.29176    | 1.0432     |  56.1 |  0.95
Other   |            | 0.5966     |            |       |  1.94

Nlocal:        270.172 ave        1013 max           0 min
Histogram: 32 0 2 7 15 0 1 1 5 1
Nghost:        6837.97 ave       12849 max        2759 min
Histogram: 8 4 20 0 8 16 0 0 0 8
Neighs:              0 ave           0 max           0 min
Histogram: 64 0 0 0 0 0 0 0 0 0

Please note that you have a rather small system, so you will only see good parallel scaling for "expensive" force fields.

If you look at the average time spent in the "Pair" and "Bond" sections and compare, you can see that you get about 90% parallel efficiency going from 16 to 32 MPI processes and about 80% parallel efficiency when going to 64 MPI processes.
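For example, taking the average "Pair" times from your logs above: (0.21338 / 0.11495) / 2 ≈ 0.93 going from 16 to 32 MPI processes, and (0.21338 / 0.063756) / 4 ≈ 0.84 going from 16 to 64; the "Bond" averages give similar ratios.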

However, your timings are dominated (about 95%) by the time spent on constructing the neighbor lists and communication (which will mostly be due to the neighbor list build as well). That indicates there is something rather unusual with your system setup.

Thanks very much for those comments. What's perhaps unusual about the setup: it's a model of chromosomes using a Rouse framework with FENE bonds replacing harmonic bonds. All beads interact via LJ potentials with ranges 1-2 LJ units, except for the link between the centromere (one per chromosome) and the spindle pole body (fixed to the cell nuclear envelope), whose range can be of order 8-15 LJ units. Without something like the second of the following two statements:

neighbor 0.5 multi
comm_modify mode single cutoff 20.0

the simulation crashes with “missing bond” messages. I’ve read the comm_modify documentation; it looks like the statement above applies to all pairs, not just the one in question, and I can imagine that would make the run very inefficient, but I could not figure out the parsing of the comm_modify statement so as to restrict the longer cutoff to specific pairs. Or maybe there’s another way to do this.

Yes, there may be.

About 10 years ago, I was helping people in the group of Cristian Micheletti at SISSA in Trieste with a very similar problem. The solution was to not use a bond potential for those long distance bonds but to emulate those with a pair style.
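A rough sketch of what that could look like (the file name, atom IDs, and coefficients below are placeholders; check the pair_style list documentation for the exact file format):

pair_style   hybrid/overlay lj/cut 2.5 list tethers.txt 20.0
pair_coeff   * * lj/cut 1.0 1.0
pair_coeff   * * list

# tethers.txt contains one line per tethered pair of atom IDs, e.g.:
# 15 4382 harmonic 10.0 12.0

If I remember correctly, the listed pairs still have to stay within the communication cutoff (so a comm_modify cutoff may still be needed), but the regular neighbor lists no longer have to cover the long tether distance.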


I misspoke on this particular bond. I used a "class2" bond rather than FENE, but it's still a bond. I probably could use a pair potential instead, as long as it minimally mocks up a semi-rigid microtubule. But doesn't the long range (compared to the other scales) still require special handling?

Please read the documentation for pair style list that I have linked in my previous message.

Often the overhead of building the neighbor list can also be reduced by increasing the neighbor "skin" distance, see neighbor command — LAMMPS documentation. A larger skin means the list is rebuilt less often, though the force computation becomes somewhat less efficient since it has to scan more neighbors. In your case, though, the neighbor list build time dominates.
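For example (the values here are arbitrary):

neighbor     2.0 multi                     # larger skin than the 0.5 above -> fewer rebuilds
neigh_modify delay 10 every 2 check yes    # consider rebuilding only every 2 steps, at least 10 steps apart, and only if needed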

Normally I ignore the “% CPU use” output, what really matters is simulation rate and parallel efficiency.

Every single one of your logs says there are zero neighbours in your neighbourlists. Is that meant to happen?

I hadn’t seen that. The full LAMMPS script proceeds by steps:

fix             1 particle nve/limit 0.0001
run 10000
unfix 1
fix             1 particle nve/limit 0.005
run 10000
etc

After the first run, the neighbor list histogram is non-zero, but after subsequent runs, the neighbor log shows zero, as you have noticed. The results themselves seem ‘reasonable’.

Try running for a short time with neigh_modify once yes, and if you get the same trajectory as before, then … something very unusual is happening with your system.

Thanks for that input. Some additional information that I discovered:

The zero neighbor summary occurs, not at the beginning, but after a second run command, in particular, after loading a new set of interactions. I’ve stripped down as much as possible in this interactions file to isolate the issue, and discovered that using hybrid pair_style triggers the zero neighbor output, whereas a single pair_style does not. Here are the code fragments (I’m using lj/cut and lj/sf just to illustrate). Original:

pair_style hybrid lj/cut 5.73 lj/sf 5.73
pair_modify shift yes
pair_coeff 1 1 lj/cut 1 1 1
pair_coeff * 2 lj/cut 1 1 1
pair_coeff 1 3 lj/cut 1 1 1
pair_coeff 2 3 lj/cut 1 1 1
pair_coeff * 4 lj/cut 1 1 1
pair_coeff * 5 lj/cut 0 1 1
pair_coeff * 6 lj/cut 0 1 1

pair_coeff 3 3 lj/sf 1 1 1

Modified code (single pair_style):

pair_style lj/cut 5.73
pair_modify shift yes
pair_coeff 1 1 1 1 1
pair_coeff * 2 1 1 1
pair_coeff 1 3 1 1 1
pair_coeff 2 3 1 1 1
pair_coeff * 4 1 1 1
pair_coeff * 5 0 1 1
pair_coeff * 6 0 1 1

pair_coeff 3 3 1 1 1

I haven’t checked to see whether pair interactions are evaluated properly in the hybrid scenario, only that this results in a zero neighbor summary.

I do appreciate you trying to provide “illustrative” code snippets, but where something as fundamental as neighbor listing is going wrong we need to be looking at the actual scripts you’re using. You might be right about it being pair_style hybrid – or it could be something completely different, or some edge case interaction.

In the meantime, you should try patching your simulations together with a data file – that is, use write_data (with the keyword nocoeff) at the end of a run with one pair style, end that script there, and then use read_data to start a new LAMMPS run with the changed pair style.
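Roughly like this (stage1.data is just a placeholder name):

# end of the first script, after the last run with the old pair style
write_data   stage1.data nocoeff

# second script: after the usual units / atom_style / boundary setup
read_data    stage1.data
pair_style   lj/cut 5.73
pair_modify  shift yes
# ... pair_coeff lines for the new stage ...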

These are very odd settings. First, you define a global cutoff at 5.73 length units, but then you use an individual cutoff of only 1 length unit. That just needlessly blows up the neighbor lists. If you need more ghost atoms, comm_modify cutoff is sufficient. Also, a cutoff of 1 length unit would be strange, even for reduced units, since the minimum of the potential is at 2^{\frac{1}{6}}.

The second strange entry is using pair style lj/sf (side note: the pair style is called lj/smooth officially in the docs) with the inner and outer radii of the smoothing around the cutoff set to be identical. That is like using lj/cut directly. So why bother?

Third, instead of computing no LJ interactions by setting epsilon to 0.0, why not use the “none” pair style to not have to compute anything?

It is important to not overinterpret the neighbor list histogram info. For starters, it only shows the data for MPI rank 0, so it can differ quite a bit from what individual ranks see in large parallel calculations of a not fully homogeneous system. But one also has to understand the mechanism. The output uses the functions get_nneigh_half() and get_nneigh_full() in the Neighbor class, which look for a suitable regular (i.e. non-skip) pairwise neighbor list. Such a list is not always found, and then there is no output. For pair style hybrid, there may be no such neighbor list request when none of the sub-styles covers all pairs of atom types (a sub-style covering all pairs usually only happens with hybrid/overlay); instead an "internal" neighbor list is generated that is a superset of the sub-style neighbor lists, but it will not match the search requirements.
Also, there will be no neighbor list statistics if the neighbor list did not need to be updated during the run, i.e. if atoms don't move (much), as in solids.

In other words, there is not much meaning to the fact that you have zero neighbor list statistics output unless you are dead certain, based on the details of the implementation, that there should be some. It is only really authoritative if there is a single pair style and not pair style hybrid. Its interpretation is certainly not easy without knowing the exact pair style command(s) and the corresponding neighbor list request summary from the beginning of the run output.


Thanks, I didn’t know that. Are the Nlocal and Nghost histograms also from proc 0’s data?

Those are histograms of the data across all MPI ranks. You can confirm this by running with just one MPI rank: then only one histogram slot is occupied, and ave, min, and max are all the same.

Thanks very much for your comments. I am tracking down ways to improve performance considerably.
But I do have a question:

Third, instead of computing no LJ interactions by setting epsilon to 0.0, why not use the “none” pair style to not have to compute anything?

The LAMMPS documentation on pair_style none states:

Using a pair style of none means that any previous pair style setting will be deleted and pairwise forces and energies are not computed… A pair style of none will also not request a pairwise neighbor list.

So it appears that if I want some pair interactions to be zero in some phase of the simulation, but not all, this is not the best option.

In that regard, I tried adding ‘none’ to a hybrid pair_style command, but got an error; then I omitted it from the hybrid list but had a line

pair_coeff 1 2 none

in the script. I got no error message, but I’m now concerned that this line essentially cancelled all non-zero pair_style specifications.

There is a big difference between pair_style none and pair_coeff 1 2 none for pair style hybrid.

That is how it is supposed to be. “none” is not an explicit pair style (“zero” would be), but rather the indication of the absence of a pair style. To better understand this, you need to understand how LAMMPS maps C++ class instances to commands.

First off, this can be easily tested for by constructing specific test inputs, computing forces with LAMMPS, writing them to a dump file, and then comparing them to manually computed forces.
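A minimal sketch of such a check (untested; the types, positions, and cutoffs are arbitrary):

units        lj
atom_style   atomic
region       box block 0 10 0 10 0 10
create_box   2 box
create_atoms 1 single 4.0 5.0 5.0 units box
create_atoms 2 single 5.2 5.0 5.0 units box
mass         * 1.0

pair_style   hybrid lj/cut 5.73 zero 5.73
pair_coeff   1 1 lj/cut 1.0 1.0
pair_coeff   2 2 zero
pair_coeff   1 2 none

# dump the per-atom forces at step 0 and compare to values computed by hand
dump         1 all custom 1 forces.dump id type fx fy fz
run          0

With pair_coeff 1 2 none the dumped forces should be exactly zero; replacing that line with pair_coeff 1 2 lj/cut 1.0 1.0 should reproduce the analytic LJ force at the chosen separation of 1.2.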

What this does (and that is documented behavior!) is exclude any pair style interactions between pairs of atoms where one is of type 1 and the other is of type 2. This is done by excluding those pairs from any of the neighbor lists, which makes it the most effective way to exclude them. When you set the Lennard-Jones epsilon to 0, the interactions are still computed; they just evaluate to 0, so no forces are added.
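For example, in the hybrid fragment you posted earlier, the two type combinations with epsilon set to 0 could simply be excluded:

pair_coeff * 5 none
pair_coeff * 6 none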

Thanks - this is very helpful. I'm now trying to understand 'none' with respect to bonds. I have a way around the long-scale tubule issue by confining the centromere within a sphere that has an attractive wall. The concept looks promising, but in order to implement it, I must first condition the system with the long-range bonds, then introduce the sphere and get rid of those bonds. This should eliminate the large-scale neighbor lists.

I’ve tried various combinations of bond_style none and bond_coeff none, but I get various error messages depending upon what I enter.

Basically I want to go from 3 class2 bond types to 2 FENE bond types plus nothing for the 3rd. Is there a relevant example somewhere for this?
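For concreteness, the kind of thing I am after looks roughly like this (the FENE coefficients are placeholders, and I am not sure this is the right construct):

bond_style   hybrid fene zero
bond_coeff   1 fene 30.0 1.5 1.0 1.0
bond_coeff   2 fene 30.0 1.5 1.0 1.0
bond_coeff   3 zero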