[lammps-users] Strange supralinear parallel performance

Hi. I am running a system of bead-spring polymers with 280000 beads. The pair style is lj/cut/opt with a cutoff of 1.12246 and a neighbor skin of 0.37754. The system is 3d-periodic; I am using NVE integration with a timestep of 0.01 and a Langevin thermostat, and I output thermo quantities every 100 timesteps. My neighboring scheme is set by “neigh_modify every 1 delay 0 check yes”. The only non-standard thing is that I disable the LJ part of the FENE bonds and use “special_bonds 1 1 1”.
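For concreteness, the relevant input lines look roughly like the sketch below (the pair/bond coefficients, temperature, damping, and seed are placeholders rather than my production values, and the change that drops the LJ part of bond_style fene is not shown):

  units           lj
  atom_style      bond
  pair_style      lj/cut/opt 1.12246
  pair_coeff      * * 1.0 1.0                 # epsilon, sigma (placeholder values)
  bond_style      fene
  bond_coeff      * 30.0 1.5 1.0 1.0          # K R0 epsilon sigma (placeholders; LJ part disabled separately)
  special_bonds   1 1 1                       # pair style also computes the 1-2 interactions
  neighbor        0.37754 bin
  neigh_modify    every 1 delay 0 check yes
  fix             1 all nve
  fix             2 all langevin 1.0 1.0 1.0 482793   # Tstart Tstop damp seed (placeholders)
  timestep        0.01
  thermo          100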

The strange thing (and this is by no means a complaint, just something I’d like to understand) is that parallel performance (with the 21 May 08 version of LAMMPS) is markedly supralinear. That is, the system consistently runs more than 8 times faster on 64 processors in a 4x4x4 grid than it does on 8 processors in a 2x2x2 grid. The ends of the log files are as follows:

8 procs:

Loop time of 109665 on 8 procs for 1500000 steps with 280000 atoms

Pair  time (%) = 35196.4 (32.0944)
Bond  time (%) = 9114.23 (8.31096)
Neigh time (%) = 38935.1 (35.5036)
Comm  time (%) = 10513.6 (9.58701)
Outpt time (%) = 41.1863 (0.0375564)
Other time (%) = 15864.7 (14.4665)

Nlocal: 35000 ave 35044 max 34957 min
Histogram: 1 0 1 2 0 2 0 0 0 2
Nghost: 9942.62 ave 9988 max 9886 min
Histogram: 1 1 1 0 0 1 1 0 1 2
Neighs: 199424 ave 200094 max 198954 min
Histogram: 1 1 3 0 1 0 0 0 1 1

Total # of neighbors = 1595393
Ave neighs/atom = 5.69783
Ave special neighs/atom = 1.99886
Neighbor list builds = 375159
Dangerous builds = 0

and

64 procs:

Loop time of 7070.11 on 64 procs for 1500000 steps with 280000 atoms

Pair  time (%) = 1728.33 (24.4456)
Bond  time (%) = 417.089 (5.89932)
Neigh time (%) = 2329.85 (32.9534)
Comm  time (%) = 1314.27 (18.5891)
Outpt time (%) = 23.5722 (0.333407)
Other time (%) = 1257 (17.7791)

Nlocal: 4375 ave 4411 max 4323 min
Histogram: 2 2 2 5 7 10 19 8 4 5
Nghost: 2704.72 ave 2759 max 2671 min
Histogram: 5 7 12 13 8 8 6 2 1 2
Neighs: 24940.1 ave 25412 max 24456 min
Histogram: 2 1 8 9 9 16 9 2 7 1

Total # of neighbors = 1596167
Ave neighs/atom = 5.7006
Ave special neighs/atom = 1.99886
Neighbor list builds = 375150
Dangerous builds = 0

So, for 8 times the number of processors, the problem runs more than 15 times as fast. How can this be? The ratios of the various components of the run are (time on 8 / time on 64):

Pair: 35196.4/1728.33 (about 20!)
Bond: 9114.23/417.089 (> 20!)
Neigh: 38935.1/2329.85 (~ 17)
Comm: 10513.6/1314.27 (~ 8)
Other: 15864.7/1257 (~ 13).

So, the parallelization of Comm seems to be ideal (and even Other is better than linear), but for Pair, Bond, and Neigh it’s somehow “better than ideal” by a factor of at least two. This is certainly impressive, but … any ideas what is going on here? Is this a quirk of the “opt” version of lj/cut?

Robert,

> So, the parallelization of Comm seems to be ideal (and even Other is
> better than linear), but for Pair, Bond, and Neigh it's somehow "better
> than ideal" by a factor of at least two. This is certainly impressive,
> but ... any ideas what is going on here? Is this a quirk of the "opt"
> version of lj/cut?

There are a lot of possible reasons. You didn't give any description
of the hardware/software and interconnect you are using. I would
speculate that you have hardware with comparatively low memory
bandwidth per PE, e.g. an Intel dual-processor quad-core machine. On
such a machine you can see significant speedups when your problem set
fits better into the CPU cache, and given the system size you
describe this is quite possible.

Let's make a simple, back-of-the-envelope check:

With 280000 beads and an average of 5.7 neighbors per bead, the
memory needed for atom positions (3 doubles @ 8 bytes per bead) per
PE, counting each bead plus the ~5.7 neighbor positions it touches
through the neighbor list (a factor of about 6.7), would be:

for 8 PE:
"local" atoms: 280000/8 * 3 * 8 bytes = ~820 kB
local + neighbor accesses: 6.7 * 820 kB = ~5.5 MB

for 64 PE:
"local" atoms: 280000/64 * 3 * 8 bytes = ~103 kB
local + neighbor accesses: 6.7 * 103 kB = ~690 kB

Now, an Intel Xeon E5430 quad-core CPU has a total
of 6144 kB of L2 cache, or roughly 768 kB per PE.

Assuming that fast access to the atom positions is the dominant load
on the memory bus, you can see that for 8 PE you are clearly outside
what would fit into the L2 cache, whereas for 64 PE the working set
roughly fits. Since LAMMPS uses neighbor lists, you have rather
irregular memory accesses, which benefit the most when the whole data
set, or a large part of it, fits into cache.

A nice piece of software that lets you test and visualize this kind
of effect can be found at:
http://www.cs.inf.ethz.ch/cops/software/

Please also note that not all parts of the calculation parallelize
equally well. If you do more systematic scaling tests, you should see
that components like communication and neighbor-list builds can
interfere with scaling, so for larger numbers of processors it can be
faster to, e.g., use a larger "skin" and rebuild neighbor lists less
often, along the lines of the snippet below.
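For example (the numbers here are purely illustrative, not tuned
values, and should be checked against the "Dangerous builds" count in
the log):

  neighbor        0.8 bin                      # larger skin than the 0.37754 used above (illustrative)
  neigh_modify    every 5 delay 0 check yes    # test for a rebuild only every 5 steps; check yes rebuilds only when needed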

Hope this helps,
    Axel.

P.S.: One more remark on Intel quad-cores. On our InfiniBand cluster
it turns out that we can run LAMMPS fastest by using only half the
CPU cores per node. Due to the way those CPUs are built (two cores
share one L2 cache), this effectively doubles the cache per PE and
reduces contention for memory and communication bandwidth on the
local buses (you share the memory and InfiniBand bandwidth per node
among only 4 tasks instead of 8), and thus shrinks some of the
non-parallelizable component. If you can use CPU affinity on top of
that, you gain an additional small speedup.

I've seen 10% or so super-linear speed-up due to
running a smaller problem (per processor) in cache, but
not 100%.

Steve

Thanks, Axel and Steve. So, hardware rears its beautiful-in-the-eye-of-the-beholder head. The machine I’m running on has dual-processor Xeon nodes (3.6 GHz) and is optimized for large parallel jobs (Steve: it’s tbird), so it’s not the quad-core issue. Steve, if you don’t think it’s a cache issue, what do you think it is? I will be running a lot of long jobs at this system size (280k) and want to minimize total node-hours (for a reasonable Nproc such as 64). Could it be a quirk of lj/cut/opt? The docs say the improvement from the “opt” styles depends on system size and processor. I’m happy to try different things and report back if that would be helpful to others, but has anyone determined optimal parameters for running bead-spring systems on tbird? It’s not on the benchmarks page.

Don't know - I usually don't think too hard when LAMMPS runs
faster than expected. You could test if the "opt" pair style
makes a difference by repeating your 2 runs with the vanilla
style.
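
E.g., the only change needed would be the pair_style line (same
cutoff as before):

  pair_style      lj/cut 1.12246               # plain style instead of lj/cut/opt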

Steve