Hi. I am running a system of bead-spring polymers with 280000 beads. The pair style is lj/cut/opt with a cutoff of 1.12246 and a skin of 0.37754. The system is 3d-periodic; I am using NVE integration with a timestep of 0.01 and a Langevin thermostat, and I output thermo quantities every 100 timesteps. My neighboring scheme is set by “neigh_modify every 1 delay 0 check yes”. The only non-standard thing is that I disable the LJ part of the FENE bonds and use “special_bonds 1 1 1”.
The strange thing (and this is by no means a complaint, just something I’d like to understand) is that parallel performance (with the 21 May 08 version of LAMMPS) is markedly superlinear. That is, the system consistently runs more than 8 times faster on 64 processors in a 4x4x4 grid than it does on 8 processors in a 2x2x2 grid. The ends of the log files are as follows:
8 procs:
Loop time of 109665 on 8 procs for 1500000 steps with 280000 atoms
Pair time (%) = 35196.4 (32.0944)
Bond time (%) = 9114.23 (8.31096)
Neigh time (%) = 38935.1 (35.5036)
Comm time (%) = 10513.6 (9.58701)
Outpt time (%) = 41.1863 (0.0375564)
Other time (%) = 15864.7 (14.4665)
Nlocal: 35000 ave 35044 max 34957 min
Histogram: 1 0 1 2 0 2 0 0 0 2
Nghost: 9942.62 ave 9988 max 9886 min
Histogram: 1 1 1 0 0 1 1 0 1 2
Neighs: 199424 ave 200094 max 198954 min
Histogram: 1 1 3 0 1 0 0 0 1 1
Total # of neighbors = 1595393
Ave neighs/atom = 5.69783
Ave special neighs/atom = 1.99886
Neighbor list builds = 375159
Dangerous builds = 0
and
64 procs:
Loop time of 7070.11 on 64 procs for 1500000 steps with 280000 atoms
Pair time (%) = 1728.33 (24.4456)
Bond time (%) = 417.089 (5.89932)
Neigh time (%) = 2329.85 (32.9534)
Comm time (%) = 1314.27 (18.5891)
Outpt time (%) = 23.5722 (0.333407)
Other time (%) = 1257 (17.7791)
Nlocal: 4375 ave 4411 max 4323 min
Histogram: 2 2 2 5 7 10 19 8 4 5
Nghost: 2704.72 ave 2759 max 2671 min
Histogram: 5 7 12 13 8 8 6 2 1 2
Neighs: 24940.1 ave 25412 max 24456 min
Histogram: 2 1 8 9 9 16 9 2 7 1
Total # of neighbors = 1596167
Ave neighs/atom = 5.7006
Ave special neighs/atom = 1.99886
Neighbor list builds = 375150
Dangerous builds = 0
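As a quick cross-check that both runs are doing the same amount of work, here is a small Python sketch. The numbers are copied from the two log tails above; the dictionary keys and variable names are my own and are not LAMMPS output fields.

```python
# Values copied from the two log tails above; the key names are my own labels.
runs = {
    8:  {"atoms": 280000, "steps": 1500000, "nlocal_ave": 35000,
         "total_neighs": 1595393, "builds": 375159},
    64: {"atoms": 280000, "steps": 1500000, "nlocal_ave": 4375,
         "total_neighs": 1596167, "builds": 375150},
}

summary = {}
for procs, r in runs.items():
    summary[procs] = {
        # Should match the reported "Nlocal ... ave" value.
        "atoms_per_proc": r["atoms"] / procs,
        # Average number of timesteps between neighbor list rebuilds.
        "steps_per_build": r["steps"] / r["builds"],
        # Should match the reported "Ave neighs/atom" value.
        "neighs_per_atom": r["total_neighs"] / r["atoms"],
    }
    print(procs, summary[procs])
```

Both runs rebuild the neighbor list about every 4 steps and hold about 5.7 neighbors per atom, so the total work looks essentially identical; only the per-processor share differs.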
So, for 8 times the number of processors, the problem runs more than 15 times as fast. How can this be? The ratios of the various components of the run are (time on 8 / time on 64):
Pair: 35196.4/1728.33 (about 20!)
Bond: 9114.23/417.089 (> 20!)
Neigh: 38935.1/2329.85 (~ 17)
Comm: 10513.6/1314.27 (~ 8)
Other: 15864.7/1257 (~ 13).
So, the parallelization of Comm seems to be ideal, but for pair, bond and neigh it’s somehow “better than ideal” by a factor of at least two, and Other also scales better than ideal. This is certainly impressive, but … any ideas what is going on here? Is this a quirk of the “opt” version of lj/cut?
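For reference, the speedup arithmetic can be reproduced with a few lines of Python. This is just a sketch: the timings are copied from the log tails above, and "efficiency" here is my own shorthand for the measured speedup divided by the ideal factor of 64/8 = 8.

```python
# Timings in seconds, copied from the two log tails above.
t8 = {"Loop": 109665.0, "Pair": 35196.4, "Bond": 9114.23,
      "Neigh": 38935.1, "Comm": 10513.6, "Other": 15864.7}
t64 = {"Loop": 7070.11, "Pair": 1728.33, "Bond": 417.089,
       "Neigh": 2329.85, "Comm": 1314.27, "Other": 1257.0}

ideal = 64 / 8  # ideal speedup when going from 8 to 64 processors

# Speedup per category: time on 8 procs divided by time on 64 procs.
speedup = {name: t8[name] / t64[name] for name in t8}
# Values above 1.0 mean better-than-ideal (superlinear) scaling.
efficiency = {name: s / ideal for name, s in speedup.items()}

for name in t8:
    print(f"{name:5s} speedup = {speedup[name]:6.2f}   "
          f"efficiency = {efficiency[name]:4.2f}")
```

Any category with an efficiency above 1.0 is scaling superlinearly; an efficiency near 1.0 is the ideal case.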