[lammps-users] Improving communication and building of neighbour lists for a dense and confined Lennard-Jones fluid

Hello Lammps users,

I study the static and dynamic properties of one or several polymers in various geometries (free space, slit, and cylinder) at different volume fractions of Lennard-Jones (LJ) particles (crowders) at a reduced temperature equivalent to T=300K. To span a range of volume fractions from 0 to 0.45 at a fixed set of other parameters, I change the number of crowders between 0 and ~250000. I set r_cut=1.6sigma (where sigma=1), so I have purely repulsive Lennard-Jones (WCA) interactions among all the particles (monomers and crowders). The polymerization number N is in the range 50-3000. A polymer chain can be heterogeneous, so its monomers can have sizes between a_mon=1sigma and 10sigma (hence different r_cut values for different monomers). All the crowders forming the fluid have size sigma=1. A typical system is run for 10^6 timesteps for equilibration and then for 10^8 timesteps for data sampling. A Langevin thermostat is used together with fix nve (see the attached input file and log file of a test simulation).
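
For context, here is a stripped-down sketch of the kind of model setup I mean, in LJ units (this is not the attached input; the box size, density, epsilon, Langevin damping, and seeds are placeholders, and the cutoff shown is the standard WCA value of 2^(1/6)sigma for sigma=1):

units           lj
atom_style      atomic            # the real chain needs atom_style bond; atomic keeps the sketch short
lattice         fcc 0.4           # placeholder crowder density
region          box block 0 10 0 10 0 10
create_box      1 box
create_atoms    1 box
mass            1 1.0
pair_style      lj/cut 1.12246    # 2^(1/6)*sigma cutoff for sigma=1
pair_modify     shift yes         # shift to zero at the cutoff, i.e. purely repulsive (WCA)
pair_coeff      1 1 1.0 1.0
velocity        all create 1.0 48291
fix             1 all nve
fix             2 all langevin 1.0 1.0 1.0 90127   # target T, damping, and seed are placeholders
timestep        0.005
run             1000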

I used the following setup for building the neighbour lists in the two test runs below:

neighbor 0.2 bin
neigh_modify delay 0 every 1 check yes one 300 page 3000

I got this summary for a chain of length N=2000 with a_mon=sigma and 39000 crowders of size a_crd=sigma in a cylindrical confinement (PBC along the axis of symmetry):

Run name: N2000epsilon5.0r10.5lz100.0sig1.0nc39000ens1

Loop time of 2012.52 on 32 procs for 5000000 steps with 41900 atoms

Performance: 1073283.905 tau/day, 2484.453 timesteps/s
86.2% CPU use with 32 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total

Pair | 129.02 | 134.35 | 141.52 | 33.9 | 6.68
Bond | 0.8848 | 7.9447 | 17.355 | 228.5 | 0.39
Neigh | 411.79 | 425.92 | 441.76 | 43.8 | 21.16
Comm | 469.04 | 525.47 | 585.83 | 144.9 | 26.11
Output | 14.212 | 16.203 | 23.934 | 43.7 | 0.81
Modify | 736.52 | 792.64 | 851.14 | 110.7 | 39.39
Other | | 110 | | | 5.46

Nlocal: 1309.38 ave 1394 max 1246 min
Histogram: 2 5 4 7 0 4 4 3 2 1
Nghost: 565.594 ave 625 max 529 min
Histogram: 4 5 3 4 4 7 3 1 0 1
Neighs: 1125.06 ave 1227 max 1021 min
Histogram: 2 0 3 5 6 7 2 4 0 3

Total # of neighbors = 36002
Ave neighs/atom = 0.859236
Ave special neighs/atom = 0.0954177
Neighbor list builds = 1114837
Dangerous builds = 0
System init for write_restart …
Total wall time: 5:46:28

Or, this one for a chain of length N=80 with a_mon=sigma and ~79000 crowders of size a_crd=0.2sigma in a cylindrical geometry (PBC along the axis of symmetry):

Run name: N80epsilon5.0r3.5lz29.25sig0.2nc78624ens1
Loop time of 21186.1 on 32 procs for 5000000 steps with 78704 atoms

Performance: 40781.384 tau/day, 236.003 timesteps/s
77.0% CPU use with 32 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total

Pair | 401.08 | 820.37 | 1599.5 |1301.8 | 3.87
Bond | 1.1507 | 2.2343 | 4.5161 | 71.9 | 0.01
Neigh | 13070 | 13233 | 13390 | 83.9 | 62.46
Comm | 3710.3 | 4199.8 | 4884.8 | 438.7 | 19.82
Output | 30.826 | 34.533 | 50.451 | 66.3 | 0.16
Modify | 1898.4 | 2544.4 | 3208.8 | 750.2 | 12.01
Other | | 352.2 | | | 1.66

Nlocal: 2459.5 ave 3641 max 1398 min
Histogram: 16 0 0 0 0 0 0 1 6 9
Nghost: 7992.56 ave 10665 max 5447 min
Histogram: 8 8 0 0 0 0 0 8 0 8
Neighs: 17812.8 ave 28692 max 9476 min
Histogram: 16 0 0 0 0 0 1 4 7 4

Total # of neighbors = 570010
Ave neighs/atom = 7.24245
Ave special neighs/atom = 0.00200752
Neighbor list builds = 481583
Dangerous builds = 0
Total wall time: 57:32:50

I changed the setup to the following one for the simulation below:

neighbor 0.2 bin
neigh_modify every 1 delay 0 check yes page 200000

Here, a chain of N=100 with monomer sizes a_mon=1sigma and 5sigma and ~124000 crowders of size a_crd=1.0sigma in free space (a cubic box):

Run name: N100dl5.0nl2l30dc1.0nc123759dtbdump2000adump20000ens1
Loop time of 16734.9 on 32 procs for 5000000 steps with 123859 atoms

Performance: 51628.506 tau/day, 298.776 timesteps/s
99.8% CPU use with 32 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total

Pair | 962.92 | 989.79 | 1138.8 | 100.3 | 5.91
Bond | 0.9434 | 1.746 | 4.1282 | 76.4 | 0.01
Neigh | 9956.5 | 10111 | 10160 | 41.7 | 60.42
Comm | 3045.6 | 3314.9 | 3558 | 268.8 | 19.81
Output | 125.85 | 126.21 | 127.03 | 1.9 | 0.75
Modify | 1625.1 | 1854.3 | 2109.3 | 352.8 | 11.08
Other | | 336.8 | | | 2.01

Nlocal: 3870.59 ave 3918 max 3828 min
Histogram: 2 4 3 3 5 5 1 6 1 2
Nghost: 13057.5 ave 13119 max 12978 min
Histogram: 1 2 3 0 4 4 8 6 3 1
Neighs: 9996.34 ave 10337 max 9764 min
Histogram: 4 3 4 5 5 2 5 3 0 1

Total # of neighbors = 319883
Ave neighs/atom = 2.58264
Ave special neighs/atom = 0.00159859
Neighbor list builds = 498549
Dangerous builds = 0
Total wall time: 55:18:44

Since I used the loop functionality of LAMMPS to split the dump files into chunks in the 1st and 3rd simulations above, the statistics shown for them are for 1 loop out of 10; the statistics for the other loops are very similar.

As you can see, more than ~50% of the simulation time is spent on communication and on building the neighbour lists. How can I reduce these two contributions?

The lammps version is 2019-08-07.

Please find attached the log file and lammps input I used for the last simulation.

Thanks for your help,
Amir

N100dl5.0nl2l30dc1.0nc123759dtbdump2000adump20000ens4.log (467 KB)

N100dl5.0nl2l30dc1.0nc123759dtbdump2000adump20000ens1.data (7.62 KB)

N100dl5.0nl2l30dc1.0nc123759dtbdump2000adump20000ens1.lmp (4.57 KB)

N80epsilon5.0r3.5lz29.25sig0.2nc78624ens1.data (6.59 KB)

N80epsilon5.0r3.5lz29.25sig0.2nc78624ens1.lmp (4.57 KB)

N80epsilon5.0r3.5lz29.25sig0.2nc78624ens1.log (400 KB)

to fully assess the situation, we need some additional information describing the hardware of the machine that you are running on, specifically:

  • what CPUs do those machines have?
  • how many sockets per node? how many cores per socket?
  • does it have hyper-threading enabled? and are they counted as cores?
  • what is the network type, if running on multiple nodes?
  • do you have exclusive use of the node(s) you are running on?

axel.

FYI, I have made some experiments with your input deck on a local large shared memory server (to minimize the impact of the communication hardware).
here are some observations:

  • since you use a repulsive-only potential, i.e. a very short cutoff, you have very few neighbors per atom and thus the time spent in the computing of Pair has to be rather short. this will artificially increase the relative cost of everything else, including the neighbor list processing and communication.
  • you have a load balancing problem due to a significant change to the box dimensions after the read_data command. the distribution of MPI ranks to subdomains is set when creating the (initial) box and thus based on wrong assumptions. you can compensate for that by using the processors command. due to the shape of the system, you should be using: either processors 1 1 * or processors 2 2 * with the latter only being available when the number of MPI ranks is divisible by 4.
  • your neighbor list parameters are OK but can be slightly optimized to neigh_modify delay 4 every 2 check yes
  • you may want to try avoiding to use fix recenter
  • when using hyper-threading there is a significant slowdown
  • 32 CPUs seems to be on the high end for a system with less than 100,000 atoms, i.e. you are already approaching the strong scaling limit.

bottom line: there are some optimizations in the 20-30% range possible, but the basic picture won’t change as that is due to the model itself.

axel.

Dear Axel,

I use the Graham cluster (https://docs.computecanada.ca/wiki/Graham) from Compute Canada. This is the network map of the Graham cluster: https://docs.computecanada.ca/mediawiki/images/b/b3/Gp3-network-topo.png
I copy-paste some relevant information from the Graham wiki page below:

  • what CPUs do those machines have?

Mostly, Intel E5-2683 v4 Broadwell @ 2.1GHz

  • how many sockets per node? how many cores per socket?
  • does it have hyper-threading enabled? and are they counted as cores?
  • what is the network type, if running on multiple nodes?

The CPU nodes are connected by a Mellanox EDR (100Gb/s) InfiniBand interconnect. "A central 324-port director switch aggregates connections from islands of 1024 cores each for CPU and GPU nodes." "A low-latency high-bandwidth InfiniBand fabric connects all nodes and scratch storage."

  • do you have exclusive use of the node(s) you are running on?
    I submit my jobs via the Slurm scheduler.

Thanks for your support.
Amir

ok. that fits with the numbers that I am getting on some newer hardware. so what I suggested in my other e-mail is pretty much all you can do. for the most part, when you have so few neighbors per atom, there is not much to do outside of reducing the overhead from fixes and dump styles.

I would recommend sticking with a single node. there is not much strong-scaling performance left on the table beyond a single node with 32 CPU cores for this system size. and since those are physical cores, there are no worries about hyper-threading.

axel.

Dear Axel,

FYI, I have made some experiments with your input deck on a local large shared memory server (to minimize the impact of the communication hardware).
here are some observations:

  • since you use a repulsive-only potential, i.e. a very short cutoff, you have very few neighbors per atom and thus the time spent in the computing of Pair has to be rather short. this will artificially increase the relative cost of everything else, including the neighbor list processing and communication.
  • you have a load balancing problem due to a significant change to the box dimensions after the read_data command. the distribution of MPI ranks to subdomains is set when creating the (initial) box and thus based on wrong assumptions. you can compensate for that by using the processors command. due to the shape of the system, you should be using: either processors 1 1 * or processors 2 2 * with the latter only being available when the number of MPI ranks is divisible by 4.

Is this a good strategy: I eliminate the change_box command, apply the change directly in the data file, and then read the data file?

According to the documentation, I need to use the processors command before read_data, or before the combination of read_data and change_box.
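
To make sure I understand the ordering, something like this sketch is what I have in mind (the file name and the 1x1 grid in x and y are placeholders):

processors      1 1 *                 # fix the MPI grid in x and y before the simulation box is created
read_data       cylinder_final.data   # data file that already contains the final box dimensions
# ... rest of the input unchanged, with no change_box command any more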

  • your neighbor list parameters are OK but can be slightly optimized to neigh_modify delay 4 every 2 check yes
  • you may want to try avoiding to use fix recenter

I will remove this fix and handle the recentering in post-processing.

  • when using hyper-threading there is a significant slowdown

As far as I know, the Graham cluster does not use hyper-threading.

Thanks for all your support.
Amir

Dear Axel,


Is this a good strategy: I eliminate the change_box command, apply the change directly in the data file, and then read the data file?

According to the documentation, I need to use the processors command before read_data, or before the combination of read_data and change_box.

it won't make a difference. given your system geometry, you want to use "processors 1 1 *" or "processors 1 2 *" or "processors 2 2 *" anyway to avoid a load imbalance in the x and y directions. the last is probably the fastest, but that can easily be found out empirically.

axel.

Dear Axel,

To test these ideas and find the best combination, do I need to run a whole simulation of 5.1*10^7 timesteps (equilibration + sampling) and measure the static and dynamic properties of interest? Or is it enough to run short simulations of 10^4 or 10^5 timesteps in equilibration mode and check the log file?

Thanks for your help,
Amir

short runs should do.
since you have random initial positions, you should do a short equilibration, perhaps 100 steps, as you would initially need more frequent neighbor list updates when particles are moving faster.
running for 1000 steps with production settings should then be enough to find the optimal setting.

if the timing is not reliable or the differences are too small to easily determine which setting is faster, you can increase those run lengths by one order of magnitude. i don't think any longer runs will provide better information.
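
in input-script terms, such a test could look roughly like the sketch below (the neighbor settings shown are just one of the combinations to try):

neighbor        0.3 bin
neigh_modify    delay 4 every 2 check yes
run             100                # brief equilibration from the random start
reset_timestep  0
run             1000               # short production-style run; compare timesteps/s in the log summary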

axel.

Dear Axel,

I ran a few tests with different numbers of CPUs as well as different settings of the processors and neigh_modify commands in LAMMPS.

Please see the attached figure, LAMMPS script, and data file. It seems that, for a system with ~80000 WCA particles and a short chain in cylindrical confinement, 8 CPUs with the following commands is the best choice:

neigh_modify delay 2 every 2 check yes page 100000 one 2000
processors 1 1 *

However, the overall decrease in the combined CPU time for communication and building of neighbor lists is less than 10%. For these tests, each system was run for 10000 timesteps.

Thanks for support.
Amir

N80epsilon5.0r3.5lz29.25sig0.2nc78624ens1.data (6.59 KB)

input.lmp (4.16 KB)

neighbor.pdf (15.5 KB)

I don't think there is much more that you can do. As already explained, the situation is due to the choice of model for the non-bonded interactions, which requires very few neighbors. The only additional tweak that I can think of would be to try increasing the neighbor list skin distance. I suggest scanning a range between 0.2 and 0.5 to find the optimal value. Given the small number of neighbors, you should see a small speedup before the cubic growth of the neighbor list with the (cutoff + skin) distance starts to dominate the savings from fewer neighbor list updates.
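
One way to scan those skin values in a single input is the same index/next/jump loop idiom you already use for splitting the dumps. A rough sketch, where the file name, styles, seeds, and run length are placeholders, and the atom/bond styles of the real input would have to be repeated after each clear:

variable        skin index 0.2 0.3 0.4 0.5
label           scan
clear                                  # wipes the system but keeps input-script variables
units           lj
pair_style      lj/cut 1.12246
read_data       test_system.data
pair_coeff      * * 1.0 1.0
fix             1 all nve
fix             2 all langevin 1.0 1.0 1.0 58204
neighbor        ${skin} bin
neigh_modify    delay 0 every 1 check yes
run             1000                   # compare the timings reported for each skin value
next            skin
jump            SELF scan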

Axel.

I should perhaps add that the performance of the neighbor list build crucially depends not only on the CPU speed, but also on the memory bandwidth. You are running on hardware where this is - by design - more limited than on machines with newer/faster hardware and fewer CPU cores per node.

Dear Axel

I did some test runs with rskin=0.2, 0.3, 0.4, 0.5, every=0, 1, 2, 4, and delay=0, 1, 2, 4, 10, 20 on 2-, 4-, and 8-core machines with 100000 atoms and 100000 timesteps. The variance in the sum of neigh_total and comm_total is:

  1. 4.23% (less than 4.77% for neigh and less than 0.21% for comm) on 2-core machine.
    The sum_pct = 64.50%(0.30%) (mean(sem))
    The neigh_pct = 62.90%(0.32%) (mean(sem))
    The com_pct = 1.61%(0.07%) (mean(sem))
  2. 4.64% (less than 5.22% for neigh and less than 0.09% for comm) on 4-core machine.
    The sum_pct = 66.52%(0.31%) (mean(sem))
    The neigh_pct = 63.94%(0.33%) (mean(sem))
    The com_pct = 2.58%(0.04 %) (mean(sem))
  3. 7.49% (less than 12.85% for neigh and less than 2.82% for comm) on 8-core machine.
    The sum_pct = 71.63%(0.39%) (mean(sem))
    The neigh_pct = 60.96%(0.52%) (mean(sem))
    The com_pct = 10.67%(0.24%) (mean(sem))

The relative error in the temperature, pressure, and total energy of the system is less than or equal to 0.1%.

The sets of every, delay, and rskin values that maximize timesteps/s are quite different for different numbers of cores.

With all this explanation, my question is this: if I use different combinations of the neighbor, neigh_modify, and processors commands for different numbers of cores, does the physics of the problem change?

For example, I use rskin=0.5, delay=10, every=1 on a 2-core machine for atoms<5000, while I use rskin=0.3, delay=20, every=4 on a 4-core machine for 5000<=atoms<30000.

If I assume that the simulation time changes linearly with the number of atoms at a fixed number of cores, is it reasonable to use the equation below to estimate the time needed to run a similar system with a different number of atoms over a longer time?

ttotal = (natoms/test_natoms)*(nsteps/test_nsteps)*test_ttotal # seconds

ttotal_hr = ttotal/3600 # hours

Please see the attached txt file for a summary of the performance and the MPI task timing breakdown from all my test runs.

Thanks,
Amir

neighbor_testing.txt (48.1 KB)

Dear Axel


With all this explanation, my question is this: if I use different combinations of the neighbor, neigh_modify, and processors commands for different numbers of cores, does the physics of the problem change?

not for as long as you don't get any "dangerous builds" in the neighbor list statistics. once those show up, there is a finite chance that you are missing pairwise interactions because the neighbor list has not (yet) been updated as needed.

the trajectories for the different settings will diverge exponentially since MD is a chaotic system, but the statistically converged averages must be the same, IFF the simulation is ergodic and sufficiently samples the representative sections of phase space. but ergodicity is a completely different topic and relates to the necessary duration of simulations in general, the construction of models, and the likelihood of "rare events".

For example, I use rskin=0.5, delay=10, every=1 on a 2-core machine for atoms<5000, while I use rskin=0.3, delay=20, every=4 on a 4-core machine for 5000<=atoms<30000.

If I assume that the simulation time changes linearly with the number of atoms at a fixed number of cores, is it reasonable to use that equation to estimate the time needed to run a similar system with a different number of atoms over a longer time?

this, again, is a different question from what we discussed before. the strong and weak scaling of a specific system with a specific piece of software has its own peculiar behavior, and many factors influence it: the number of cores in use, memory bandwidth, the sizes of the CPU caches, the scaling behavior of the features of the MD code in use, communication performance, etc.

axel.

amir,

please also let me remind you that you are starting to vastly overengineer this whole thing.
there are obviously not too many improvements to be made in total, and you are now running the risk of spending more time on finding the optimal settings than the amount of time you are going to save by them. please recall that it often takes rather little effort to go 80-90% of the way and then a lot more to go the rest.
i would just choose a reasonable setting that gives decent performance for most configurations and then be done with it.

axel.