Strategies for improving performance with kspace pppm

I have been wrestling with excessively slow computation time for my system. A few key pieces of information are below:

57675 atoms total
25 polymer chains (2307 atoms per chain)

Box dimensions:
0 50 xlo xhi
0 50 ylo yhi
-10 1200 zlo zhi

Init file:
units real
atom_style full
bond_style harmonic
angle_style harmonic
dihedral_style opls
improper_style harmonic
pair_style hybrid/overlay lj/smooth 8.25 11.25 coul/long 11.25
kspace_style pppm 1.0e10-3
kspace_modify gewald 0.1
special_bonds lj/coul 0.0 0.0 0.5
neigh_modify one 50000
neigh_modify page 500000

The pair_style cutoffs were set to 3*sigma, using the largest sigma value in the pair_coeff entries of the atom types in use.

The simulation box starts out quite large because the polymers are initially in a fully extended conformation. As I understand it, this means processors get assigned to regions that become empty space once the polymer chains collapse onto themselves, resulting in load imbalance and poor performance.

To address this, I ran a brief NPT run using coul/cut instead of coul/long, and while doing so implemented fix balance rcb with comm_style tiled to see if that would speed it up; it did, a little. Once that simulation ended (150k timesteps or so at dt = 1 fs), I took the data.restarter file from the end of that run (a smaller, more compact simulation box containing the contracted “polymer clump”) and used it as the input for a second simulation, this time with kspace pppm and fix balance shift with comm_style brick. It does run more quickly than previous iterations (3.609 timesteps/s vs. 0.63 timesteps/s), but this is still quite slow for a relatively small system, and the MPI breakdown shows inordinately large %varavg values for most categories (see out.682639), which suggests load imbalance. CPU usage is decent (86%) for 32 cores, and communication overhead is at least acceptable (5.8% of the total MPI time). For both the faster simulation and the slower iteration, I set the pppm accuracy to 1.0e-3.
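
For reference, the balancing setup in the two runs was along these lines (paraphrased; the exact fix IDs, frequencies, and thresholds are in the attached inputs):

# first run (coul/cut), recursive bisectioning
comm_style tiled
fix bal all balance 1000 1.1 rcb

# second run (pppm), shift balancing on a brick decomposition
comm_style brick
fix bal all balance 1000 1.1 shift xyz 10 1.1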

Aside from not using long-range electrostatics and relying on fix balance, what options are available to me for speeding this up? Attachments are for the faster, second simulation with kspace. (I should note that I removed neigh_modify one and page from the second, faster simulation, which should help.)

Not sure if this matters, but I do have to set kspace_modify gewald 0.1 to get the simulation to run. I am also using suffix intel.

Any help appreciated.

Kind regards,
Sean
out.682639 (3.0 MB)
polysystem45.in (1.1 KB)
polysystemNEW45_kspace.in.init (311 Bytes)


Since “fast” and “slow” are relative terms, it is difficult to start a discussion without a reference for the performance that should be expected on the machine you are running on. To have a system that is somewhat comparable to what you are running, I suggest you take the “in.rhodo” and “data.rhodo” files from the “bench” folder of the LAMMPS source distribution and then make the following changes (the changed lines are sketched after the list):

  • after read_data... add the line: replicate 2 1 1
  • change pair_style... to: pair_style lj/charmm/coul/long 9.25 11.25
  • change thermo 50 to thermo 500
  • change timestep 2.0 to timestep 1.0
  • change run... to run 1000
  • delete the fix shake line
    in.rhodo.txt (547 Bytes)
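
Applying those changes, the affected lines of the benchmark input would look roughly like this (only the changed or added lines are shown; the attached in.rhodo.txt is the authoritative version):

read_data  data.rhodo
replicate  2 1 1
pair_style lj/charmm/coul/long 9.25 11.25
thermo     500
timestep   1.0
run        1000
# (and the fix shake line deleted)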

When I run this on my desktop with a quad-core Intel i5-10210U CPU @ 1.60GHz (the actual clock is 3.4GHz thanks to TurboBoost), I get the following timing summary:

Loop time of 154.739 on 4 procs for 1000 steps with 64000 atoms

Performance: 0.558 ns/day, 42.983 hours/ns, 6.462 timesteps/s, 413.599 katom-step/s
99.7% CPU use with 4 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 117.99     | 119.25     | 120.59     |  10.9 | 77.06
Bond    | 5.1121     | 5.1871     | 5.2657     |   3.2 |  3.35
Kspace  | 13.524     | 14.94      | 16.263     |  32.8 |  9.65
Neigh   | 11.692     | 11.692     | 11.693     |   0.0 |  7.56
Comm    | 0.92444    | 0.92909    | 0.9329     |   0.4 |  0.60
Output  | 0.00019183 | 0.00020491 | 0.00024223 |   0.0 |  0.00
Modify  | 2.4963     | 2.5262     | 2.5394     |   1.1 |  1.63
Other   |            | 0.2199     |            |       |  0.14

I suggest you run it also with 4 processors and then also with 8, 16, and 32 processors.
That will give us a reference point to discuss what is fast/slow for this kind of system and then we can start looking into how much potential for improvement is in your simulation.

A few general observations from the bits of information that you provide. If you want proper help, you need to provide a complete input.

No, you don’t. Your input file uses the value “1.0e10-3”, which will be interpreted as “1.0e10”. However, that value will lead to a crash because it causes a division by zero.
It should be an error, but the current tests are not (yet) smart enough to catch this.
They will be after the next patch release. :wink:

Using such a large value is, of course, bogus.

That one is too small for the given cutoff. You want something like 0.25 or even 0.3.
You need to set this value because of the bogus input for the kspace style convergence.
If you correct that mistake, this will not be needed.
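
In other words, something like the following (the accuracy value here is just a placeholder; pick the one you actually intend):

kspace_style pppm 1.0e-4
# kspace_modify gewald ... is no longer needed once the convergence value is valid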

Why such a long range for smoothing the LJ potential? 1-2 Å should do. At a cutoff of over 10 Å there should not be much need to smooth the potential at all, since it has already shrunk to less than a millionth at the cutoff. Using lj/cut/coul/long will be more efficient than using hybrid/overlay.
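
That is, a single pair style along these lines (the cutoff value is taken from your current coulomb cutoff as an example):

pair_style lj/cut/coul/long 11.25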

There is no indication in your output that the run was using that suffix and most likely your executable does not even include the INTEL package. If it had the package included, there would have been an error that you must not use an intel suffix style without the package intel command. The package command is inserted automatically with default values when you use the -suffix command line flag, but not with the suffix command in the input.

Since the lj/smooth pair style is not supported at all by the INTEL package, there is not a lot to gain from using the intel suffix in the first place: KSpace consumes about 20% of the total time, so even if the /intel version were twice as fast, that would only yield about a 10% overall speedup.
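
For reference, the two ways of enabling the suffix differ roughly as sketched below (the binary name lmp and the package arguments are assumptions; check the package and suffix documentation pages for the defaults of your build):

# on the command line, a default package command is inserted automatically:
#   mpirun -np 32 lmp -sf intel -in polysystem45.in
# inside the input script, the package command has to be given explicitly:
package intel 1
suffix  intel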

The timestep of 1 fs could be too large for a molecular system with hydrogen atoms (leading to bad energy conservation and occasional “lost bond atom” errors on longer runs). Similarly, the default neighbor list settings of “every 1 delay 10” are not suitable for this timestep, either. This is confirmed by the reported dangerous builds.
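
A minimal sketch of what I would use instead (the values are suggestions, not taken from your input):

timestep     0.5
# or keep 1.0 fs and constrain the bonds involving hydrogen with fix shake
neigh_modify every 1 delay 0 check yes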

That is all useful feedback, thank you.

First, regarding the hardware: I ran the benchmark as you specified. The results indicate that the node(s) my jobs get submitted to are quite slow compared to your machine. (It is possible to request specific hardware, but such a request is not always granted; in that case, the job is dispatched to whatever hardware is available.)

All jobs were run with exclusive use of the node.

Loop time of 779.983 on 4 procs for 1000 steps with 64000 atoms

Performance: 0.111 ns/day, 216.662 hours/ns, 1.282 timesteps/s
99.4% CPU use with 4 MPI tasks x no OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 592.21     | 604.02     | 616.24     |  35.4 | 77.44
Bond    | 14.795     | 15.39      | 15.989     |  12.0 |  1.97
Kspace  | 73.959     | 86.778     | 99.187     |  97.3 | 11.13
Neigh   | 62.809     | 62.812     | 62.814     |   0.0 |  8.05
Comm    | 2.0233     | 2.1138     | 2.194      |   5.4 |  0.27
Output  | 0.00058985 | 0.00061095 | 0.000633   |   0.0 |  0.00
Modify  | 7.9431     | 8.2643     | 8.6194     |  10.4 |  1.06
Other   |            | 0.6015     |            |       |  0.08

Loop time of 413.848 on 8 procs for 1000 steps with 64000 atoms

Performance: 0.209 ns/day, 114.958 hours/ns, 2.416 timesteps/s
99.0% CPU use with 8 MPI tasks x no OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 309.25     | 316.03     | 328.98     |  38.0 | 76.36
Bond    | 7.7572     | 8.0098     | 8.4553     |   9.7 |  1.94
Kspace  | 37.822     | 50.801     | 57.147     |  94.4 | 12.28
Neigh   | 32.783     | 32.784     | 32.784     |   0.0 |  7.92
Comm    | 1.5667     | 1.6087     | 1.7378     |   5.3 |  0.39
Output  | 0.0003798  | 0.00039083 | 0.00041294 |   0.0 |  0.00
Modify  | 3.8128     | 4.3006     | 4.4635     |  12.5 |  1.04
Other   |            | 0.3167     |            |       |  0.08

Loop time of 210.754 on 16 procs for 1000 steps with 64000 atoms

Performance: 0.410 ns/day, 58.543 hours/ns, 4.745 timesteps/s
98.8% CPU use with 16 MPI tasks x no OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 153.96     | 159.06     | 166        |  32.9 | 75.47
Bond    | 3.8178     | 4.0161     | 4.2781     |   8.0 |  1.91
Kspace  | 20.685     | 27.652     | 32.924     |  79.1 | 13.12
Neigh   | 16.387     | 16.391     | 16.392     |   0.0 |  7.78
Comm    | 1.1936     | 1.226      | 1.3878     |   5.0 |  0.58
Output  | 0.00035    | 0.00037323 | 0.00047302 |   0.0 |  0.00
Modify  | 1.6154     | 2.1944     | 2.2948     |  13.8 |  1.04
Other   |            | 0.2132     |            |       |  0.10

Loop time of 90.1296 on 32 procs for 1000 steps with 64000 atoms

Performance: 0.959 ns/day, 25.036 hours/ns, 11.095 timesteps/s
99.8% CPU use with 32 MPI tasks x no OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 62.114     | 65.862     | 70.258     |  24.1 | 73.07
Bond    | 1.4317     | 1.6227     | 1.8117     |   8.8 |  1.80
Kspace  | 9.3894     | 13.844     | 17.389     |  52.8 | 15.36
Neigh   | 6.9299     | 6.934      | 6.9374     |   0.1 |  7.69
Comm    | 0.76314    | 0.79615    | 0.83843    |   2.4 |  0.88
Output  | 0.00030327 | 0.00032578 | 0.00035286 |   0.0 |  0.00
Modify  | 0.82642    | 0.94382    | 1.025      |   5.9 |  1.05
Other   |            | 0.1269     |            |       |  0.14

Quite right. Thank you for catching this. (shakes head)

Also quite right.

This is good to know. (A more senior member in my group suggested I use a smoothed potential.)

Sure enough. When including the flag at the command line, I receive an error that the INTEL package is not available. I’ll have to change that. (I’m using the distribution of lammps currently installed on the public high-performance computing drives… knowing what I know now, I may get my own.)

All of the inputs (with the revisions mentioned above) are attached.
polysystem45.in (1.1 KB)
polysystemNEW.in.charges (2.0 MB)
polysystemNEW45_kspace.in.init (277 Bytes)

The data and settings files are a bit large. Please see if this link allows you to download them from my protondrive:

Here are some more observations.

  • your system has 906 atom types, but most of the parameters are the same. Why? The simulation would be more efficient if you used only unique atom types.
  • your data file has all charges already set. Why set them again from the input?
  • your pair style uses a cutoff of 10 Å, but your pair_coeff settings override it with 15 Å. These days a 12 Å cutoff is accepted as a good compromise between accuracy and performance. Using a longer cutoff just wastes time and makes the pair style slower.
  • your polymer geometry is anisotropic, so why use fix npt with isotropic box changes? With anisotropic box changes you should be able to get rid of the vacuum regions causing the load imbalance (a sketch follows below).
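
For that last point, a minimal sketch of an anisotropic barostat setup (the temperature/pressure targets and damping constants are placeholders, not values from your input):

fix 1 all npt temp 300.0 300.0 100.0 aniso 1.0 1.0 1000.0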

I made some quick checks on my desktop.

  • With the original input I get: 3.242 timesteps/s
  • Changing the LJ cutoff to 10.0 (like the coulomb cutoff) I get: 6.282 timesteps/s (almost double)
  • Switching the pair style from hybrid/overlay lj/smooth coul/long to lj/cut/coul/long I get: 9.267 timesteps/s
  • Using the optimized styles from the OPENMP package w/o OpenMP using -suffix omp gives: 10.898 timesteps/s
  • Turning on Hyper-threading and using 2 OpenMP threads with -suffix omp gives: 12.560 timesteps/s (won’t likely work on an HPC cluster).

So the problem is not really with pppm, and with some minimal changes there is almost a 4x speedup. However, if you want accurate results, you need to increase the cutoff to 12 Å and also use a tighter pppm convergence (1.0e-5), and that will cost about half of the speedup. I am still getting 7.769 timesteps/s with those settings, which is still about 2.5x faster than the original input and significantly more accurate.
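
Put together, the production-quality settings amount to something like this (sketch):

pair_style   lj/cut/coul/long 12.0
kspace_style pppm 1.0e-5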

With only 4 MPI processes, there is not much load-imbalance, either.

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 48.829     | 52.127     | 54.637     |  29.3 | 64.44
Bond    | 7.3815     | 7.4357     | 7.4964     |   1.9 |  9.19
Kspace  | 12.32      | 14.886     | 18.142     |  54.8 | 18.40
Neigh   | 4.2006     | 4.2031     | 4.2056     |   0.1 |  5.20
Comm    | 0.66197    | 0.70757    | 0.74146    |   3.4 |  0.87
Output  | 0.042411   | 0.047718   | 0.063424   |   4.2 |  0.06
Modify  | 1.3149     | 1.3411     | 1.3623     |   1.6 |  1.66
Other   |            | 0.1411     |            |       |  0.17


That is really slow, especially considering that my executable is not fully optimized and my CPU is a laptop-class CPU (it sits in an Intel NUC box). With a fully optimized executable, the calculation would probably run another 20-30% faster.

Here is another tweak to get some extra performance: my CPU has an integrated Intel UHD Graphics GPU, and with a sufficiently recent version of LAMMPS (as a developer, I usually use - of course - the most recent development code) and the required support libraries from Intel, it is possible to use that GPU with the GPU package via OpenCL. Using 4 OpenMP threads and 1 MPI process with -suffix hybrid gpu omp -pk gpu 0 pair/only yes, I can more than double the CPU-only performance from 7.77 to 17.528 timesteps/s. Not bad for a “wimpy” laptop CPU. :smile:
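
For completeness, the command line for that last test looks roughly like this (the binary and input file names are placeholders):

export OMP_NUM_THREADS=4
mpirun -np 1 lmp -sf hybrid gpu omp -pk gpu 0 pair/only yes -in in.polymer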

Thank you for all of these helpful tips! I’m working through them but wanted you to know I have seen them and appreciate your help.