Acceleration for ELECTRODE package

Hi there,

I am using the ELECTRODE package with “algo cg” (a system restriction prevents me from using the matrix algo option) to assign charges to two electrodes. Compared to the uncharged system, the ‘Modify’ section time in the LAMMPS log file increases significantly. I wonder whether I could set the update frequency for computing charges, e.g. not every time step? And would any accelerator package improve this performance? Currently I am considering the INTEL package and the OPT package.

Many thanks, Catherine

Hi Catherine,

What magnitude of extra time are you seeing?

In general, per timestep, the conjugate gradient equilibration incurs an extra cost of N CG steps * 1 electrostatic evaluation per CG step. For a typical system this would mean 2-10x more CPU time per timestep. Sadly this is the “cost of doing business”, but it should not increase the simulation costs by more than an order of magnitude (in which case you might as well use DFT, ML MD, …)
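If the number of CG iterations per step is what hurts, one knob worth a look is the convergence tolerance passed to “algo cg” on the electrode fix: a looser tolerance means fewer electrostatic evaluations per timestep, at the cost of less tightly converged electrode charges. A hypothetical sketch (the group names, potentials and eta value are placeholders; check the fix electrode documentation for the exact syntax in your LAMMPS version):

# placeholder groups "bot"/"top"; eta and potentials are illustrative
fix conp bot electrode/conp -1.0 1.979 couple top 1.0 algo cg 1e-4

Whether a looser tolerance is acceptable depends on how sensitive your observables are to the residual charge error, so test it against a tightly converged reference first.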

It is important to make sure that your system is well-behaved and that the load is divided evenly between processors – each processor should “own” similarly sized chunks of electrode. It is often useful to set

processors * * 2

at the start of the simulation – in most settings where there are two electrodes, one at each z-end, this will help ensure half the processors work on the positive z electrode and half work on the negative z electrode.
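For a concrete 100-proc layout like yours, one could also pin the grid explicitly so that z is guaranteed to be split exactly in two (the x and y counts here are illustrative, not a recommendation):

# 5 x 10 x 2 = 100 MPI tasks, with z forced to two slabs
processors 5 10 2

With this mapping, the 50 processors owning the lower-z half of the box handle one electrode and the other 50 handle the second.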

Hi Shern, thanks for the info. Yeah, that makes sense. The charged case takes about 5x as long as the uncharged case, and I am using a 2 by 5 by 10 MPI processor grid.

For the uncharged case:

Loop time of 19916.8 on 100 procs for 2000000 steps with 50166 atoms

Performance: 8.676 ns/day, 2.766 hours/ns, 100.418 timesteps/s, 5.038 Matom-step/s
96.3% CPU use with 100 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 11.866     | 9095.9     | 11760      |3495.1 | 45.67
Bond    | 1.3997     | 567.69     | 732.62     | 903.9 |  2.85
Kspace  | 3820       | 6723.9     | 16618      |4396.8 | 33.76
Neigh   | 867.62     | 879.74     | 900.75     |  23.6 |  4.42
Comm    | 672.09     | 1282.4     | 1626.8     | 563.2 |  6.44
Output  | 0.90518    | 1.4173     | 1.9936     |  27.5 |  0.01
Modify  | 1072.4     | 1223.6     | 1655.4     | 442.6 |  6.14
Other   |            | 142.1      |            |       |  0.71

and for the charged case:

Loop time of 83859.3 on 100 procs for 2000000 steps with 50166 atoms

Performance: 2.061 ns/day, 11.647 hours/ns, 23.849 timesteps/s, 1.196 Matom-step/s
99.3% CPU use with 100 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 1.8363     | 17775      | 27132      |7534.6 | 21.20
Bond    | 2.7797     | 858.05     | 1275.5     |1601.3 |  1.02
Kspace  | 5502.8     | 15267      | 34126      |8554.0 | 18.21
Neigh   | 1932.4     | 1948.7     | 2005.6     |  47.1 |  2.32
Comm    | 52.061     | 1289.1     | 1703.7     |1353.9 |  1.54
Output  | 1.8335     | 2.9432     | 4.1194     |  39.2 |  0.00
Modify  | 46170      | 46571      | 47591      | 200.3 | 55.53
Other   |            | 148.5      |            |       |  0.18

And I was wondering whether the large increase in ‘Modify’ section time is reasonable, or whether there is further room to improve performance?

Many thanks, C

As @srtee hinted at, you could try to improve your load balance. You have quite a difference between the min, max, and average times for Pair and Kspace, which together consume about 40% of your total time. Distributing this work more evenly across the processors should provide some speedup. The order of steps to improve this: first use the processors command, then the balance command for a “cheap” one-time adjustment, and if that provides no relief, try fix balance (for dynamic adjustments if the geometry changes a lot) or comm_style tiled (for a different partitioning scheme). Both of the latter have additional overhead and thus have to be tested carefully. See the documentation of the individual commands for more info and also check out: 7. Accelerate performance — LAMMPS documentation
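A sketch of that sequence, in increasing order of overhead (the thresholds and iteration counts here are illustrative and should be tuned for the actual system):

# one-time static rebalance of the processor grid cuts
balance 1.1 shift xz 10 1.1

# or: dynamic rebalancing every 1000 steps if the geometry drifts
# fix bal all balance 1000 1.1 shift xz 10 1.1

# or: a tiled (recursive bisection) decomposition for strongly
# non-uniform systems
# comm_style tiled
# balance 1.1 rcb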


Hi Catherine,

These are reasonable usage patterns. Notice how both Pair and Kspace also consume more time when using ELECTRODE to charge the electrodes – this indicates that some of the extra work goes into calculating the Coulombic interactions of the now-charged electrode particles.

Other than @akohlmey’s suggestion to improve the balance, I would also recommend looking at using fewer MPI procs per run.

Right now you have 2ns/day running one system on 100 procs. Let’s say you could get 1ns/day running on 32 procs instead (seems feasible given the large imbalances). If you ran three separate runs each with their own 32 procs, you would get 3ns/day in total, a 1.5x increase in simulation speed. (You could run those 3 simulations at 3 different electrode voltages, or from 3 different initial coordinates.)
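One way to set this up within LAMMPS itself is its multi-partition mode, using a world-style variable so that each partition picks up its own voltage (the 3x32 split, the voltage values and the fix arguments below are purely illustrative placeholders):

# launch with something like: mpirun -np 96 lmp -partition 3x32 -in in.electrode
# each partition takes one value from this list, in order
variable V world 0.5 1.0 1.5
# hypothetical electrode fix reading the partition-specific potential
fix conp bot electrode/conp v_V 1.979 couple top 0.0

Each of the three partitions then runs its own independent trajectory at its own applied potential.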

Also check that you are using computing resources effectively given your cluster’s setup. Most clusters have CPUs packaged in chipsets of 8 or 12 cores, so it is quite likely that a 100-proc run is not utilising full chipsets and is leaving a bit of performance on the table. Furthermore, MPI communication within a node is often faster than between nodes, so if your system can fit onto a single node at all, that will help reduce unnecessary communication latency.


@akohlmey @srtee Thanks Alex and Shern for the useful suggestions. Good to know that my current setup is reasonable; I will definitely try these approaches to optimize the performance. Many thanks, C