Kspace calculations taking >99% of task time


I am trying to run MD of a single atomistic polymer chain. The simulation is exceedingly slow, apparently due to long-range charge interaction calculations (kspace). I have read about the following techniques for speeding up the simulation:

  • Using intel acceleration suffix (suffix intel)
  • Running on GPUs and using associated acceleration packages
  • Partitioning processors (partition command with run_style verlet/split)
  • Changing kspace order (kspace_modify)
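For reference, here is roughly how I am invoking these (a sketch; the exact package settings depend on the LAMMPS build and hardware):

```
# Intel acceleration (requires a build with the INTEL package)
package intel 0
suffix intel

# Split real-space and kspace work across two partitions
# (launched e.g. with: mpirun -np 9 lmp -partition 8 1 -in in.script)
run_style verlet/split

# Lower the PPPM interpolation order (default is 5)
kspace_modify order 3
```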

I have run a simulation using all of these techniques except the GPUs, which I am still working on. The attachments are the LAMMPS input and log files for this simulation, run with 9 MPI tasks on two partitions (8x1). While CPU efficiency was >90% for both partitions, kspace accounted for >99% of the MPI task time in both cases.

After 7 days, only ~35,000 time steps had elapsed.

I have tried running a similar simulation with more processors, which was “faster,” but efficiency suffered. I realize this is partly due to the density gradient through the simulation box (a single chain stretched diagonally through a cube), which could be helped by the balance command (which I have not tried yet).
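If I do try balancing, I gather it would look something like this (a sketch; the 1.1 imbalance threshold and the rebalance interval are illustrative values):

```
# one-time rebalance, shifting domain boundaries in all dimensions
balance 1.1 shift xyz 10 1.1

# or rebalance every 1000 steps during the run
fix bal all balance 1000 1.1 shift xyz 10 1.1
```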

Any pointers would be appreciated.

Kind regards

log.lammps.0 (322.5 KB)
log.lammps.1 (322.5 KB)
polysystemNEW20.in (1.2 KB)

You have only 2307 atoms in a box of 1000x1000x1000 Angstrom^3.
Why do you use kspace and the corresponding long-range Coulomb pair styles at all?
With kspace styles, especially pppm, the computational effort scales with the volume of the box.

Please also note that you have a serious load imbalance of the real-space domain decomposition:

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
Pair    | 0.0022991  | 1.8345     | 10.013     | 239.6 |  0.05
Bond    | 0.0064683  | 1.4092     | 7.6948     | 210.0 |  0.04

Compare “avg time” with “max time”. Those should be similar.

I would suspect that using a long cutoff instead of kspace and OPENMP parallelization instead of MPI may result in much better performance without much loss of accuracy.
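As a sketch, that switch might look like the following (the cutoff values are illustrative; you should check energies and forces against the long-range setup, and the thread count must match your hardware):

```
# no kspace solver needed with a cutoff Coulomb pair style
kspace_style none
pair_style lj/cut/coul/cut 12.0 30.0   # LJ cutoff 12 A, Coulomb cutoff 30 A

# OpenMP threading instead of (or in addition to) MPI
# (requires a build with the OPENMP package)
package omp 8
suffix omp
```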

I see, thanks! I’m just getting started with MD. I’ll try it with a cut pair style instead of a long one.