How to reduce the kspace timings

Dear LAMMPS users,

I am using LAMMPS (stable version, 3 Mar 2020) to model the evaporation of an n-alkane nano-droplet (OPLS-AA force field, 55,200 atoms in total in the liquid phase, placed at the center of a box with an edge length of 50 nm) in nitrogen (95,500 atoms in total in the gas phase, zero charge and constrained with fix shake). The issue with my simulation is that the PPPM kspace time is very large (82.17%), as shown in the log file below, although I know it is normal for the long-range Coulomb calculation to take a substantial share of the time. I have tried some of the methods suggested in previous mailing list threads to reduce the kspace time. For example, I tried to adjust the mesh size and interpolation order with kspace_modify order 2/4/6/7; comparing the different orders over 5000 steps, the kspace time still stays around 85%. I also tried fix tune/kspace 100, but it stopped with the error: Fatal error in MPI_Sendrecv: Message truncated, error stack. Message from rank 257 and tag 0 truncated; 2560 bytes received but buffer size is 4.

I would really appreciate it if anyone could give me some suggestions on how to reduce the kspace time and accelerate the simulation of this kind of system.

Best regards,

Alishanda

The relevant parameters of my input script:

pair_style hybrid lj/cut/coul/long 12.0 12.0

pair_modify mix geometric tail yes

kspace_style pppm 1.0e-5

minimize 1.0e-4 1.0e-6 100 1000

reset_timestep 0

fix SHAKE all shake 0.0001 20 0 b 19 20 21……

velocity all create 360 12345

neighbor 2.0 bin

neigh_modify delay 0 every 1 check yes

timestep 2.0

fix 1 all nvt temp 360 360 200

……

The log file:

Loop time of 63457.5 on 480 procs for 1000000 steps with 104430 atoms

Performance: 2.723 ns/day, 8.814 hours/ns, 15.759 timesteps/s

99.7% CPU use with 480 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:

Section | min time | avg time | max time | %varavg | %total

……

You have two problems with how you run your system:

  1. You use too many MPI ranks. The 3d-FFTs required for PPPM cannot scale to an arbitrarily large number of MPI ranks. Please note that a 3d-FFT is essentially 3 groups of 1d-FFTs, each group working along one "line" of the "brick" of grid data, so the FFTs can only be parallelized in 2d (as "pencils"), and the entire data brick has to be transposed between the groups; in addition, the grid data has to be assembled from the domain-decomposed atom data before the forward FFTs and the potential data has to be distributed back to the subdomains after the backward FFTs. Because of this, the parallel efficiency drops as more MPI ranks are used: there is less data to work on per processor while the communication overhead grows, so at some point that overhead becomes dominant and it is faster to run with fewer MPI ranks. There are two options to speed up the calculation without increasing the number of MPI ranks used for the 3d-FFTs: you can parallelize with a combination of OpenMP and MPI, or you can use the verlet/split run style and do a multi-partition run where the PPPM calculation is done on a separate partition with fewer MPI ranks (the other partition must be an integer multiple of the PPPM partition, i.e. you can have a 1:1, 1:2, 1:3 etc. split). A possible launch is sketched after this list.

  2. Your timing summary also shows a massive load imbalance: the time spent in Pair ranges from 24.562 s to 38292 s, so some MPI ranks are largely idle. This can be addressed with the balance command (see the sketch after this list), since from your description your particle distribution is not homogeneous, while LAMMPS sets up the domain decomposition it uses for parallelization under the assumption that it is.
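
As a rough sketch only (the rank counts, rebalancing settings, executable name lmp_mpi, and input file name in.droplet below are placeholders that have to be tuned for your machine and system, not recommendations), the verlet/split approach needs just the run style in the input plus a two-partition launch where the first partition is an integer multiple of the second:

run_style verlet/split

mpirun -np 240 lmp_mpi -partition 192 48 -in in.droplet   # 4:1 split, and fewer ranks overall than your 480

Check the run_style verlet/split documentation for how the processor grids of the two partitions have to map onto each other. For the load imbalance, you could combine a one-time rebalance before the run with periodic rebalancing during it, since the droplet shrinks as it evaporates and the imbalance changes over time:

balance 1.1 shift xyz 20 1.1

fix lb all balance 1000 1.1 shift xyz 20 1.05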

> I also tried fix tune/kspace 100, but it stopped with the error: Fatal error in MPI_Sendrecv: Message truncated, error stack. Message from rank 257 and tag 0 truncated; 2560 bytes received but buffer size is 4.

Fix tune/kspace won't really help in your case, since the problem lies elsewhere; besides, the version in your LAMMPS executable is broken, and it only got fixed rather recently.

What it does is adjust the real-space Coulomb cutoff, as that determines how much of the computation is spent in real space and how much in kspace. When not accounting for MPI overhead, the real-space cutoff is most of the time already optimal, since its minimum value is set by the cutoff required for the Lennard-Jones part (for a Coulomb-only system it could be shorter, since the FFTs scale as O(N*log(N)) and the real-space part as O(N**2), but since you need to build and walk the neighbor list based on the larger of the Lennard-Jones and Coulomb cutoffs, you can just stick with the LJ cutoff for Coulomb as well). With very many MPI ranks, however, the FFT cost scales much less favorably than that, for the reasons listed under point 1 above.
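
For reference (the 15.0 below is an illustrative value only, not a recommendation), the Coulomb cutoff of lj/cut/coul/long is its optional second cutoff argument, so shifting work from kspace to the pair computation amounts to raising that number; PPPM then reaches the requested accuracy with a coarser grid:

pair_style hybrid lj/cut/coul/long 12.0 15.0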

In summary: 1) run with fewer MPI ranks (that alone should be faster); 2) then address the load-balancing issue; 3) then see if you can improve overall performance by making the Coulomb cutoff moderately larger (up to 15-20 Å); 4) see if you can improve performance further by using OpenMP on top of MPI or verlet/split. OpenMP for pair styles is quite efficient for moderate numbers of OpenMP threads (2-8).
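
As a sketch only (the rank and thread counts are placeholders, and this assumes your LAMMPS binary was built with the USER-OMP package), point 4 without verlet/split could look roughly like this:

mpirun -np 120 lmp_mpi -sf omp -pk omp 4 -in in.droplet   # 120 ranks x 4 threads instead of 480 x 1

Make sure your MPI launcher binds ranks so that the threads of each rank run on cores of the same socket.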

At this point, there are likely other issues with your input that may also interfere with optimal performance.

axel.