[lammps-users] The efficiency of PPPM

Dear LAMMPS users,

I am currently using PPPM to calculate the long-range Coulomb forces in my system, and all the other short-range interactions are LJ (12-6) with a cutoff of 12 A. I set

kspace_style pppm 1.0e-4
kspace_modify order 4

The total number of atoms in my system is about 63,000 and the simulation box is about 315*200*55 A^3. By testing different processor counts, I found that the computing time almost converges to its minimum at around 72 procs on my server. The Kspace time was around 80% of the total computing time, and 10,000 time steps took about half an hour.

This still feels very slow for a code parallelized by domain decomposition. Is there any way to speed it up? How can I find the most efficient settings for the PPPM calculation in LAMMPS?

Your kind help would be most appreciated.

Best
Yajie

please note that while domain decomposition has in principle linear
scaling, PPPM does not(!). since it requires 3d FFTs, it can at
best scale as O(N*log(N)), and in particular the very high demand for
communication during the transposes required for the 3d FFT limits
the parallel scaling.

there is no easy way out of this. you are already using a rather loose
convergence tolerance for the kspace, so reducing the grid further would
lead to instabilities.
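
for reference, the grid pppm picks is driven by that accuracy setting; it can
be overridden by hand with kspace_modify mesh, but forcing it coarser than what
1.0e-4 demands is exactly the kind of change that causes such instabilities.
a hedged sketch with purely illustrative grid values:

kspace_style  pppm 1.0e-4
kspace_modify mesh 120 80 24   # explicit grid override; numbers are illustrative,
                               # do not go coarser than the accuracy-chosen grid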

if you only use LJ 12-6 and coulomb, you could try running the same system
with a code like NAMD, which can hide the communication latencies because it
is built on a different middleware called charm++.

depending on your hardware, you may be better off not using all processor
cores. intel core2 architecture machines are often very limited in memory
bandwidth, and on our harpertown (dual Xeon E5430) machines, using only half
the cores is the fastest way to run LAMMPS at the scaling limit. for bio
systems, we get up to 2x the throughput with NAMD.
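
as a hedged illustration (the executable and input file names are placeholders,
and the right count depends on how your cores are spread over nodes), leaving
half the cores idle just means asking mpi for fewer ranks:

mpirun -np 36 lmp_mpi -in in.system   # e.g. half of the 72 cores tested above;
                                      # how ranks map onto nodes depends on the mpi launcher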

the final option is to do some serious programming and write a faster and, in
particular, better scaling k-space solver. that is a non-trivial task.

another idea that has been bounced around is to rewrite LAMMPS to support
hybrid parallelism, either GPU + MPI (which is underway) or threading + MPI
(which some people have tried and are still trying); that would reduce the
demand for communication in the kspace part. you should watch the _absolute_
time spent in kspace to get an idea of whether that could help.
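
lammps prints a per-category timing breakdown (Pair, Kspace, Neigh, Comm, ...)
after every run command, so a short benchmark is enough to read off the
absolute kspace wall time. a minimal sketch, assuming the data file, force
field, and pppm settings are already in place:

thermo  100
run     1000     # check the timing summary printed after this run;
                 # the kspace entry is the absolute time to watch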

FWIW: at fewer than 1000 atoms per cpu core you are actually doing quite ok.

HTH,
   axel.

In addition to Axel's comments, you can play with the parameters of the
kspace_modify command and with your short-range cutoff to shrink the amount
of work done in the long-range solve.
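
As a hedged illustration of the kspace_modify knob (the value below is only a
guess and accuracy should be re-checked after the change): a higher
interpolation order lets PPPM meet the same accuracy tolerance on a coarser
FFT grid, at the cost of a somewhat more expensive charge-spreading step.

kspace_style  pppm 1.0e-4
kspace_modify order 6     # higher-order interpolation than the current order 4,
                          # so PPPM can satisfy 1.0e-4 on a coarser grid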

Steve

Yajie,

You might consider using a longer cutoff, which would put more of the coulombic calculation in real space, shrink the number of grid points, and alleviate the kspace cost.
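
A hedged sketch of that change, assuming pair_style lj/cut/coul/long is in use;
the numbers are only illustrative, and the optional second cutoff applies to
the Coulomb part only, so the LJ cutoff can stay at 12 A:

pair_style   lj/cut/coul/long 12.0 14.0   # LJ cutoff 12 A, Coulomb cutoff 14 A
                                          # (illustrative); a longer real-space
                                          # cutoff lets PPPM use a coarser grid
kspace_style pppm 1.0e-4                  # same accuracy target as before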

Paul