Accelerate the simulation - kspace as the bottleneck

Hi,

Here is my input:

units metal
atom_style full

boundary p p f # slab in z direction

bond_style hybrid morse harmonic
angle_style harmonic

read_data final-nvt.data

pair_style hybrid buck/coul/long 12.66 lj/cut/coul/long 9.8 12.66

pair_coeff 1 * buck/coul/long 0.0 1.0 0.0 # ti
pair_coeff 2 * buck/coul/long 0.0 1.0 0.0 # ts
pair_coeff 3 * buck/coul/long 0.0 1.0 0.0 # o
pair_coeff 4 * buck/coul/long 0.0 1.0 0.0 # ot
pair_coeff 5 * buck/coul/long 0.0 1.0 0.0 # ht
pair_coeff 6 * buck/coul/long 0.0 1.0 0.0 # ob
pair_coeff 7 * buck/coul/long 0.0 1.0 0.0 # hb
pair_coeff 8 * buck/coul/long 0.0 1.0 0.0 # os

pair_coeff 9 * buck/coul/long 0.0 1.0 0.0 # ow
pair_coeff 10 * buck/coul/long 0.0 1.0 0.0 # hw

pair_coeff 1 1 buck/coul/long 31119.34421 0.154 5.249854339 # ti-ti
pair_coeff 1 2 buck/coul/long 31119.34421 0.154 5.249854339 # ti-ts
pair_coeff 2 2 buck/coul/long 31119.34421 0.154 5.249854339 # ts-ts

pair_coeff 1 3 buck/coul/long 16957.06212 0.194 12.58965351 # ti-o
pair_coeff 1 6 buck/coul/long 16957.06212 0.194 12.58965351 # ti-ob
pair_coeff 1 8 buck/coul/long 16957.06212 0.194 12.58965351 # ti-os

pair_coeff 2 3 buck/coul/long 16957.06212 0.194 12.58965351 # ts-o
pair_coeff 2 6 buck/coul/long 16957.06212 0.194 12.58965351 # ts-ob
pair_coeff 2 8 buck/coul/long 16957.06212 0.194 12.58965351 # ts-os

pair_coeff 1 4 buck/coul/long 13680.19393 0.194 12.58965351 # ti-ot
pair_coeff 2 4 buck/coul/long 13680.19393 0.194 12.58965351 # ts-ot

pair_coeff 3 3 buck/coul/long 11782.43392 0.234 30.21916735 # o-o
pair_coeff 3 4 buck/coul/long 11782.43392 0.234 30.21916735 # o-ot
pair_coeff 3 6 buck/coul/long 11782.43392 0.234 30.21916735 # o-ob
pair_coeff 3 8 buck/coul/long 11782.43392 0.234 30.21916735 # o-os

pair_coeff 4 4 buck/coul/long 11782.43392 0.234 30.21916735 # ot-ot
pair_coeff 4 6 buck/coul/long 11782.43392 0.234 30.21916735 # ot-ob
pair_coeff 4 8 buck/coul/long 11782.43392 0.234 30.21916735 # ot-os

pair_coeff 6 6 buck/coul/long 11782.43392 0.234 30.21916735 # ob-ob
pair_coeff 6 8 buck/coul/long 11782.43392 0.234 30.21916735 # ob-os

pair_coeff 8 8 buck/coul/long 11782.43392 0.234 30.21916735 # os-os

pair_coeff 1 9 buck/coul/long 1239.879126 0.265 6.417724 # ti-ow
pair_coeff 2 9 buck/coul/long 1239.879126 0.265 6.417724 # ts-ow

pair_coeff 9 9 lj/cut/coul/long 0.00673835 3.166 # ow-ow
pair_coeff 2 9 lj/cut/coul/long 0.00673835 3.166 # o-ow
pair_coeff 4 9 lj/cut/coul/long 0.00673835 3.166 # ot-ow
pair_coeff 6 9 lj/cut/coul/long 0.00673835 3.166 # ob-ow

kspace_style pppm 1e-6
kspace_modify slab 3.0

special_bonds coul 0.0 0.0 1.0

group fixed type 1 3
group free subtract all fixed

neighbor 2.0 bin
neigh_modify delay 0 every 1 check yes exclude group fixed fixed

timestep 0.0007

thermo_style custom step time temp press pe ke etotal enthalpy evdwl ecoul lx ly lz
thermo 50
dump coordinates all xyz 50 traj.xyz

comm_modify cutoff 20.0

fix 1_shake free shake 0.0001 20 0 b 1 2 a 1 2
fix wall all wall/reflect zhi EDGE

#velocity all create 50.0 1281937

fix fix_nvt free nvt temp 50.0 100.0 100.0
run 5000

Please find the data file attached. Thanks for your help.

Sincerely,
Azade

final-nvt.data (1.23 MB)

Hi,

Here is my input:

the next time, please compress your attached text files with gzip, so they
are much smaller and thus less of a bother.

onward to the problem at hand. first off, there is a small issue with your
data file: you have atoms outside the box, so your input only works by
accident.
also, your simulation box has too much vacuum in z-direction. that vacuum is
not needed on input, since it will be added automatically by kspace_modify
slab 3.0. because of that, you end up with *much* more vacuum in your box
than necessary, and since you are using a kspace solver, you still have to do
work in that vacuum. luckily, LAMMPS can update this for you with a few small
script commands:

units metal
atom_style full
boundary p p p

bond_style hybrid morse harmonic
angle_style harmonic

read_data final-nvt.data

# reset image flags in z-direction and shrink box
set group all image NULL NULL 0
change_box all z final 10 80

write_data updated.data

but this is only the start. please find below added/modified commands and
my comments.

units metal

atom_style full

processors * * 1

this changes how LAMMPS subdivides the system. it usually does this by
volume, but in z-direction, that is a bad idea (due to the vacuum for the
slab), so we enforce that there is only a domain decomposition in x and y.
this helps for running in parallel.
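a rough sketch of an optional extra tweak (a suggestion only, not something benchmarked on this system): if the xy subdomains end up with very uneven atom counts, the balance command can adjust them by atom count instead of volume. note that balance has to come after read_data, once the box exists.

processors * * 1                 # decompose in x and y only, never in z
# ... read_data, pair settings, etc. ...
balance 1.1 shift xy 10 1.05     # rebalance if imbalance > 1.1; up to 10 iterations, stop once below 1.05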

boundary p p f # slab in z direction

bond_style hybrid morse harmonic
angle_style harmonic

read_data final-nvt.data

read_data updated.data

this reads in the updated box with the corrected image flags and the
reduced dimensions in z direction.

pair_style hybrid buck/coul/long 12.66 lj/cut/coul/long 9.8 12.66

pair_coeff 1 * buck/coul/long 0.0 1.0 0.0 # ti
pair_coeff 2 * buck/coul/long 0.0 1.0 0.0 # ts
pair_coeff 3 * buck/coul/long 0.0 1.0 0.0 # o
pair_coeff 4 * buck/coul/long 0.0 1.0 0.0 # ot
pair_coeff 5 * buck/coul/long 0.0 1.0 0.0 # ht
pair_coeff 6 * buck/coul/long 0.0 1.0 0.0 # ob
pair_coeff 7 * buck/coul/long 0.0 1.0 0.0 # hb
pair_coeff 8 * buck/coul/long 0.0 1.0 0.0 # os

pair_coeff 9 * buck/coul/long 0.0 1.0 0.0 # ow
pair_coeff 10 * buck/coul/long 0.0 1.0 0.0 # hw

pair_coeff 1 1 buck/coul/long 31119.34421 0.154 5.249854339 # ti-ti
pair_coeff 1 2 buck/coul/long 31119.34421 0.154 5.249854339 # ti-ts
pair_coeff 2 2 buck/coul/long 31119.34421 0.154 5.249854339 # ts-ts

pair_coeff 1 3 buck/coul/long 16957.06212 0.194 12.58965351 # ti-o
pair_coeff 1 6 buck/coul/long 16957.06212 0.194 12.58965351 # ti-ob
pair_coeff 1 8 buck/coul/long 16957.06212 0.194 12.58965351 # ti-os

pair_coeff 2 3 buck/coul/long 16957.06212 0.194 12.58965351 # ts-o
pair_coeff 2 6 buck/coul/long 16957.06212 0.194 12.58965351 # ts-ob
pair_coeff 2 8 buck/coul/long 16957.06212 0.194 12.58965351 # ts-os

pair_coeff 1 4 buck/coul/long 13680.19393 0.194 12.58965351 # ti-ot
pair_coeff 2 4 buck/coul/long 13680.19393 0.194 12.58965351 # ts-ot

pair_coeff 3 3 buck/coul/long 11782.43392 0.234 30.21916735 # o-o
pair_coeff 3 4 buck/coul/long 11782.43392 0.234 30.21916735 # o-ot
pair_coeff 3 6 buck/coul/long 11782.43392 0.234 30.21916735 # o-ob
pair_coeff 3 8 buck/coul/long 11782.43392 0.234 30.21916735 # o-os

pair_coeff 4 4 buck/coul/long 11782.43392 0.234 30.21916735 # ot-ot
pair_coeff 4 6 buck/coul/long 11782.43392 0.234 30.21916735 # ot-ob
pair_coeff 4 8 buck/coul/long 11782.43392 0.234 30.21916735 # ot-os

pair_coeff 6 6 buck/coul/long 11782.43392 0.234 30.21916735 # ob-ob
pair_coeff 6 8 buck/coul/long 11782.43392 0.234 30.21916735 # ob-os

pair_coeff 8 8 buck/coul/long 11782.43392 0.234 30.21916735 # os-os

pair_coeff 1 9 buck/coul/long 1239.879126 0.265 6.417724 # ti-ow
pair_coeff 2 9 buck/coul/long 1239.879126 0.265 6.417724 # ts-ow

pair_coeff 9 9 lj/cut/coul/long 0.00673835 3.166 # ow-ow
pair_coeff 2 9 lj/cut/coul/long 0.00673835 3.166 # o-ow
pair_coeff 4 9 lj/cut/coul/long 0.00673835 3.166 # ot-ow
pair_coeff 6 9 lj/cut/coul/long 0.00673835 3.166 # ob-ow

kspace_style pppm 1e-6
kspace_modify slab 3.0

kspace_modify order 7

the kspace order parameter tweaks the PPPM algorithm. a higher order means
more computation per particle but a coarser grid with fewer gridpoints; a
smaller order means less work per atom, but a larger grid. allowed values
are between 2 and 7. for a small system with lots of vacuum running on few
processors, a higher order is faster; for a large, dense system running on
many processors, a smaller order is faster. so we go with 7 here. you can
experiment on your machine and find out empirically which setting is fastest
for you.
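a minimal sketch of how such an experiment could be scripted from a single input (the loop bounds and label name are placeholders, and this has not been benchmarked here): loop over a few orders and compare the loop/kspace times printed for each short run.

variable ord loop 5 7             # try PPPM orders 5, 6, 7
label ordloop
kspace_modify order ${ord}
run 500                           # compare the Loop time / kspace timing of each run in the log
next ord
jump SELF ordloop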

special_bonds coul 0.0 0.0 1.0

group fixed type 1 3
group free subtract all fixed

neighbor 2.0 bin
neigh_modify delay 0 every 1 check yes exclude group fixed fixed

timestep 0.0007

thermo_style custom step time temp press pe ke etotal enthalpy evdwl ecoul lx ly lz
thermo 50
dump coordinates all xyz 50 traj.xyz

comm_modify cutoff 20.0

fix 1_shake free shake 0.0001 20 0 b 1 2 a 1 2
fix wall all wall/reflect zhi EDGE



fix wall all wall/reflect zhi 80.0

this has nothing to do with performance, but with how the slab configuration
works in LAMMPS. since kspace_modify slab changes your box boundaries
internally, you must not use the internal value (EDGE), but the explicit
original limit, i.e. 80.0 (after the box shrink).
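to make that concrete (a back-of-the-envelope reading of the kspace_modify slab docs, not a measured number): with the shrunk box running from z = 10 to z = 80, slab 3.0 lets PPPM work with an effective z length of about 3.0 * (80 - 10) = 210, while the atoms still stay below z = 80, so the wall needs that explicit value.

fix wall all wall/reflect zhi 80.0   # real top of the cell; EDGE would follow the internally enlarged boundary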

#velocity all create 50.0 1281937

fix fix_nvt free nvt temp 50.0 100.0 100.0
run 5000

Please find the data file attached. Thanks for your help.

for faster testing, i've reduced the number of steps to 500, and with those
i get on my test machine with the original input:

Loop time of 166.931 on 4 procs for 500 steps with 6575 atoms

Performance: 0.181 ns/day, 132.485 hours/ns, 2.995 timesteps/s
71.3% CPU use with 4 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total

Hi,

Thanks a lot Axel for all your comments and modifications. To sum up, changing ewald to pppm and implementing your changes accelerated the simulation by about 12x, which is very good.
However, I am facing the ‘Out of range atoms - cannot compute PPPM’ error. I will keep reading about it and will start another thread for it.

Sincerely,
Azade

A five-star answer …

nice job Axel,
Steve

You can also use “kspace_modify fftbench yes” to get more kspace timings for pppm, see http://lammps.sandia.gov/doc/kspace_modify.html. If the FFTs are taking a significant portion of the kspace time, then “run_style verlet/split” can help too, see http://lammps.sandia.gov/doc/run_style.html.
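A minimal sketch of what the verlet/split route could look like, assuming 8 MPI ranks split 6 + 2 (the split, input file name, and executable name are placeholders; the first partition size must be a multiple of the second):

# launch with two partitions, e.g.: mpirun -np 8 lmp -in in.slab -partition 6 2
kspace_modify fftbench yes    # report extra FFT timing details for PPPM
run_style verlet/split        # the 2-rank partition computes only the kspace contribution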