[lammps-users] lammps performance

Hello,

I obtained the following timing data for a 1.2 million atoms simulation
on a Blue Gene P machine

# procs   time (min)
128 76.95
256 39.21
512 20.17
1024 10.35
2048 5.41
3072 3.73
4096 2.88
5120 2.37
6144 3.21
7168 2.54
8192 2.82
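
Relative to the 128-proc run this works out to roughly 89% parallel
efficiency at 2048 procs and 81% at 5120, but only about 43% at 8192
(a ~27x speedup against an ideal 64x).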

I am wondering whether this is what I should be getting and if not what
improvements can be made. Below follows additional info for two cases.
Thanks in advance for inputs.

valmor,

Hello,

I obtained the following timing data for a 1.2 million atoms simulation
on a Blue Gene P machine

lammps performance depends on much more than
just the number of particles. depending on the details
of the system you are running and the settings that you
are using, there are quite a few adjustable parameters
that might make a difference.

also, with many multi-core CPUs these days, you may
be better off not using all processor cores for MPI-level
parallelization and/or using multi-level parallelism,
e.g. in the form of OpenMP + MPI. i have no practical
information on how this works on BG/P, but you can
find some explanation and a poster demonstrating
the performance boost on different machines, including
a Cray XT5 on this page: http://goo.gl/4fcq
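
just as a rough sketch (the exact launch syntax on BG/P is site-specific,
the binary and input file names below are only placeholders, and the /omp
style name assumes the threaded styles from that branch):

  # run fewer MPI tasks and give each task several OpenMP threads,
  # e.g. one MPI task per 4-core node instead of four:
  export OMP_NUM_THREADS=4
  mpirun -np 2048 ./lmp_bgp -in in.dodecane

  # and in the input, pick the threaded variant of the pair style:
  pair_style lj/cut/coul/long/omp 15.0 15.0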

# procs   time (min)
128 76.95
256 39.21
512 20.17
1024 10.35
2048 5.41
3072 3.73
4096 2.88
5120 2.37
6144 3.21
7168 2.54
8192 2.82

I am wondering whether this is what I should be getting and if not what
improvements can be made. Below follows additional info for two cases.

without knowing the exact system, it is impossible to comment on this.
lammps doesn't do load balancing, and thus is dependent on having
a roughly uniform particle density across the whole simulation system.

also, you have to keep in mind that any operation requiring a collective
or all-to-all communication becomes increasingly expensive as you go to
larger processor counts, and thus limits scaling. finally, there is an
intrinsic degradation of parallel efficiency due to serial overhead,
depending on how much i/o and other non-parallel work you have
in your input.
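
just to put a rough number on the last point: with a serial fraction s,
the best possible speedup on N processors is 1/(s + (1-s)/N), so even
s = 0.1% already caps you at around 900x on 8192 procs, no matter how
well the rest scales.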

Thanks in advance for inputs.

--
Valmor

Loop time of 169.575 on 8192 procs for 1000 steps with 1216000 atoms

Pair  time (%) = 56.5647 (33.3568)
Bond  time (%) = 1.86976 (1.10262)
Kspce time (%) = 97.8776 (57.7195)

ouch!! as you can see, here the cost of
doing the 3d FFTs for PPPM is dominating.
at this point, you cannot expect much improvement
unless you use multi-level parallelism.
what you could do is crank up the coulomb
cutoff in real space (only) and thus reduce the
work in k-space. but actually, using 4096 processors
for MPI and then tacking on additional
parallelization through threading in the non-bonded
calculation is the most promising approach.
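
for illustration only (the actual value would need testing for your system),
keeping the lj cutoff and raising just the coulomb cutoff would look like:

  pair_style lj/cut/coul/long 15.0 18.0

the second number is the coulomb cutoff; making it larger shifts work
from kspace back into the pair part, which parallelizes much better.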

the lammps-icms branch should allow you to do
exactly that, i.e. combine MPI with OpenMP threading.
i've seen up to a 4x speedup because of this.
using single-precision FFTs can also help.
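
(single precision FFTs are a compile-time choice; assuming the standard
makefile layout, that would mean adding -DFFT_SINGLE to the FFT_INC
setting in your src/MAKE/Makefile.<machine> and recompiling.)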

cheers,
     axel.

sorry,
minor correction. the home page for my lammps branch
is at: http://goo.gl/oKYI

axel.

valmor,

Hello,

I obtained the following timing data for a 1.2 million atoms simulation
on a Blue Gene P machine

lammps performance depends on much more than
just the number of particles. depending on the details
of the system you are running and the settings that you
are using, there are quite a few adjustable parameters
that might make a difference.

Indeed; I should have sent more info. This system is dodecane:

some relevant data:

pair_style lj/cut/coul/long 15.0 15.0
kspace_style pppm 1.0e-4
bond_style harmonic
angle_style harmonic
dihedral_style opls
improper_style none
special_bonds lj/coul 0.0 0.0 0.5

neighbor 2.0 bin
neigh_modify delay 5

NPT simulation

also, with many multi-core CPUs these days, you may
be better off not using all processor cores for MPI-level
parallelization and/or using multi-level parallelism,
e.g. in the form of OpenMP + MPI. i have no practical
information on how this works on BG/P, but you can
find some explanation and a poster demonstrating
the performance boost on different machines, including
a Cray XT5 on this page: http://goo.gl/4fcq

# procs   time (min)
128 76.95
256 39.21
512 20.17
1024 10.35
2048 5.41
3072 3.73
4096 2.88
5120 2.37
6144 3.21
7168 2.54
8192 2.82

I am wondering whether this is what I should be getting and if not what
improvements can be made. Below follows additional info for two cases.

without knowing the exact system, it is impossible to comment on this.
lammps doesn't do load balancing, and thus is dependent on having
a roughly uniform particle density across the whole simulation system.

also, you have to keep in mind that any operation requiring a collective
or all-to-all communication becomes increasingly expensive as you go to
larger processor counts, and thus limits scaling. finally, there is an
intrinsic degradation of parallel efficiency due to serial overhead,
depending on how much i/o and other non-parallel work you have
in your input.

No I/O for these test runs.

Thanks in advance for inputs.

--
Valmor

Loop time of 169.575 on 8192 procs for 1000 steps with 1216000 atoms

Pair  time (%) = 56.5647 (33.3568)
Bond  time (%) = 1.86976 (1.10262)
Kspce time (%) = 97.8776 (57.7195)

ouch!! as you can see, here the cost of
doing the 3d FFTs for PPPM is dominating.

Yes. In this system, when the average number of atoms per process drops
to less than 200, there is a big performance penalty. This coincides
with the point where the Kspce timing is greater than the Pair time.

at this point, you cannot expect much improvement
unless you use multi-level parallelism.
what you could do is crank up the coulomb
cutoff in real space (only) and thus reduce the
work in k-space. but actually, using 4096 processors
for MPI and then tacking on additional
parallelization through threading in the non-bonded
calculation is the most promising approach.

the lammps-icms branch should allow you to do
exactly that, i.e. combine MPI with OpenMP threading.
i've seen up to a 4x speedup because of this.
using single-precision FFTs can also help.

I've been using your branch but have not explored OpenMP + MPI yet. Will
look into it as soon as I can.

Thanks,

valmor,

Hello,

I obtained the following timing data for a 1.2 million atoms simulation
on a Blue Gene P machine

lammps performance depends on much more than
just the number of particles. depending on the details
of the system you are running and the settings that you
are using, there are quite a few adjustable parameters
that might make a difference.

Indeed; I should have sent more info. This system is dodecane:

are all atoms charged in this system?
if only some of them are, you may see some
improvement by using

kspace_style pppm/cg

this will bypass uncharged atoms.
if your atoms have no charges,
then you should not use coul/long
and kspace at all.
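
the basic syntax is the same as for regular pppm, e.g.:

  kspace_style pppm/cg 1.0e-4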

some relevant data:

pair_style lj/cut/coul/long 15.0 15.0
kspace_style pppm 1.0e-4

this is already cranking up the coulomb/lj cutoffs
to a fairly large value (10-12 is more common).
but then again, the larger coulomb cutoff will
reduce the cost of the kspace part. so at the
limit of scaling, this is better, since the non-bonded
calculation parallelizes much better.

bond_style harmonic
angle_style harmonic
dihedral_style opls
improper_style none
special_bonds lj/coul 0.0 0.0 0.5

neighbor 2.0 bin
neigh_modify delay 5

[...]

Loop time of 169.575 on 8192 procs for 1000 steps with 1216000 atoms

Pair  time (%) = 56.5647 (33.3568)
Bond  time (%) = 1.86976 (1.10262)
Kspce time (%) = 97.8776 (57.7195)

ouch!! as you can see, here the cost of
doing the 3d FFTs for PPPM is dominating.

Yes. In this system, when the average number of atoms per process drops
to less than 200, there is a big performance penalty. This coincides
with the point where the Kspce timing is greater than the Pair time.

200 atoms per processor is already very good.
you only get that far on a bluegene because the
processors are so much slower.

in any case, as soon as the kspace time starts to
increase, you are basically done for. single-precision
FFTs will cut the amount of data that needs to be
moved around in half, so that can give you a
small improvement. you can also try to balance
the non-bonded part against kspace by increasing
the coulomb cutoff (only). as a rule of thumb:
the Pair and Kspace times are about the
same magnitude when you reach optimal performance.

the only significant improvement that you can get
at that point is to keep the number of MPI tasks constant and
use threading/OpenMP on top of that to squeeze
some extra performance out of the non-bonded and,
perhaps, the dihedral calculations.

cheers,
     axel.

[snip]

are all atoms charged in this system?
if only some of them are, you may see some
improvement by using

Yes; all charged.

kspace_style pppm/cg

this will bypass uncharged atoms.
if your atoms have no charges,
then you should not use coul/long
and kspace at all.

some relevant data:

pair_style lj/cut/coul/long 15.0 15.0
kspace_style pppm 1.0e-4

this is already cranking up the coulomb/lj cutoffs
to a fairly large value (10-12 is more common).

I ran with a cutoff of 10 and the results were not good.

but then again, the larger coulomb cutoff will
reduce the cost of the kspace part. so at the
limit of scaling, this is better, since the non-bonded
calculation parallelizes much better.

[snip]

Yes. In this system, when the average number of atoms per process drops
to less than 200, there is a big performance penalty. This coincides
with the point where the Kspce timing is greater than the Pair time.

200 atoms per processor is already very good.
you only get that far on a bluegene because the
processors are so much slower.

in any case, as soon as the kspace time starts to
increase, you are basically done for. single-precision
FFTs will cut the amount of data that needs to be
moved around in half, so that can give you a
small improvement. you can also try to balance
the non-bonded part against kspace by increasing
the coulomb cutoff (only). as a rule of thumb:

Will give this a try.

the Pair and Kspace times are about the
same magnitude when you reach optimal performance.

the only significant improvement that you can get
at that point is to keep the number of MPI tasks constant and
use threading/OpenMP on top of that to squeeze
some extra performance out of the non-bonded and,
perhaps, the dihedral calculations.

Thanks for the insight.

If you are doing PPPM on a system with 150 atoms/proc
and (likely) smallish FFTs on 8192 procs, then I would
say those numbers are not atypical. That's a small
system for that many procs.

Steve