valmor,
> Hello,
> I obtained the following timing data for a 1.2 million atom simulation
> on a Blue Gene P machine.
lammps performance depends on much more than
just the number of particles. depending on the details
of the system you are running and the settings that you
are using, there are quite a few adjustable parameters
that might make a difference.
also, with many multi-core CPUs these days, you may
be better off not using all processor cores for MPI-level
parallelization and/or using multi-level parallelism,
e.g. in the form of OpenMP + MPI. i have no practical
information on how this works on BG/P, but you can
find some explanation and a poster demonstrating
the performance boost on different machines, including
a Cray XT5, on this page: http://goo.gl/4fcq
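for illustration, a minimal sketch of what such a hybrid run could look
like with the threaded "/omp" styles (the launcher invocation, binary
name, and rank/thread counts below are placeholders, and the exact
package/suffix syntax may differ between the icms branch and later
releases):

  # shell: use fewer MPI ranks and let OpenMP threads fill the cores,
  # e.g. 1024 MPI tasks x 4 threads instead of 4096 pure-MPI tasks
  export OMP_NUM_THREADS=4
  mpirun -np 1024 lmp_bgp -in in.system

  # LAMMPS input: select the OpenMP variants of the installed styles
  package omp 4
  suffix  omp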
>  # procs   time (min)
>      128        76.95
>      256        39.21
>      512        20.17
>     1024        10.35
>     2048         5.41
>     3072         3.73
>     4096         2.88
>     5120         2.37
>     6144         3.21
>     7168         2.54
>     8192         2.82
> I am wondering whether this is what I should be getting and, if not, what
> improvements can be made. Below follows additional info for two cases.
without knowing the exact system, it is impossible to comment on this.
lammps doesn't do load balancing, and thus depends on having
a roughly uniform particle density across the whole system.
also, you have to keep in mind that with an increasing number of
processors, any operation that requires a collective or all-to-all
communication becomes increasingly expensive and thus limits scaling.
finally, there is an intrinsic degradation of parallel efficiency due to
serial overhead, depending on how much i/o and other non-parallel
operations you have in your input.
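as a rough yardstick from the numbers you posted (128 procs as baseline):
at 4096 procs the speedup is 76.95/2.88 ≈ 26.7 against an ideal 32, i.e.
about 84% parallel efficiency, while at 8192 procs it is 76.95/2.82 ≈ 27.3
against an ideal 64, i.e. about 43%. so beyond roughly 4096-5120 procs the
additional cores are largely eaten up by communication and serial overhead
(note that 6144 procs is actually slower than 5120).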
> Thanks in advance for inputs.
> --
> Valmor
> Loop time of 169.575 on 8192 procs for 1000 steps with 1216000 atoms
> Pair  time (%) = 56.5647 (33.3568)
> Bond  time (%) = 1.86976 (1.10262)
> Kspce time (%) = 97.8776 (57.7195)
ouch!! as you can see, here the cost of
doing the 3d FFTs for PPPM is dominating.
at this point, you cannot expect much improvement
unless you use multi-level parallelism.
what you could do is to increase the coulomb
cutoff in real space (only) and thus reduce the
work in k-space. but actually using 4096 processors
for MPI and then tacking on additional
parallelization through threading in the non-bonded
calculation is the most promising approach.
the lammps-icms branch should allow you to do
just that. i've seen up to 4x speedup from this.
using single-precision FFTs can also help.
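for illustration, the real-space/k-space trade-off is set in the input
along these lines (pair style, cutoffs, and accuracy are placeholders to
be tuned for your force field; single-precision FFTs are a compile-time
choice, e.g. -DFFT_SINGLE in the machine makefile):

  # a larger coulomb cutoff moves work into real space and allows a
  # coarser, cheaper PPPM mesh at the same accuracy
  pair_style    lj/cut/coul/long 10.0 14.0
  kspace_style  pppm 1.0e-4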
cheers,
axel.