on the LAMMPS parallelization strategy

Dear all,

I have a very general question regarding the parallelization strategy
used in LAMMPS, which currently relies on a spatial-decomposition
algorithm.

This approach is quite efficient for very large systems. However, when
one needs to simulate not-so-large systems for very long times, the
performance of the spatial-decomposition algorithm degrades very
quickly as the number of processors grows. To give an order of
magnitude: for the system sizes I usually consider, parallel efficiency
typically starts to degrade at as few as 8 processors.

My question is therefore: would it be possible to couple this
spatial-decomposition algorithm with a functional-decomposition
algorithm, applied e.g. within each domain of the spatial
decomposition? This would allow maximum flexibility in optimizing
parallel performance...

Best regards,
Laurent

> Dear all,
>
> I have a very general question regarding the parallelization strategy
> used in LAMMPS, which currently relies on a spatial-decomposition
> algorithm.
>
> This approach is quite efficient for very large systems. However, when
> one needs to simulate not-so-large systems for very long times, the
> performance of the spatial-decomposition algorithm degrades very
> quickly as the number of processors grows. To give an order of
> magnitude: for the system sizes I usually consider, parallel
> efficiency typically starts to degrade at as few as 8 processors.

have you tested hybrid MPI + OpenMP parallelization? GPU acceleration?
what kind of potentials are you using, have you checked for load
imbalances, and do you know where exactly the scaling bottleneck lies?
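
just as an illustration (assuming a binary built with the USER-OMP
package; the binary name, input file name, and rank/thread counts below
are only placeholders, and the exact package options differ between
LAMMPS versions), a first hybrid MPI + OpenMP test can be as simple as:

  # 2 MPI ranks x 4 OpenMP threads each on one 8-core node
  export OMP_NUM_THREADS=4
  mpirun -np 2 lmp_mpi -sf omp -in in.mysystem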

> My question is therefore: would it be possible to couple this
> spatial-decomposition algorithm with a functional-decomposition
> algorithm, applied e.g. within each domain of the spatial
> decomposition? This would allow maximum flexibility in optimizing
> parallel performance...

i don't think there is much to gain for most cases: certain steps have
to be done in order (you cannot rebuild neighbor lists and compute
nonbonded interactions at the same time, and the same goes for many
fixes that need to be executed in sequence to give consistent results);
the time spent in the different parts of the calculation that could run
concurrently is usually not balanced (so there is not much to gain);
and finally, if you split those tasks across multiple processors, you
also have to communicate the results between them.

that being said, you *can* already split off the kspace computation
from the rest using run_style verlet/split, and the OpenMP styles use a
particle decomposition instead of a domain decomposition, so for most
systems you should be able to get better strong scaling with a
combination of OpenMP and MPI.
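
as a minimal sketch (the 12/4 split, the binary name, and the input
file name are just placeholders; check the run_style and processors doc
pages for the constraints on the two partition sizes):

  # launch two partitions: 12 ranks for pair/bonded, 4 ranks for kspace
  mpirun -np 16 lmp_mpi -partition 12 4 -in in.mysystem

  # and in the input script:
  processors * * * part 1 2 multiple
  run_style  verlet/split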

axel.

Dear Axel,

Thanks for pointing me to run_style verlet/split; I was not aware of
this possibility and will give it a try. I tried MPI+OpenMP and GPU
computing a long time ago and was not quite convinced by the
performance, but I should definitely try again. Regarding load
imbalances, I have tried the recent fix balance, with some improvement.
The scaling bottleneck when increasing the number of processors is the
growing communication load due to the ridiculously small number of
atoms I am using (compared to current standards). In fact, I asked this
question mostly as a matter of principle, since I don't have that many
opportunities to run LAMMPS on a huge number of processors (I'm always
amazed to see people running their jobs on thousands of processors)...
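
For reference, the kind of dynamic rebalancing I mean looks like the
following (the fix ID, rebalancing interval, and thresholds are only
placeholders):

  # check every 1000 steps; if the imbalance factor exceeds 1.1,
  # shift domain boundaries along x, y and z (at most 10 iterations)
  fix lb all balance 1000 1.1 shift xyz 10 1.1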

Best regards,
Laurent

> Dear Axel,
>
> Thanks for pointing me to run_style verlet/split; I was not aware of
> this possibility and will give it a try. I tried MPI+OpenMP and GPU
> computing a long time ago and was not quite convinced by the
> performance, but I should definitely try again.

it may be helpful if you could provide a typical example for
benchmarking purposes, some timings, and a description of the hardware
you were running on. there are all kinds of little tricks that one can
use to squeeze out additional performance, and both the GPU and the
USER-OMP package have seen improvements over time.

> Regarding load imbalances, I have tried the recent fix balance, with
> some improvement. The scaling bottleneck when increasing the number
> of processors is the growing communication load due to the
> ridiculously small number of atoms I am using (compared to current
> standards).

how small is small?

> In fact, I asked this question mostly as a matter of principle, since
> I don't have that many opportunities to run LAMMPS on a huge number
> of processors (I'm always amazed to see people running their jobs on
> thousands of processors)...

there are plenty of large(r) scale resources available to researchers
who cannot find the required CPU time locally. most of the time, the
only cost is having to write a convincing proposal and the occasional
progress report, and most people who run on thousands of processors do
so on such external resources.

for rather small projects, it often helps to keep your ears to the
ground and ask around: i have often seen small- to medium-size
resources operated by individual groups that have a lot of unused
compute capacity. for the resources i have been managing, i have
regularly recruited "bottom feeder" users, i.e. people willing to have
their jobs run only when the machine would otherwise be idle. most of
the time those leftovers add up to quite a bit, so it is a win-win
scenario (a full machine and more happy customers). i also "found" a
lot of CPU time this way when i was a grad student and postdoc.

ciao,
    axel.