LAMMPS parallel optimization techniques

dear alan,

Dear LAMMPS Developers,

I am writing to ask about newer parallel programming paradigms
for scaling to a large number of processes. Some examples are:
reducing the use of implicit synchronization barriers inherent in
two-sided communication, taking advantage of the lower latency between
processes on the same node, and dynamic scheduling using
light-weight threads. Here at the Engineer Research and
Development Center (ERDC) some colleagues and I at the
Information Technology Laboratory (ITL) intend to develop
prototypes (archetypes) of commonly used algorithms in order to
provide examples of the use of parallel programming techniques
that must become commonplace for higher levels of scaling on
multicore nodes.

I’ve been told that LAMMPS is a good example of an efficient
parallel program and that I can learn from the developers what
the state of the art is for an actual application with regard
to the “modern” programming techniques described above.

it is not so much “techniques”, but the fact that LAMMPS uses
a parallelization strategy that is adapted to its dominant use
(simulation of dense atomic systems) that makes it scale as far
as it does based on domain decomposition. in terms of “technique”
i would say it is more “conservative” and “simple” than “modern”.
you would have to look at NAMD, which is based on charm++
from sanjay kale's group, for a more modern approach
(i.e. only 10 years old).

I’ve looked at the list of papers in http://lammps.sandia.gov/papers.html
and have read some abstracts and downloaded some papers from
that list which describe parallel programming aspects, but I’ve
not found a good description of techniques used in LAMMPS.
I see that in the LAMMPS code the function MPI_Irecv() (which
begins a nonblocking receive and returns a request handle) is used
in many places. Are there any other optimization techniques that
are not apparent? Is there a write-up about parallel optimization
techniques in LAMMPS?
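For reference, the kind of pattern I mean looks roughly like the sketch
below (a made-up halo exchange with hypothetical names, not code taken
from the LAMMPS source): the receives are posted first so data can arrive
while the sends proceed, and the request handles are only waited on when
the ghost data is actually needed.

#include <mpi.h>
#include <vector>

// sketch of a nonblocking ghost/halo exchange with a left and right
// neighbor; buffer names and sizes are placeholders, not from LAMMPS
void exchange_ghosts(MPI_Comm comm, int left, int right,
                     const std::vector<double> &send_left,
                     const std::vector<double> &send_right,
                     std::vector<double> &recv_left,
                     std::vector<double> &recv_right)
{
  MPI_Request req[2];

  // post the receives first, so incoming data can land as soon as it arrives
  MPI_Irecv(recv_left.data(), static_cast<int>(recv_left.size()), MPI_DOUBLE,
            left, 0, comm, &req[0]);
  MPI_Irecv(recv_right.data(), static_cast<int>(recv_right.size()), MPI_DOUBLE,
            right, 1, comm, &req[1]);

  // send to the neighbors (tag 0 goes to the right, tag 1 to the left,
  // matching the receives posted on the neighboring ranks)
  MPI_Send(send_right.data(), static_cast<int>(send_right.size()), MPI_DOUBLE,
           right, 0, comm);
  MPI_Send(send_left.data(), static_cast<int>(send_left.size()), MPI_DOUBLE,
           left, 1, comm);

  // block only when the ghost data is actually needed
  MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
}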

well, there are the GPU and USER-CUDA packages that add
GPU acceleration to the code (and there are papers related
to them), and steve just recently added a package with multi-threaded
variants of the most computationally intensive modules. those will
soon be replaced by a much more efficient second generation
that will also multi-thread additional parts of the code.
then there is an attempt to add load balancing by adjusting
the sizes of the (previously equally sized) subdomains, which was
started by christoph kloss in his LIGGGHTS variant of the LAMMPS
code that adds features for discrete element modelling and a
coupling to continuum models. in those cases, the standard
assumption in LAMMPS of homogeneous particle density
does not hold and thus leads to load imbalances. adding multi-threading
and a load-balancer scheme should help to reduce those issues.
check out the last lines of this post, for example.

http://sites.google.com/site/akohlmey/news-and-announcements/improvedairebopairstyleforlammpsreleased
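to give a rough idea of what that balancing scheme does, here is a toy
1D sketch (hypothetical names, not the actual LIGGGHTS/LAMMPS code):
count the particles on either side of a cut plane and shift the cut
toward the overloaded side, with some damping so it does not oscillate.

#include <algorithm>

// toy 1D rebalancing step: move the cut between two subdomains toward the
// side that owns more particles; lo/hi are the box bounds along this axis
double rebalance_cut(double cut, double lo, double hi,
                     long n_left, long n_right, double damping = 0.5)
{
  const long total = n_left + n_right;
  if (total == 0) return cut;

  // positive imbalance means the left subdomain owns too many particles
  const double imbalance =
      (static_cast<double>(n_left) - static_cast<double>(n_right)) / total;

  // shift the cut toward the overloaded side by a damped fraction of the
  // box length, then clamp so neither subdomain collapses to nothing
  const double new_cut = cut - damping * imbalance * 0.5 * (hi - lo);
  const double margin = 0.05 * (hi - lo);
  return std::min(std::max(new_cut, lo + margin), hi - margin);
}

in the real code this of course has to be done per dimension, and the
adjusted subdomain boundaries have to be communicated consistently to
the neighboring processors.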

Most of the papers about parallelization techniques in the
LAMMPS list of papers are from the earlier years, 2001-2002.
As hardware changes, has it become evident that additional
parallelization techniques need to be employed for better scaling?

yes. hybrid multi-threading + MPI is needed to scale further
and to get better parallel efficiency. you can see a discussion of
the issues and results (using the “older”, less efficient
multi-threading code) here:
http://sites.google.com/site/akohlmey/software/lammps-icms/lammps-icms-tms2011-talk.pdf?attredirects=0&d=1
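the general shape of such a hybrid scheme is easy to sketch (this is only
a bare-bones illustration with placeholder names, not the actual USER-OMP
code): each MPI rank owns a spatial subdomain, OpenMP threads split the
loops over the atoms local to that rank, and only the master thread calls MPI.

#include <mpi.h>
#include <vector>

int main(int argc, char **argv)
{
  int provided;
  // funneled thread support: only the master thread will make MPI calls
  MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

  const int nlocal = 100000;            // atoms owned by this rank (placeholder)
  std::vector<double> f(nlocal, 0.0);   // per-atom "force" accumulator

  // thread-parallel loop over the atoms local to this MPI rank
  #pragma omp parallel for schedule(static)
  for (int i = 0; i < nlocal; ++i)
    f[i] += 1.0;                        // stand-in for a real pair/force kernel

  // only the master thread communicates between ranks, e.g. a global reduction
  double local_sum = 0.0, global_sum = 0.0;
  for (int i = 0; i < nlocal; ++i) local_sum += f[i];
  MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  MPI_Finalize();
  return 0;
}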

my outlook on the future is that we’ll see some kind of merge
between multi-threading and GPU support, as the two are becoming
more similar to each other, and, if OpenCL can become as effective
as CUDA, perhaps even using just one single programming paradigm.

the prevailing trend in LAMMPS, however, is the inclusion of
more complex models, which are more computationally intensive,
and hence the parallelization itself is not necessarily a big issue.

the biggest obstacle at the moment seems to be treatment
of long-range electrostatics (“the curse of the k-space”).
all existing optimizations there are focused on reducing
its impact (using single precision data, or hybrid OpenMP+MPI)
or on avoiding the calls altogether, but that is just postponing
the problem, not solving it.

HTH,
axel.

i forgot to mention that a lot of the improvements to LAMMPS
over the last few years are due to optimizations of
compute-intensive tasks, either by using more efficient
algorithms/strategies or by changing things so that
cache/register utilization is better.

cheers,
axel.

Dear Dr. Axel Kohlmeyer,
Thank you for the detailed description concerning parallelization
in LAMMPS. The information is very helpful. Even your comments
not related to “modern” parallel techniques have been helpful as
“food for thought” about what we should look at. For example, not
long ago, for finding nearest neighbors by putting particles in buckets,
I found an improvement by ordering the particles for which the nearest
neighbors were sought according to the buckets into which they were
placed, so that after a given search for nearest neighbors, the next
search might find the information already in cache. My point is that your last comment about cache,

the most extreme application of that principle is in the HOOMD-blue
MD code by josh anderson (http://codeblue.umich.edu/hoomd-blue/),
which sorts particles along a space-filling hilbert curve.

LAMMPS has a simpler type of sort for the same reason, too.
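a minimal version of the reordering idea looks like this (a 1D sketch
with made-up names, not the LAMMPS or HOOMD-blue implementation): compute
which cell each particle falls into and permute the coordinate array so
particles in the same cell sit next to each other in memory. in 3D you
would use a row-major or hilbert-curve cell index instead of the plain
1D one.

#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

// reorder particles so that members of the same cell are contiguous in
// memory, which improves cache reuse in subsequent neighbor searches
void sort_by_cell(std::vector<double> &x, double box_lo, double cell_size)
{
  const int n = static_cast<int>(x.size());

  // permutation that orders particle indices by their cell index
  std::vector<int> order(n);
  std::iota(order.begin(), order.end(), 0);
  std::sort(order.begin(), order.end(), [&](int a, int b) {
    const int ca = static_cast<int>(std::floor((x[a] - box_lo) / cell_size));
    const int cb = static_cast<int>(std::floor((x[b] - box_lo) / cell_size));
    return ca < cb;
  });

  // apply the permutation; in a real code the velocities, types, etc.
  // have to be permuted in the same way
  std::vector<double> sorted(n);
  for (int i = 0; i < n; ++i) sorted[i] = x[order[i]];
  x.swap(sorted);
}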

and register usage, though not directly “parallelization”, is important
for us to remember.

yes. and particularly the “false sharing” issues when using
multi-threading. identifying and resolving many of those is
the main reason for the improved parallel efficiency of my
second-generation multi-threaded code.
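a stripped-down illustration of the problem (hypothetical code, not the
actual USER-OMP implementation): if every thread writes its partial result
into adjacent elements of a shared array, those elements share a cache line
and the line keeps bouncing between cores. accumulating into a thread-private
local and doing one synchronized update per thread avoids that.

#include <vector>

// thread-parallel sum that avoids false sharing: each thread accumulates
// into a private local instead of writing repeatedly to a shared array slot
double sum_energy(const std::vector<double> &per_atom_energy)
{
  double total = 0.0;
  const long n = static_cast<long>(per_atom_energy.size());

  #pragma omp parallel
  {
    double local = 0.0;   // private to this thread, no shared cache line

    #pragma omp for schedule(static)
    for (long i = 0; i < n; ++i)
      local += per_atom_energy[i];

    // one synchronized update per thread, not per iteration
    #pragma omp atomic
    total += local;
  }
  return total;
}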

By the way, we realize that programming for GP-GPUs will be important,
but that would be an entire project in itself.

the point i wanted to make is a different one:
what i expect “modern” parallelization techniques to
mean is that you have to optimize for a heterogeneous
environment where you may have different ‘facilities’
with different characteristics, so the work items
have to be balanced between them in a smart way.
this currently requires benchmarking a lot of permutations,
since there are different ‘dimensions’ along which
parallelization can be applied. that is difficult to master,
even for an expert. figuring out ways to make this more
uniform again, at least at the application user level,
would automatically make people more efficient.

cheers,
axel.