[lammps-users] performance

Dear Axel

Thank you for the reply. I think the processors keyword does not help in my case, since the particles are situated in the middle of the simulation box, such that all three directions are affected by the vacuum.

What do you mean by 'dual level parallelism
with OpenMP-enabled'? I have hardly any experience with MPI coding. Is it necessary for this purpose to change code in lammps? Or is it sufficient to add some lines in the qsub script to activate some OMPI intrinsic functions?

Best regards
Sabine

Axel Kohlmeyer [email protected] 01/04/11 5:23 PM >>>
Dear lammps users

I want to study the interaction between two wetted rigid particles
(pair_style lj/cut/coul/cut). Only the water molecules are allowed to move
while the atomistic nanoparticles are fixed in position. Since I want to
observe the formation of water bridges between the particles and avoid
interactions of particles with their own images caused by periodic boundary
conditions, considerably large parts of the simulation cell are empty
(vacuum). I am carrying out simulations on an HPC (
http://www.zid.tuwien.ac.at/zserv/applikationsserver/vienna_scientific_cluster/
) consisting of modern Intel quad core dual processor machines connected by
Infiniband (40Gbps). The performance of the HPC on the benchmark systems
from the lammps webpage is comparable to the results published there. However, if
parts of the simulation cell are empty performance strongly degrades.
Simulations on 8 cores (one node) take the same time as on 16 cores (2
nodes), while those on 32 cores take even longer. The communication
time rises sharply with every additional node.
I have now reduced the simulation box as much as possible, but I still cannot
make use of more than 16 cores. Is there any setting in the input file I can
choose to improve the performance even though there is empty space in the
simulation box?

this is likely to be a load balancing problem. lammps uses a fixed
domain decomposition based on the box dimensions. if you have large vacuum
regions, then you create a load imbalance.

check out the processors keyword. if you use at most two MPI tasks
in the direction where you have the vacuum, you should get better
load balancing and thus better scaling.
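
for illustration only (pick a grid whose product matches your MPI task
count; the numbers below are not a recommendation), with 16 tasks that
could look like this in the input script:

  processors 2 4 2

this keeps only 2 tasks along x and z, so fewer domains sit entirely in
the vacuum.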

the second option would be to switch to using dual level parallelism
with OpenMP-enabled or GPU-accelerated pair styles. this allows you to
keep the domains fairly large and thus reduces the overhead.
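
as a rough sketch only (the exact style names depend on the lammps
version/branch you use, and the cutoffs below are placeholders), the
threaded variants follow an /omp suffix naming convention:

  # threaded variant of the pair style; 10.0 and 8.0 are placeholder cutoffs
  pair_style lj/cut/coul/cut/omp 10.0 8.0

the rest of the input can stay the same; the threads then work inside
each (larger) MPI domain.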

cheers,
axel.

Dear Axel

dear sabine,

Thank you for the reply. I think the processors keyword does not help in my
case, since the particles are situated in the middle of the simulation box,
such that all three directions are affected by the vacuum.

at this point it is probably best if you could provide an example input
(just plain MD, nothing fancy) and, if possible, reference timings with it.

that will make it easier to discuss. even if there is vacuum around your
system in all directions, there are many scenarios where the default
processor grid calculation can be inefficient, since it assumes a
homogeneous distribution of atoms.

What do you mean by 'dual level parallelism
with OpenMP-enabled'? I have hardly any experience with MPI coding. Is it

OpenMP is a different parallelization paradigm based on threading.
OpenMP parallelization is orthogonal to MPI parallelization, so you
can combine both. because of the way lammps is structured, and
particularly how the memory accesses happen due to using neighbor
lists, it is usually a little bit faster to use MPI parallelization only.
but if you are limited by communication bandwidth or by load balancing,
or if you want to scale to a very large number of processor cores, a
combination of MPI and OpenMP/threading parallelization scales further
and ends up fastest overall. i've seen improvements of up to 4x for
systems with homogeneous densities. for inhomogeneous densities (e.g.
two fused nanotubes in vacuum), it can already be faster at 8 processor
cores to use OpenMP instead of MPI, or a combination of MPI and OpenMP.

necessary for this purpose to change code in lammps? Or is it sufficient to
add some lines in the qsub script to activate some OMPI intrinsic functions?

you need a few (small) changes to the lammps code itself, and the
pair style (and other time-consuming) classes have to be rewritten to
utilize the threaded code. those need to be compiled with special
compiler flags that activate the OpenMP directives. you then select
the number of threads per MPI task by setting the environment
variable OMP_NUM_THREADS or by using the nthreads keyword
(the latter overrides the former). this is all implemented in the
LAMMPS-ICMS branch available here:

http://sites.google.com/site/akohlmey/software/lammps-icms
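
as a rough sketch of the job script side (the executable name, input
file name, mpirun flags and core counts are placeholders; adapt them to
your cluster and queueing system):

  # one dual quad-core node: 4 MPI tasks x 2 OpenMP threads = 8 cores
  export OMP_NUM_THREADS=2
  mpirun -np 4 ./lmp_openmpi -in in.particles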

in general, to learn more about parallelization and optimization of codes
using MPI and OpenMP/threading, i recommend the online self-study
courses at the CI-Tutor webpage: http://www.citutor.org

cheers,
   axel.