tersoff eam input "does not scale"

Hello

One of our users has a "this input doesn't scale" problem, which I, being an IT guy rather than a physicist or chemist, was tasked to help with.
More specifically, he complains that the time to solution of the simulation stops decreasing when using more than 4 cores (1 MPI process per core, OMP_NUM_THREADS=1).

Attached are the input files used.

I've found some possible clues in the list archives, but I'm not sure what to make of them:

http://lammps.sandia.gov/threads/msg34987.html
http://lammps.sandia.gov/threads/msg02455.html
http://lammps.sandia.gov/threads/msg05564.html

The LAMMPS version used is 1Feb14, with the following HW/SW setup:

HW: 2x AMD Opteron 6386 SE, 256 GB RAM
OS: CentOS 6.5 x64
FC: gfortran 4.4.7
CXX: g++ 4.4.7
MPI: OpenMPI 1.5.4 (stock CentOS 6.5 package)
BLAS/LAPACK: OpenBLAS 0.12
FFT: FFTW 3.3.4 (--enable-sse2)

Also, the execution is completely vanilla:
$ mpirun -np #cores lammps-mpi -in input

TIA,
Fabricio

coords (244 KB)

input.ceramic (2.45 KB)

Pd_u6.eam (35.6 KB)

sicn.tersoff (5.51 KB)

> Hello
>
> One of our users has a "this input doesn't scale" problem, which I, being
> an IT guy rather than a physicist or chemist, was tasked to help with.
> More specifically, he complains that the time to solution of the simulation
> stops decreasing when using more than 4 cores (1 MPI process per core,
> OMP_NUM_THREADS=1).

yes, this is true and is caused by specifics of the simulation input.
thus it is really something your "customer" has to pay attention to, not
an IT problem. ...and i am speaking as a physical and theoretical
chemist working primarily in HPC and doing a lot of IT-related work
(like operating an HPC cluster or teaching classes in an HPC master
program).

there are two contributing factors:
- for MPI parallelization LAMMPS does a domain decomposition that
assumes a homogeneous distribution of atoms throughout the volume of
the simulation box, but that is not the case for this input, which is
shaped like a cannonball stuck in a board. when using more than 4
domains, this leads to a severe load imbalance, because the cuts no
longer create domains with a similar number of atoms (see the sketch
after this list).
- also, the simulation is built as a hybrid model combining 3 different
types of interactions that have to be computed one after the other.
again, using more than 4 domains will lead to a load imbalance, since
the domains no longer contain a similar number of atoms of each
interaction type.
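
as an illustration of the decomposition (not the actual fix): the
processor grid LAMMPS picked is printed near the top of the log file,
and you can also pin it down yourself with the processors command.
the 2x2x1 grid below is only a hypothetical example, assuming the
inhomogeneity is mostly along z:

# hypothetical: force a 2x2x1 grid of domains for 4 MPI ranks, so the
# box is only cut in x and y. must appear before the box is defined,
# i.e. before read_data
processors 2 2 1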

i also believe that the initial configuration is not quite correct, as
the periodic images are overlapping in the z direction, leading to a
very high energy. for this kind of system i would expect a non-periodic
boundary in the z direction.
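
a minimal sketch of what that could look like, assuming the structure
simply should not see its own periodic image in z (put this at the top
of input.ceramic, before the box is defined; whether shrink-wrapped "s"
or fixed "f" is the right choice is something your user has to decide):

boundary   p p s      # periodic in x and y, non-periodic (shrink-wrapped) in z
read_data  coords     # data file name taken from the attachments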

anyway, here are some tips to improve the load balance:

after the read_data command, you can use the command

balance 1.0 shift xyz 5 1.1

to tune the sizes of the domains so they contain approximately the same
number of atoms. this should improve scaling up to a point.
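
to spell out the placement and the arguments (the data file name is
only assumed from the attachments):

read_data  coords
# 1.0 = rebalance whenever the balance is not already perfect; shift
# the domain cuts in x, y, and z for up to 5 passes, stopping once the
# imbalance factor (max atoms per processor / average) drops below 1.1
balance    1.0 shift xyz 5 1.1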

beyond that, you can try using OpenMP parallelization on top of MPI.
for that purpose, you add the flags -pk omp 2 -sf omp to the command
line, which will use 2 OpenMP threads for each MPI task. 3 or 4 threads
can be tried as well, but usually there is not much gain beyond that.
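
a hypothetical invocation, keeping the binary and input names from
above and assuming LAMMPS was built with the USER-OMP package:

$ mpirun -np 8 lammps-mpi -in input -pk omp 2 -sf omp

-pk omp 2 sets 2 threads per MPI rank and -sf omp switches the pair
styles to their /omp variants, so 8 ranks x 2 threads would keep 16
cores busy.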

one also has to keep in mind that the system is very small, with only
about 7000 atoms. once the number of atoms drops below roughly 1000 per
CPU core, parallel scaling becomes more and more difficult.
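
for reference: 7000 atoms split over 8 MPI ranks is already only about
875 atoms per rank, and over 16 ranks about 440, so even with a perfect
load balance the per-domain communication overhead eats up an ever
larger fraction of the run time.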

> Attached are the input files used.

> I've found some possible clues in the list archives, but I'm not sure
> what to make of them:

none of what is mentioned in those threads seems related to this. in
any case, the main issue is that your user does not understand
how to use LAMMPS efficiently. these issues are all explained in great
detail in the LAMMPS manual and in the publication describing the
LAMMPS parallelization approach.

axel.