severe load balance issue

Dear lammps users,

My job on the HPC has a severe load-balance issue: about half of the nodes do not do any significant
work. I have checked src/timer.cpp (version: 31Mar17) and the MPI_Barrier(world) call is already uncommented.

I use the Kokkos version of ReaxFF, and this is how I run my simulation on the HPC.

NP=2560 # number of cores
NPN=40  # cores per node
LMP=lmp_kokkos
PARAM=in.pullout

mpirun -np $NP -ppn $NPN $LMP -k on -sf kk -pk kokkos newton on neigh half comm no -in $PARAM

I appreciate any suggestion regarding this issue.

Samaneh

Have you tried using fix balance?

Michal

load-balancing issues in LAMMPS rarely have their origin in the mpirun
command line. usually, they are caused by an uneven distribution of atoms
across the parallel subdomains.
please check out the documentation of the "balance" and "processors"
commands.
also, your system has to provide a sufficient amount of work to parallelize
over. you have apparently requested 2560 individual CPU cores; how many
atoms does your system have?

axel.
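For reference, the two commands mentioned above could look like this in a LAMMPS input script; the processor layout and the balance thresholds below are illustrative guesses, not values tuned for this system:

```
# force a single subdomain along z (useful if the system is slab-like
# and thin in z, so atoms spread evenly over the x/y subdomains)
processors * * 1

# one-time rebalance before the run, then periodic dynamic balancing:
# rebalance every 1000 steps if max/avg atoms per processor exceeds 1.1
balance 1.1 shift xy 10 1.1
fix lb all balance 1000 1.1 shift xy 10 1.1
```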

Dear Axel and Michael,

Thanks for your comments. I will read about balance commands.

Regarding the CPU cores, I am using 64 nodes, each of which has 40 cores. My system has around 20,000 atoms, which is not a lot(!). The reason I use that many nodes is that ReaxFF is too slow; I thought increasing the number of nodes would make the simulation faster. Apparently there are some other factors I should consider.

Thanks again for your help.

Samaneh

most importantly, you should do benchmarks to determine the optimal usage
of your resources. no parallel code can be parallelized perfectly to an
infinite number of processors.
mind you, using the balance command is only the second-best option. the
best option is to guarantee an even distribution of atoms into subdomains
(e.g. for slab systems, this can easily be done via the processors
command). balance is meant to help in situations where this is not, or
only partially, possible. for notoriously bad cases, there is also the
option to switch to tiled communication and recursive bisectioning. in any
case, however, careful benchmarking and performance monitoring is a must.

axel.
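A sketch of the tiled-communication and recursive-bisection variant mentioned above (the thresholds are again arbitrary assumptions):

```
# replace the regular brick decomposition with tiled communication,
# then let recursive coordinate bisectioning (rcb) cut the box so that
# each tile holds roughly the same number of atoms
comm_style tiled
balance 1.1 rcb
fix lb all balance 1000 1.1 rcb
```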

In addition to the benchmarks, I might also suggest looking at the statistics printed in the log at the end of a run (a very short run might be enough). That will indicate how much you are losing in communication, which is probably quite a lot! As a general idea, in solid-state MD, I find ~10 atoms per core very, very small! I might be wrong but curious, so keep me/us informed about the results of your benchmarks! :)
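For the numbers quoted earlier in the thread, the atoms-per-core ratio is quick to check; a back-of-the-envelope shell sketch using the values from the original job script:

```shell
NP=2560        # total MPI ranks from the original run script
NATOMS=20000   # approximate system size mentioned above
echo "$(( NATOMS / NP )) atoms per core"   # prints: 7 atoms per core
```

With only ~7 atoms per MPI rank, the run is far past the strong-scaling limit, which is consistent with the communication cost discussed here.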

Also, another thought. Load balancing is based on the number of particles only. In some specific cases, could it make sense to also consider the local structure (directional flow, localized complex interface…) in order to distribute the cores?

it is not only the percentage of communication that is an indication of
load imbalance (that is more an indication of reaching or exceeding the
limit of strong scaling); it is the difference between "min time", "max
time" and "avg time", or even simpler the "%varavg" column for the "Pair"
category, that gives a good indication of load balance or imbalance.

actually, the statement that load balancing is based only on the number of
particles is not entirely true anymore. there are advanced load-balancing
options, e.g. time spent, number of neighbors, groups, or just arbitrary
expressions as variables, that can be used as weights to improve the load
balancing. e.g. for inhomogeneous systems, balancing on the number of
neighbors or the CPU time, or a mix of the two, can be more effective than
just particle-based balancing. that said, particle-based balancing usually
does get you most of the way; the additional factors offer some "knobs" to
fine-tune the balancing.

axel.
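As a sketch of those weighting knobs (the weight factors below are arbitrary illustrations, not recommendations):

```
# dynamic balancing that weights each atom by its neighbor count and
# each processor by the time it spent in the previous interval
fix lb all balance 1000 1.1 shift xy 10 1.1 weight neigh 0.5 weight time 0.66
```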

Oh great! Thanks for the clarifications!
I'm one year out of date on that feature, it seems (added 27 Sept 2016)… this is what happens when I Read The Friendly Manual less often than I should :)

julien.