Simulation Balance Help

Hi all,

I'm simulating a big spherical colloid with a diameter of 10 nm in a box containing water and a polymeric (PMMA) wall; the simulation box is 15 nm x 15 nm x 16 nm.
The system without the colloid, using only CPUs, reaches its best performance when running on 800 procs (approximately 450 particles per proc on average) and runs at a rate of almost 9 ns/day (please see the attached log file pmma_water.log for details).
However, after creating the colloid in the system and deleting the particles in the volume it now occupies, I have a system with a large colloidal particle (and therefore a large void) but fewer particles overall. Since I have particles of different sizes and therefore a wide range of cutoffs, I'm using the multi neighbor list style. Keeping the same number of particles per proc (approximately 450), I'm using 660 procs in this case. However, this simulation runs more than 4 times slower than before adding the colloid, which is understandable since the colloid has created a large void in the system and many of the processors are left with no particles and run idle (please see the attached log file pmma_water_colloid.log for details).
Next, I tried LAMMPS's capability of balancing the simulation to distribute the load evenly over the processors. I used "fix balance", but I was unable to get any better results (please see the attached log file pmma_water_colloid_balance.log for details).
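For reference, the relevant commands are along these lines (the skin distance here is just illustrative; the exact balance settings I used are echoed in the attached log):

neighbor    2.0 multi
fix         1 all balance 500 1.05 shift xyz 20 1.05 out tmp.balance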

The MPI timing breakdown shows that, in the system without the colloid, the most time-consuming operations are Pair and Kspace, at roughly 32% and 50% of the run time, respectively. However, after adding the colloid, Comm and Neigh become the bottleneck, and they stay that way even after applying "fix balance". Also, after applying "fix balance" and writing out the partitioning pattern consecutively using "dump subbox", the result doesn't look the way I expected: I expected the partition containing the colloid to be large, since no other particles can be within 5 nm (its radius) of its center. (I have attached images of the domain partitions in partitions_evolution.tar.gz, where I show the colloid (blue sphere) with a small diameter so the partitions remain visible.)

Should I expect better performance for this system using "fix balance", or does it need other treatment?

I would appreciate any comment on how I can improve the simulation while I have this big particle (large void) in the system.

Thank you!

Kasra.

pmma_water_colloid_balance.log (8.72 KB)

pmma_water_colloid.log (7.91 KB)

pmma_water.log (6.14 KB)

partitions_evolution.tar.gz (678 KB)


please note that at 800 CPUs you have less than 400 atoms per processor, which is likely close to the limit of scaling for a homogeneous system. on top of that, due to the nature of the interactions to be computed for the colloidal particle, you will have to spend more time building neighbor lists and communicating extra data, so you *have* to expect some slowdown because of that. thus the extra cost comes primarily from the nature of the interactions of the colloidal particle and less from the load imbalance.

in addition, the fact that you have so many small domains makes it difficult to do any efficient load balancing for a local inhomogeneity. thus a possible way to improve performance would be to run in hybrid MPI/OpenMP mode. that will result in larger domains and should somewhat reduce the communication overhead as well.
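for illustration only (this is a sketch, not a recipe: it assumes a lammps binary built with the USER-OMP package, and the executable name, input file, and rank count are placeholders), a hybrid launch looks something like:

mpirun -np 330 lmp -sf omp -pk omp 2 -in in.pmma_water_colloid

the -sf omp switch selects the /omp variants of the styles and -pk omp 2 sets 2 OpenMP threads per MPI rank; the same can be done inside the input script with "package omp 2" plus "suffix omp".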

axel.

Thanks for the prompt response, Axel.
I use 800 CPUs when there is no colloid in the system, which gives me 450 atoms per processor (water_colloid.log), and I use 660 CPUs for the simulation with the colloid, which gives me 460 atoms per processor.

When using hybrid MPI/OpenMP, is there any rule of thumb for choosing the number of MPI processes and OpenMP threads, given that roughly 400 atoms per processor is about the limit for pure MPI on a homogeneous system?
Is it right to say that each thread should now be in charge of approximately 400 atoms?

Thank you,

Kasra.


there is no such rule. this situation is much more complex and this is not a simple optimization.
a) you have to check whether your MPI library does any kind of processor affinity setting. if the setting is per core, it needs to be changed to per socket.
b) the OpenMP pair styles have some additional optimizations, but the OpenMP implementation is optimized for a small number of threads per MPI rank. the more threads you have, the more overhead you incur. so try with 2 threads per MPI rank and increase the number of threads (and correspondingly decrease the number of MPI ranks) until you don't see any speedup or you have one MPI rank per socket (some example splits are sketched after this list).
c) since the OpenMP parallelization uses particle decomposition instead of domain decomposition, you should get larger subdomains, which should improve the effectiveness of the load balancer that shifts the domain division planes and also reduce the fraction of ghost particles that need to be communicated per subdomain.
d) to get a good impression of how much you benefit from OpenMP+MPI vs. load balancing, i would suggest running the same test without the colloid particle as well.
e) it is essentially impossible to predict where your strong scaling limit is. using MPI+OpenMP should give you some benefit in performance with the same total number of CPU cores. whether you can push to using more CPU cores is hard to tell, since that will re-introduce some of the overhead that you are trying to remove by using MPI+OpenMP. there is going to be an optimum somewhere, but due to the non-linearity and the correlations of how the various parameters affect performance, it is impossible to make specific predictions.
f) breaking the performance down to (average) neighbors per atom is not sufficient. the situation is much more complex.
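to make b) concrete: assuming, purely for illustration, nodes with two 10-core sockets and the same total of 660 cores you used for the colloid run (executable and input file names are placeholders, and -ppn is the Hydra-style ranks-per-node flag), the splits to try would look like:

# 10 MPI ranks x 2 OpenMP threads per node
mpirun -np 330 -ppn 10 lmp -sf omp -pk omp 2 -in in.colloid
# 4 MPI ranks x 5 OpenMP threads per node
mpirun -np 132 -ppn 4 lmp -sf omp -pk omp 5 -in in.colloid
# 2 MPI ranks x 10 OpenMP threads per node, i.e. one rank per socket
mpirun -np 66 -ppn 2 lmp -sf omp -pk omp 10 -in in.colloid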

axel.

Hi Axel,

I appreciate the clear how-to that you provided. It took me a while to figure out the details you mentioned, as it required a better knowledge of the computer architecture and the MPI library. Here I'll write up my efforts and the results for future reference, and ask for comments and help:

– Details of the cluster that I'm using (per node):

  • Two 10-core 2.8 GHz E5-2680v2 Xeon processors
  • 64 GB memory

– My MPI library: MVAPICH 2.0, which uses Hydra as the process manager

– The CPU (core) affinity is enabled by default (i.e. MV2_ENABLE_AFFINITY=1).

– I disabled that affinity, but then each process was free to run on all the cores of a node, which caused a huge performance degradation.

– I figured out that, after disabling the default behavior, I have to use Hydra flags to control the affinity. I use "-bind-to core:n" to set the affinity, choosing "n" equal to the number of OpenMP threads that I'm going to set for each MPI process. Using "-bind-to socket" gave each process on a node access to all the cores of a socket and caused a great degradation, so I'm not really sure whether my settings satisfy item (a) of the suggestions.

– I use "-ppn" to specify the required number of processes per node (MPI ranks per node).

– I'm using an Intel compiler. I set KMP_AFFINITY=compact and no other options, because the nodes on the cluster I'm running on do not use Hyper-Threading. I also tried "scatter" but didn't notice any performance improvement.

After setting all these parameters and experimenting with the number of MPI ranks and the number of OMP threads, I have summarized the simulation timings in the attached files, which are named after the number of MPI ranks and OMP threads used. In all cases I'm balancing with: fix 1 all balance 500 1.05 shift xyz 20 1.05 out tmp.balance

== The best timing I could get was with 380 MPI ranks and 2 OMP threads (380MPI_2OMP.log), which is approx. 2.5 ns/day, whereas a pure MPI simulation with 660 ranks runs at approx. 2 ns/day. So I was wondering whether this is expected, and whether I should see much better performance if I magically found the optimum configuration. As Axel mentioned, it's almost "impossible to make specific predictions", but I was wondering whether experts can get a clue from my experiments and comment on that.

== I was not able to use 4 OpenMP threads, as the simulation would get stuck and not continue. For example, one of the cases looks as follows:

-------------CPU AFFINITY-------------
RANK:0 CPU_SET: 0 1 2 3
RANK:1 CPU_SET: 4 5 6 7
RANK:2 CPU_SET: 8 9 10 11
RANK:3 CPU_SET: 12 13 14 15
RANK:4 CPU_SET: 16 17 18 19
-------------------------------------
LAMMPS (11 Sep 2015-ICMS)
using 4 OpenMP thread(s) per MPI task
using multi-threaded neighbor list subroutines
using multi-threaded neighbor list subroutines

And it stays there forever; the simulation doesn't start. I see this behavior whenever I use 4 threads, no matter how many MPI ranks I choose.

== I use delete_bonds in my original simulations; however, during the experimentation I encountered situations where the simulation would get stuck when it got to deleting the bonds:

:
:
Neighbor list info …
3 neighbor list requests
update every 1 steps, delay 0 steps, check yes
max neighbors/atom: 100000, page size: 1000000
master list distance cutoff = 102
ghost atom cutoff = 102
binsize = 7, bins = 22 22 23
Deleting bonds …

This behavior is not reproducible. Sometimes the run just stays there forever without any error, but once it crashed at the delete_bonds stage with the following error:

:
:
Neighbor list info …
3 neighbor list requests
update every 1 steps, delay 0 steps, check yes
max neighbors/atom: 100000, page size: 1000000
master list distance cutoff = 102
ghost atom cutoff = 102
binsize = 7, bins = 22 22 23
Deleting bonds …
ERROR on proc 79: Failed to reallocate 1966080 bytes for array atom:x (…/memory.cpp:66)
[cli_79]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 79

I would appreciate it if you could take the time to comment on these issues.

Cheers,
Kasra.

660MPI_1OMP.log (2.34 KB)

380MPI_2OMP.log (2.15 KB)

330MPI_2OMP.log (2.15 KB)

132MPI_5OMP.log (2.05 KB)

66MPI_10OMP.log (2.01 KB)