please always copy the list on your replies. this way people that have
the same questions later, can look up any explanations in the mailing
list archives. thank you.
Thanks a lot for your help. However, I don't quite get why using OMP can
improve the efficiency in terms of separating the kspace calculation like
verlet/split. To my understanding, the OMP scheme uses the same partition
scheme, e.g. kspace calculations are distributed over all the threads,
you don't seem to understand what problem verlet/split is solving.

in general, and especially with a relatively small number of MPI tasks,
it is *most* efficient to use the regular verlet run_style, since
verlet/split can easily result in a load imbalance and thus
inefficiency. the reason why one still wants to use verlet/split is
that the distributed parallel 3d FFTs, and the necessary reordering of
the charge density data and the resulting potential from distributed
domains to "pencils" and back to domains, don't scale so well. this
overhead gets larger the more processors you use, up to the point
where it starts to dominate and you don't scale anymore.
verlet/split addresses this by keeping the number of processors that
handle kspace separate from the number of processors that do the rest
(well, the latter has to be a multiple of the former). this way you can
scale farther, since the non-bonded calculation parallelizes much better.
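to make the split concrete, a verlet/split run uses two partitions set up
via the -partition command-line switch. the rank counts below are made up;
pick them for your machine, keeping the first partition a multiple of the
second:

```shell
# hypothetical 20-rank job: 16 ranks do pair/bond/neighbor work,
# 4 ranks do pppm/kspace, i.e. a 4:1 ratio so the kspace partition
# divides the real-space partition evenly.
mpirun -np 20 lmp -partition 16 4 -in in.script

# and in in.script:
#   run_style    verlet/split
#   kspace_style pppm 1.0e-4
```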
using multi-threading has the same effect, but without the (potential)
load imbalance: the number of MPI tasks is reduced just the same, and
thus the 3d-FFT scaling issue is avoided as well, even when you are
using pppm without thread support. threading of the non-bonded
interactions is as efficient (for a small to moderate number of
threads) as the MPI parallelization. the fact that there is
additional threading *inside* pppm is a bonus. of course, if you want
to go to the *very* extreme, you would be using both (as has been done
on the very large BG/Q at argonne national labs, for example, where
people developed verlet/split originally and we made it compatible
with multi-threading).
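for illustration, a hybrid MPI + OpenMP run of this kind with the OMP
package could look like the following. the binary name, rank/thread
counts, and input file name are placeholders, and the exact command-line
switches depend on your LAMMPS version:

```shell
# hypothetical: 4 MPI tasks x 4 OpenMP threads instead of 16 MPI tasks.
# fewer MPI ranks means fewer FFT "pencils" to redistribute in pppm.
env OMP_NUM_THREADS=4 mpirun -np 4 lmp -sf omp -pk omp 4 -in in.script
```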
so unless you have a huge number of CPUs, you will be best off using
plain MPI and a small number of threads (2-4). you can also tweak
performance a bit by adjusting the coulomb cutoff. only when you
reach the limit of kspace scaling do you want to crank up the number
of threads and use verlet/split.
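as a sketch of the cutoff tweak (the cutoff values here are made up): a
longer coulomb cutoff shifts work out of kspace, which scales poorly due
to the FFTs, into the real-space pair computation, which parallelizes
much better:

```
# in.script fragment: LJ cutoff 10.0, coulomb cutoff raised to 12.0
pair_style      lj/cut/coul/long 10.0 12.0
kspace_style    pppm 1.0e-4
```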