[Q] Request clarification of BUILD_OMP and PKG_USER-OMP

Hi,

MPI is not the only thing setting affinity; OpenMP does as well. If you have htop on your system, you can very easily check the affinity actually in effect for each thread (select the thread and press 'a').

This was (and is) a problem on my system, as my OpenMP runtime seems to *always* bind every thread to 0x1, which is of course completely counterproductive. Maybe try setting OMP_PROC_BIND=FALSE? The threads should then inherit whatever affinity MPI set for that rank.

See also:
<https://gcc.gnu.org/onlinedocs/libgomp/OMP_005fPROC_005fBIND.html#OMP_005fPROC_005fBIND>
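If you want to check programmatically rather than through htop, a small OpenMP test program can print each thread's affinity mask and the binding policy the runtime picked up. This is just a rough sketch I typed up (Linux and GCC assumed, not taken from LAMMPS): sched_getaffinity() with pid 0 reports the mask of the calling thread, and omp_get_proc_bind() shows which OMP_PROC_BIND policy is in effect.

/* build e.g. with: gcc -fopenmp -o check_affinity check_affinity.c */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* binding policy in effect; after OMP_PROC_BIND=FALSE this should
     * report omp_proc_bind_false (numeric value 0) */
    printf("omp_proc_bind policy: %d\n", (int) omp_get_proc_bind());

    #pragma omp parallel
    {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        /* pid 0 = query the affinity mask of the calling thread */
        if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
            #pragma omp critical
            {
                printf("thread %d bound to CPUs:", omp_get_thread_num());
                for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
                    if (CPU_ISSET(cpu, &mask))
                        printf(" %d", cpu);
                printf("\n");
            }
        }
    }
    return 0;
}

If every thread reports only CPU 0, you are seeing the same broken binding I described above.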

Sebastian

Hi, Sebastian:

After fixing my error with the lmp options, I did try setting OMP_PROC_BIND=FALSE, but it did not change performance compared with not setting it at all. In both cases, running 10 MPI procs with 4 OMP threads gave about 1320 timesteps/sec (about 200 timesteps/sec slower than not using OMP threads at all).

Regards,
Dave

For a dense system like the test input you provided, MPI parallelization (especially when used in combination with per-core processor affinity) often performs better, since the domain decomposition enhances cache locality. For threads, one would have to come up with a way to construct locality-promoting neighbor lists, which is not so easy. Also, the USER-OMP styles have overhead that increases with the number of threads.
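To give a rough idea of where that per-thread overhead comes from (this is my own simplified illustration, not the actual USER-OMP code): a common way to avoid write conflicts when several threads accumulate forces on shared atoms is to give each thread a private copy of the force array and reduce the copies afterwards. The reduction touches natoms * nthreads entries, so its cost grows with the number of threads.

#include <omp.h>

/* f        : global force array, 3 doubles per atom
 * f_thread : nthreads consecutive private copies of f */
void reduce_thread_forces(double *f, const double *f_thread,
                          int natoms, int nthreads)
{
    #pragma omp parallel for
    for (int i = 0; i < 3 * natoms; ++i) {
        double sum = 0.0;
        /* sum the contribution of every thread's private copy */
        for (int t = 0; t < nthreads; ++t)
            sum += f_thread[(long) t * 3 * natoms + i];
        f[i] += sum;
    }
}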

Also, the code in both USER-INTEL and USER-OMP is typically faster, even when run without threads, than the corresponding non-threaded styles, due to optimizations not present in the reference lj/cut pair style.

axel.