Performance Degradation in Neighbor List Construction in OPENMP Package

I observe a performance degradation in neighbor list construction. Steps to reproduce (the effect is system dependent; some MPI ranks are slower than others):

$ cat in.sc
region region  block 0 80 0 80 0 80 units box
create_box     1 region
lattice        sc 8
create_atoms   1  box
mass           1 1
neighbor       0.0 bin
neigh_modify   delay 0 every 1 check no
pair_style     lj/cut 1
pair_coeff     * * 1 1
fix            nve  all  nve
timer          full sync
run            1
$ OMP_NUM_THREADS=15 mpiexec -n 8 lmp_omp -in in.sc | awk '/^Neigh   \|/'
Neigh   | 0.85879    | 0.99847    | 1.97471    |   3.3 |  97.5 | 50.03

I suspect the degradation is caused by false sharing between different MyPage<int> instances, for example in src/OPENMP/npair_half_bin_atomonly_newton_omp.cpp. If I add padding to the MyPage class, performance improves and becomes consistent across MPI ranks:

modified   src/my_page.h
@@ -91,6 +91,7 @@ template <class T> class MyPage {
   int status() const { return errorflag; }
 
  private:
+  char padding[1024];
   T **pages;    // list of allocated pages
   T *page;      // ptr to current page
   int npage;    // # of allocated pages

@slitvinov Thanks for your report. The data size of a MyPage instance is 64 bytes, so there should only be false sharing if the initial memory address is not 64-byte aligned (the default on x86 Linux is 16-byte alignment for the default malloc and C++ new). Thus a padding of 64 bytes should be sufficient.

I don’t see such a drastic difference (factor 2 between slowest and fastest) when testing this change. Your MPI command line does not have any provisions for processor binding and placement of MPI ranks. Is this handled elsewhere? And how is the assignment of MPI tasks and threads in relation to the hardware you are running on?

@akohlmey, thank you for looking into this. Let me dump the tests I ran on the cluster; I hope it is of some help. I have changed the configuration (more atoms, ten steps). lmp_a is the baseline, lmp_b is the version with padding; the outputs are for three consecutive runs.

$ cat in.sc
region region   block 0 160 0 160 0 160 units box
create_box 1    region
lattice         sc 8
create_atoms    1  box
mass            1 1
neighbor        0.0 bin
neigh_modify    delay 0 every 1 check no
pair_style      lj/cut 1
pair_coeff      * * 1 1
fix             nve  all  nve
timer           full sync
timestep        0
run             10

$ export OMP_PROC_BIND=FALSE
$ OMP_NUM_THREADS=30 srun --mpi=pmi2 --jobid=1602 -n 16 lmp_a -sf omp -in in.sc  | awk '/^Neigh   \|/' | sed 1q
Neigh   | 0.7534     | 1.6972     | 3.7739     |  57.6 |2545.2 | 30.63
Neigh   | 0.75636    | 1.7308     | 3.7505     |  52.9 |2554.2 | 31.24
Neigh   | 0.62556    | 1.5402     | 2.9796     |  46.3 |2505.5 | 32.24

$ OMP_NUM_THREADS=30 srun --mpi=pmi2 --jobid=1602 -n 16 lmp_b -sf omp -in in.sc  | awk '/^Neigh   \|/' | sed 1q
Neigh   | 0.58162    | 0.633      | 0.72198    |   4.8 |2044.1 | 25.67
Neigh   | 0.58698    | 0.64302    | 0.76628    |   6.5 |2051.5 | 25.41
Neigh   | 0.58115    | 0.63367    | 0.78549    |   6.3 |2045.5 | 25.07

$ export OMP_PROC_BIND=TRUE
$ OMP_NUM_THREADS=30 srun --mpi=pmi2 --jobid=1602 -n 16 lmp_a -sf omp -in in.sc  | awk '/^Neigh   \|/' | sed 1q
Neigh   | 0.6112     | 1.1819     | 2.1394     |  37.1 |2408.1 | 30.13
Neigh   | 0.60916    | 1.0589     | 2.0349     |  37.5 |2333.8 | 27.94
Neigh   | 0.61074    | 1.1275     | 2.0526     |  39.8 |2361.6 | 29.33

$ OMP_NUM_THREADS=30 srun --mpi=pmi2 --jobid=1602 -n 16 lmp_b -sf omp -in in.sc  | awk '/^Neigh   \|/' | sed 1q
Neigh   | 0.58082    | 0.62216    | 0.71803    |   5.1 |2025.1 | 25.33
Neigh   | 0.57067    | 0.61375    | 0.71362    |   5.0 |2012.5 | 25.03
Neigh   | 0.56845    | 0.6386     | 0.73767    |   6.6 |2040.5 | 25.81

The OpenMP environment is

$ OMP_DISPLAY_ENV=TRUE srun --mpi=pmi2 --jobid=1607 -n 1 lmp_a -in abc
OPENMP DISPLAY ENVIRONMENT BEGIN
  _OPENMP = '201511'
  OMP_DYNAMIC = 'FALSE'
  OMP_NESTED = 'FALSE'
  OMP_NUM_THREADS = '30'
  OMP_SCHEDULE = 'DYNAMIC'
  OMP_PROC_BIND = 'TRUE'
  OMP_PLACES = '{0},{15},{1},{16},{2},{17},{3},{18},{4},{19},{5},{20},{6},{21},{7},{22},{8},{23},{9},{24},{10},{25},{11},{26},{12},{27},{13},{28},{14},{29}'
  OMP_STACKSIZE = '0'
  OMP_WAIT_POLICY = 'PASSIVE'
  OMP_THREAD_LIMIT = '4294967295'
  OMP_MAX_ACTIVE_LEVELS = '2147483647'
  OMP_CANCELLATION = 'FALSE'
  OMP_DEFAULT_DEVICE = '0'
  OMP_MAX_TASK_PRIORITY = '0'
  OMP_DISPLAY_AFFINITY = 'FALSE'
  OMP_AFFINITY_FORMAT = 'level %L thread %i affinity %A'
OPENMP DISPLAY ENVIRONMENT END

MPI ranks are on different nodes.

$ OMP_NUM_THREADS=30 srun --mpi=pmi2 --jobid=1607 -n 16 hostname
slurm-chg-compute-7-0
slurm-chg-compute-7-1
slurm-chg-compute-7-3
slurm-chg-compute-7-6
slurm-chg-compute-7-4
slurm-chg-compute-7-2
slurm-chg-compute-7-5
slurm-chg-compute-7-8
slurm-chg-compute-7-7
slurm-chg-compute-7-9
slurm-chg-compute-7-12
slurm-chg-compute-7-10
slurm-chg-compute-7-11
slurm-chg-compute-7-15
slurm-chg-compute-7-13
slurm-chg-compute-7-14
$ srun --mpi=pmi2 --jobid=1607 -n1 lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                30
On-line CPU(s) list:   0-29
Thread(s) per core:    2
Core(s) per socket:    15
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) CPU @ 3.10GHz
Stepping:              7
CPU MHz:               3100.202
BogoMIPS:              6200.40
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              25344K
NUMA node0 CPU(s):     0-29
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat avx512_vnni md_clear spec_ctrl intel_stibp arch_capabilities

Please note that a) the OPENMP package was not designed with large thread counts in mind; you will be better off using KOKKOS for that, but you need to carefully study and test the various neighbor list options and thread modes, and b) there is very little to gain from hyperthreading, which seems to be active in your case.

What is a bit strange is that you seem to be running inside a virtual machine and to have only one physical CPU per node. What I am also missing is information about the placement of MPI tasks and their processor/memory affinity. If that is inconsistent with the hardware and the use of threads, you may see bogus results when using threads.

Here are timing results from my desktop with a quad-core CPU with hyperthreading enabled, without padding:

Loop time of 2.35763 on 1 procs for 1 steps with 4096000 atoms
Loop time of 1.30233 on 2 procs for 1 steps with 4096000 atoms
Loop time of 0.833176 on 4 procs for 1 steps with 4096000 atoms
Loop time of 0.892913 on 8 procs for 1 steps with 4096000 atoms

and here with padding:

Loop time of 2.31131 on 1 procs for 1 steps with 4096000 atoms
Loop time of 1.35883 on 2 procs for 1 steps with 4096000 atoms
Loop time of 0.828769 on 4 procs for 1 steps with 4096000 atoms
Loop time of 0.90598 on 8 procs for 1 steps with 4096000 atoms

Using OMP_PROC_BIND actually makes things slower again:

Loop time of 2.40802 on 1 procs for 1 steps with 4096000 atoms
Loop time of 1.36829 on 2 procs for 1 steps with 4096000 atoms
Loop time of 0.8527 on 4 procs for 1 steps with 4096000 atoms
Loop time of 0.964493 on 8 procs for 1 steps with 4096000 atoms

I usually set processor binding to "socket" at the MPI level and don't bind the OpenMP threads. That seems to work best for how the OPENMP package is implemented.

Just for reference: for a huge number of threads, using KOKKOS with atomics may be better, which is what Axel is referring to. You need to compile with -DLMP_KOKKOS_USE_ATOMICS; see 7.4.3. KOKKOS package — LAMMPS documentation.

It is a virtual cluster on Google Cloud; the machines are c2-standard-30 instances. I reserved 16 nodes with --exclusive in Slurm, so it should be one MPI rank per node.

I get similar results on Piz Daint. Also one MPI rank per node with --exclusive.


$ for i in 6 12 24; do OMP_PROC_BIND=TRUE OMP_NUM_THREADS=$i srun --jobid=46981470 lmp_a -in in.sc -sf omp | awk '/^Neigh   \|/' | sed 1q; done
Neigh   | 2.5581     | 2.6236     | 2.7554     |   3.8 | 546.5 | 57.83
Neigh   | 1.4566     | 1.5771     | 1.7569     |   8.5 |1020.6 | 48.51
Neigh   | 0.86885    | 1.1125     | 1.9214     |  25.1 |1854.6 | 30.87
$ for i in 6 12 24; do OMP_PROC_BIND=TRUE OMP_NUM_THREADS=$i srun --jobid=46981470 lmp_b -in in.sc -sf omp | awk '/^Neigh   \|/' | sed 1q; done
Neigh   | 2.5322     | 2.5903     | 2.7054     |   2.9 | 545.8 | 57.72
Neigh   | 1.4593     | 1.4929     | 1.5368     |   1.8 |1005.5 | 48.98
Neigh   | 0.84329    | 0.89284    | 0.97801    |   4.6 |1732.5 | 33.53
$ srun --jobid=46981470 -n 1 lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       46 bits physical, 48 bits virtual
CPU(s):              24
On-line CPU(s) list: 0-23
Thread(s) per core:  2
Core(s) per socket:  12
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               63
Model name:          Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
Stepping:            2
CPU MHz:             3105.254
CPU max MHz:         2601.0000
CPU min MHz:         1200.0000
BogoMIPS:            5199.73
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            30720K
NUMA node0 CPU(s):   0-23

However, on my desktop I also see no difference between the padded and the original versions.

Thank you. I am looking into KOKKOS. In benchmarks it does not appear to be significantly faster than OMP; those benchmarks likely do not use atomics.

Those benchmarks are quite old, but I've found that atomics generally give poor performance on Intel CPUs, at least for a low number of threads, which is why they are not the default. However, the data duplication used by the OPENMP package (and by default by the KOKKOS package) won't scale to large numbers of threads, since the memory overhead is high.