@akohlmey, thank you for looking into this. Let me dump the tests I ran on the cluster; I hope they are of some help. I've changed the configuration (more atoms, ten steps). lmp_a is the baseline, lmp_b is with padding. The outputs shown for each command are from three consecutive runs.
$ cat in.sc
region region block 0 160 0 160 0 160 units box
create_box 1 region
lattice sc 8
create_atoms 1 box
mass 1 1
neighbor 0.0 bin
neigh_modify delay 0 every 1 check no
pair_style lj/cut 1
pair_coeff * * 1 1
fix nve all nve
timer full sync
timestep 0
run 10
$ export OMP_PROC_BIND=FALSE
$ OMP_NUM_THREADS=30 srun --mpi=pmi2 --jobid=1602 -n 16 lmp_a -sf omp -in in.sc | awk '/^Neigh \|/' | sed 1q
Neigh | 0.7534 | 1.6972 | 3.7739 | 57.6 |2545.2 | 30.63
Neigh | 0.75636 | 1.7308 | 3.7505 | 52.9 |2554.2 | 31.24
Neigh | 0.62556 | 1.5402 | 2.9796 | 46.3 |2505.5 | 32.24
$ OMP_NUM_THREADS=30 srun --mpi=pmi2 --jobid=1602 -n 16 lmp_b -sf omp -in in.sc | awk '/^Neigh \|/' | sed 1q
Neigh | 0.58162 | 0.633 | 0.72198 | 4.8 |2044.1 | 25.67
Neigh | 0.58698 | 0.64302 | 0.76628 | 6.5 |2051.5 | 25.41
Neigh | 0.58115 | 0.63367 | 0.78549 | 6.3 |2045.5 | 25.07
$ export OMP_PROC_BIND=TRUE
$ OMP_NUM_THREADS=30 srun --mpi=pmi2 --jobid=1602 -n 16 lmp_a -sf omp -in in.sc | awk '/^Neigh \|/' | sed 1q
Neigh | 0.6112 | 1.1819 | 2.1394 | 37.1 |2408.1 | 30.13
Neigh | 0.60916 | 1.0589 | 2.0349 | 37.5 |2333.8 | 27.94
Neigh | 0.61074 | 1.1275 | 2.0526 | 39.8 |2361.6 | 29.33
$ OMP_NUM_THREADS=30 srun --mpi=pmi2 --jobid=1602 -n 16 lmp_b -sf omp -in in.sc | awk '/^Neigh \|/' | sed 1q
Neigh | 0.58082 | 0.62216 | 0.71803 | 5.1 |2025.1 | 25.33
Neigh | 0.57067 | 0.61375 | 0.71362 | 5.0 |2012.5 | 25.03
Neigh | 0.56845 | 0.6386 | 0.73767 | 6.6 |2040.5 | 25.81
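To make the comparison easier to read, here is a minimal Python sketch that averages the Neigh rows from the OMP_PROC_BIND=FALSE runs above. It assumes the columns follow the header LAMMPS prints for `timer full` (min time, avg time, max time, %varavg, %CPU, %total); the helper names `parse_neigh` and `summarize` are made up for this post and are not part of LAMMPS.

```python
from statistics import mean

# Column order assumed from the LAMMPS "timer full" breakdown header.
COLUMNS = ("min", "avg", "max", "varavg_pct", "cpu_pct", "total_pct")


def parse_neigh(line):
    """Split one 'Neigh | ...' timing row into a dict of floats."""
    fields = [f.strip() for f in line.split("|")]
    if fields[0] != "Neigh" or len(fields) != len(COLUMNS) + 1:
        raise ValueError(f"not a Neigh timing row: {line!r}")
    return dict(zip(COLUMNS, map(float, fields[1:])))


def summarize(lines):
    """Mean of the avg column and mean max/min ratio over several runs."""
    rows = [parse_neigh(line) for line in lines]
    return {
        "avg": mean(r["avg"] for r in rows),
        "max_over_min": mean(r["max"] / r["min"] for r in rows),
    }


# Rows copied verbatim from the OMP_PROC_BIND=FALSE runs above.
lmp_a = [
    "Neigh | 0.7534 | 1.6972 | 3.7739 | 57.6 |2545.2 | 30.63",
    "Neigh | 0.75636 | 1.7308 | 3.7505 | 52.9 |2554.2 | 31.24",
    "Neigh | 0.62556 | 1.5402 | 2.9796 | 46.3 |2505.5 | 32.24",
]
lmp_b = [
    "Neigh | 0.58162 | 0.633 | 0.72198 | 4.8 |2044.1 | 25.67",
    "Neigh | 0.58698 | 0.64302 | 0.76628 | 6.5 |2051.5 | 25.41",
    "Neigh | 0.58115 | 0.63367 | 0.78549 | 6.3 |2045.5 | 25.07",
]

a = summarize(lmp_a)
b = summarize(lmp_b)
print(f"lmp_a: avg={a['avg']:.3f} s, max/min={a['max_over_min']:.2f}")
print(f"lmp_b: avg={b['avg']:.3f} s, max/min={b['max_over_min']:.2f}")
print(f"avg-time ratio lmp_a/lmp_b: {a['avg'] / b['avg']:.2f}")
```

The max/min ratio is only a rough stand-in for the imbalance between MPI ranks; the %varavg column already reported by LAMMPS carries similar information.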
The OpenMP environment is:
$ OMP_DISPLAY_ENV=TRUE srun --mpi=pmi2 --jobid=1607 -n 1 lmp_a -in abc
OPENMP DISPLAY ENVIRONMENT BEGIN
_OPENMP = '201511'
OMP_DYNAMIC = 'FALSE'
OMP_NESTED = 'FALSE'
OMP_NUM_THREADS = '30'
OMP_SCHEDULE = 'DYNAMIC'
OMP_PROC_BIND = 'TRUE'
OMP_PLACES = '{0},{15},{1},{16},{2},{17},{3},{18},{4},{19},{5},{20},{6},{21},{7},{22},{8},{23},{9},{24},{10},{25},{11},{26},{12},{27},{13},{28},{14},{29}'
OMP_STACKSIZE = '0'
OMP_WAIT_POLICY = 'PASSIVE'
OMP_THREAD_LIMIT = '4294967295'
OMP_MAX_ACTIVE_LEVELS = '2147483647'
OMP_CANCELLATION = 'FALSE'
OMP_DEFAULT_DEVICE = '0'
OMP_MAX_TASK_PRIORITY = '0'
OMP_DISPLAY_AFFINITY = 'FALSE'
OMP_AFFINITY_FORMAT = 'level %L thread %i affinity %A'
OPENMP DISPLAY ENVIRONMENT END
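The OMP_PLACES ordering above is easier to see when the CPU ids are grouped in pairs. The sketch below just parses the string as printed and pairs up consecutive places. Reading CPU i and CPU i+15 as the two hardware threads of one core is my assumption, based on lscpu reporting 2 threads per core and 15 cores per socket; nothing in the output confirms the sibling numbering. `parse_places` is a made-up helper name.

```python
import re

# Copied from the OMP_DISPLAY_ENV output above.
OMP_PLACES = (
    "{0},{15},{1},{16},{2},{17},{3},{18},{4},{19},{5},{20},{6},{21},"
    "{7},{22},{8},{23},{9},{24},{10},{25},{11},{26},{12},{27},{13},{28},"
    "{14},{29}"
)


def parse_places(spec):
    """Return the CPU id of each single-CPU place, in listed order."""
    return [int(cpu) for cpu in re.findall(r"\{(\d+)\}", spec)]


cpus = parse_places(OMP_PLACES)

# Consecutive places taken two at a time: (0, 15), (1, 16), ...
pairs = list(zip(cpus[0::2], cpus[1::2]))

for first, second in pairs:
    print(f"places for CPUs {first:2d} and {second:2d}")
```

With OMP_PROC_BIND=TRUE and 30 threads, each thread gets its own place, so neighbouring thread ids would land on the paired CPUs shown here (again assuming the pairing reflects hyperthread siblings).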
The MPI ranks are on different nodes:
$ OMP_NUM_THREADS=30 srun --mpi=pmi2 --jobid=1607 -n 16 hostname
slurm-chg-compute-7-0
slurm-chg-compute-7-1
slurm-chg-compute-7-3
slurm-chg-compute-7-6
slurm-chg-compute-7-4
slurm-chg-compute-7-2
slurm-chg-compute-7-5
slurm-chg-compute-7-8
slurm-chg-compute-7-7
slurm-chg-compute-7-9
slurm-chg-compute-7-12
slurm-chg-compute-7-10
slurm-chg-compute-7-11
slurm-chg-compute-7-15
slurm-chg-compute-7-13
slurm-chg-compute-7-14
$ srun --mpi=pmi2 --jobid=1607 -n1 lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 30
On-line CPU(s) list: 0-29
Thread(s) per core: 2
Core(s) per socket: 15
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) CPU @ 3.10GHz
Stepping: 7
CPU MHz: 3100.202
BogoMIPS: 6200.40
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 25344K
NUMA node0 CPU(s): 0-29
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat avx512_vnni md_clear spec_ctrl intel_stibp arch_capabilities