weird MPI errors on Intel 2018 + Open MPI

Hi all,

I am running into some weird problems. If I build my modified version of LAMMPS (including USER-INTEL) with Intel 18.0.2 + Intel MPI, producing lmp_knl, and run it on KNL (Stampede 2), then everything is happy: the code runs fine (and FAST!).

I then tried on another cluster, building lmp_intel_cpu_openmpi with Intel 18.0.0 and Open MPI 3.1.0, and got some seemingly random MPI error messages. What's more, the error does not appear 100% of the time: if I had to estimate, it shows up in roughly 60% of the runs, and in the other 40% everything is fine.

This is what the error looks like:

LAMMPS (11 May 2018)
using 1 OpenMP thread(s) per MPI task
[scc-wj3:74391] *** Process received signal ***
[scc-wj3:74391] Signal: Segmentation fault (11)
[scc-wj3:74391] Signal code: Address not mapped (1)
[scc-wj3:74391] Failing at address: 0x10
[scc-wj3:74391] [ 0] /lib64/libpthread.so.0(+0xf7e0)[0x2b6cae0cd7e0]
[scc-wj3:74391] [ 1] /share/pkg/intel/2018/install/daal/…/tbb/lib/intel64_lin/gcc4.4/libtbbmalloc.so.2(+0x11bd5)[0x2b6cad152bd5]
[scc-wj3:74391] [ 2] /share/pkg/intel/2018/install/daal/…/tbb/lib/intel64_lin/gcc4.4/libtbbmalloc.so.2(+0x146a8)[0x2b6cad1556a8]
[scc-wj3:74391] [ 3] /share/pkg/intel/2018/install/daal/…/tbb/lib/intel64_lin/gcc4.4/libtbbmalloc.so.2(scalable_aligned_realloc+0x61)[0x2b6cad154eb1]
[scc-wj3:74391] [ 4] …/lammps-intel/src/lmp_intel_cpu_openmpi[0x6577eb]
[scc-wj3:74391] [ 5] …/lammps-intel/src/lmp_intel_cpu_openmpi[0x43d0a7]
[scc-wj3:74391] [ 6] …/lammps-intel/src/lmp_intel_cpu_openmpi[0x411bbf]
[scc-wj3:74391] [ 7] …/lammps-intel/src/lmp_intel_cpu_openmpi[0x62d2f3]
[scc-wj3:74391] [ 8] …/lammps-intel/src/lmp_intel_cpu_openmpi[0x62948d]
[scc-wj3:74391] [ 9] …/lammps-intel/src/lmp_intel_cpu_openmpi[0x6532d1]
[scc-wj3:74391] [10] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2b6cae2f9d1d]
[scc-wj3:74391] [11] …/lammps-intel/src/lmp_intel_cpu_openmpi[0x408c29]
[scc-wj3:74391] *** End of error message ***

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.

Lattice spacing in x,y,z = 0.00111881 0.00111881 0.00111881
Created orthogonal box = (0 0 0) to (0.452 0.113 0.00111881)

mpirun noticed that process rank 3 with PID 0 on node scc-wj3 exited on signal 11 (Segmentation fault).

I am checking both possibilities: my own code, and the compiler toolchain used on the non-KNL cluster. Do you have any suggestions for other areas I might have overlooked?

You have to compile LAMMPS with debug info included and obtain a proper stack trace from a debugger to see where the error is originating. The information here is not sufficient to make any reliable diagnosis.
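For example, a rough sketch of one way to do that, assuming gdb is available and core dumps are enabled on the compute node (the input file name and the core file name below are just placeholders):

  # rebuild after adding -g to CCFLAGS and LINKFLAGS in your makefile
  ulimit -c unlimited                                 # allow a core file to be written
  mpirun -np 4 ./lmp_intel_cpu_openmpi -in in.test    # reproduce the crash
  gdb ./lmp_intel_cpu_openmpi core.<pid>              # core file name is system dependent
  (gdb) bt                                            # print the full stack trace

With debug info included, the backtrace will show file names and line numbers instead of bare addresses.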

Axel.

You could try building with “-DLMP_INTEL_NO_TBB” added to the CCFLAGS line in the makefile to turn off use of the TBB allocators.
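For example, assuming the build uses the bundled src/MAKE/OPTIONS/Makefile.intel_cpu_openmpi (adjust the makefile name if you build from a custom one), the change would look roughly like:

  # keep the existing flags and append the define, then force a full rebuild
  CCFLAGS = <existing flags> -DLMP_INTEL_NO_TBB

  make clean-intel_cpu_openmpi && make intel_cpu_openmpi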

If the problem persists, please do send me a reproducer if possible.

Thanks, Quang.

- Mike

Hi Mike,

Turning the TBB allocator off seems to have resolved the issue; thank you very much!

If I may ask, what exactly does TBB do here? And would turning Threading Building Blocks off noticeably affect the performance of the LAMMPS Intel + Open MPI build?

Thanks,
Quang

Hi Quang,

Sorry for the late reply; I was on a long vacation. Loading and storing contiguous data for vector calculations can be most efficient if the data is aligned to the vector width in memory (not split across cache lines). There is currently no standard way of requesting alignment with realloc() that I am aware of, so the aligned-realloc routine from TBB is used when the Intel compilers are available; otherwise a much less elegant approach is used. This is on the list to fix in a more general way for LAMMPS…
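To illustrate the idea, here is a minimal sketch (not the actual USER-INTEL code; the function name is made up, and the preprocessor symbol just mirrors the -DLMP_INTEL_NO_TBB flag mentioned above):

  #include <cstdlib>
  #include <cstring>
  #ifndef LMP_INTEL_NO_TBB
  #include <tbb/scalable_allocator.h>   // provides scalable_aligned_realloc()
  #endif

  // Grow a buffer while preserving a given alignment (e.g. 64 bytes for AVX-512).
  void *grow_aligned(void *ptr, size_t old_bytes, size_t new_bytes, size_t align) {
  #ifndef LMP_INTEL_NO_TBB
    // TBB can realloc and keep the alignment in one call
    // (assumes ptr originally came from scalable_aligned_malloc()).
    return scalable_aligned_realloc(ptr, new_bytes, align);
  #else
    // Fallback: allocate a fresh aligned block, copy the old data, free the old block.
    void *fresh = nullptr;
    if (posix_memalign(&fresh, align, new_bytes)) return nullptr;
    if (ptr) {
      memcpy(fresh, ptr, old_bytes < new_bytes ? old_bytes : new_bytes);
      free(ptr);
    }
    return fresh;
  #endif
  }

The fallback path keeps the alignment guarantee but always pays for an extra copy when a buffer is resized, which is roughly why it is the less elegant option.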

Best, - Mike