Segmentation fault every time I try to run intel acceleration

Hello,

I am currently trying to test Intel acceleration.
This is the script I used to download and compile LAMMPS:

#!/bin/bash

VERSION=29Sep2021

echo "deleting old tarball..." 
rm stable_${VERSION}.tar.gz || true 
echo "deleting old lammps build..." 
rm -rf lammps-stable_${VERSION} || true 

wget https://github.com/lammps/lammps/archive/stable_${VERSION}.tar.gz
tar zxf stable_${VERSION}.tar.gz
cd lammps-stable_${VERSION}
mkdir build && cd build

module purge
module load intel/19.1.1.217
module load intel-mpi/intel/2019.7

cmake3 -D CMAKE_INSTALL_PREFIX=$HOME/.local.lammps.extra_molecule \
-D CMAKE_BUILD_TYPE=Release \
-D LAMMPS_MACHINE=user_intel \
-D ENABLE_TESTING=yes \
-D BUILD_OMP=yes \
-D BUILD_MPI=yes \
-D CMAKE_C_COMPILER=icc \
-D CMAKE_CXX_COMPILER=icpc \
-D CMAKE_CXX_FLAGS_RELEASE="-Ofast -xHost -DNDEBUG" \
-D PKG_MOLECULE=yes -D PKG_RIGID=yes -D PKG_MISC=yes \
-D PKG_KSPACE=yes -D FFT=MKL -D FFT_SINGLE=yes \
-D PKG_EXTRA-MOLECULE=yes -D PKG_USER-INTEL=yes \
-D PKG_INTEL=yes -D INTEL_ARCH=cpu -D INTEL_LRT_MODE=threads ../cmake

make -j 16
make install

I need the EXTRA-MOLECULE package for fourier-style dihedrals. To enable Intel acceleration, I included PKG_USER-INTEL.
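For reference, the fourier dihedral style takes the number of terms followed by one (K, n, d) triple per term; a minimal sketch with placeholder coefficients (not my actual force-field values):

dihedral_style  fourier
dihedral_coeff  1  2  0.25 3 0.0  0.10 2 180.0   # type 1: m=2 terms, each given as K n d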

Now, whenever I try to run an Intel-accelerated simulation using the following command (lmp_user_intel is my LAMMPS executable):
lmp_user_intel -sf intel -in in.file
I get the following segfault:

Setting up cg style minimization ...
  Unit style    : real
  Current step  : 0
[stellar-intel:366883:0:366883] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid: 366883) ====
 0  /usr/local/ucx/1.9.0/lib64/libucs.so.0(ucs_handle_error+0x2a4) [0x14d20cf99154]
 1  /usr/local/ucx/1.9.0/lib64/libucs.so.0(+0x2232c) [0x14d20cf9932c]
 2  /usr/local/ucx/1.9.0/lib64/libucs.so.0(+0x224fa) [0x14d20cf994fa]
 3  /lib64/libpthread.so.0(+0x12c20) [0x14d211b19c20]
 4  lmp_user_intel() [0xd4b291]
 5  /opt/intel/compilers_and_libraries_2020.1.217/linux/compiler/lib/intel64/libiomp5.so(__kmp_invoke_microtask+0x93) [0x14d211e53cc3]
 6  /opt/intel/compilers_and_libraries_2020.1.217/linux/compiler/lib/intel64/libiomp5.so(__kmp_fork_call+0x3f7) [0x14d211dd9947]
 7  /opt/intel/compilers_and_libraries_2020.1.217/linux/compiler/lib/intel64/libiomp5.so(__kmpc_fork_call+0x183) [0x14d211d9d5c3]
 8  lmp_user_intel() [0xd4e3d5]
 9  lmp_user_intel() [0xd351fe]
10  lmp_user_intel() [0x4b575e]
11  lmp_user_intel() [0x4bba73]
12  lmp_user_intel() [0x4440b8]
13  lmp_user_intel() [0x441b67]
14  lmp_user_intel() [0x40b827]
15  /lib64/libc.so.6(__libc_start_main+0xf3) [0x14d20fe43493]
16  lmp_user_intel() [0x40b6ee]
=================================
Segmentation fault (core dumped)

However, if I simply do

lmp_user_intel -in in.file

It runs perfectly fine. My question is: what am I doing wrong that keeps Intel acceleration from working? I have attached my input file to this message.

I would appreciate any advice you have for me.

in.pnipam (4.2 KB)

Please also provide the data and settings files.

Thank you for your response, @akohlmey! I appreciate you taking the time.

I have attached the data file (sys.p1w.data) and the settings file (sys.p1w.settings) to this message.

sys.p1w.data (2.8 MB)
sys.p1w.settings (6.3 KB)

Question: why do you use pair style lj/long/coul/long and not lj/cut/coul/long?

I do not have a deep answer for this. I just thought using lj/long would be a little more accurate than lj/cut. Does this affect Intel acceleration?

With the “cut long” settings for the pair style, you have exactly the same potential as with the lj/cut/coul/long pair style (treating the dispersion as long-range as well would also require a different kspace style). When I switch the pair style, your input runs on my machine. This doesn’t fix the bug, but it avoids it; the bug still needs to be fixed, but in the meantime you can run your calculation.
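A minimal sketch of that switch, assuming the original settings file contains something like pair_style lj/long/coul/long cut long 10.0 (use your actual cutoff):

pair_style    lj/cut/coul/long  10.0
kspace_style  pppm 1.0e-05

The plain pppm kspace style remains sufficient here because the dispersion stays cut; only a “long long” setting would require ewald/disp or pppm/disp.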

Thank you for your advice. The computation seems to be running without problems!

Actually, I might have spoken too soon. The simulation runs fine without -sf intel, but when I run it with -sf intel, it crashes at the NPT stage.
The following is the stdout and stderr right after the NVT step:

Performance: 48.402 ns/day, 0.496 hours/ns, 560.205 timesteps/s
99.2% CPU use with 96 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 6.8871     | 7.4039     | 8.0429     |   9.6 | 41.48
Bond    | 0.0049542  | 0.027534   | 0.19299    |  27.2 |  0.15
Kspace  | 4.5158     | 5.3119     | 5.7596     |  11.5 | 29.76
Neigh   | 0.41938    | 0.4221     | 0.42559    |   0.2 |  2.36
Comm    | 1.8177     | 1.8761     | 1.9599     |   2.0 | 10.51
Output  | 0.0049378  | 0.0050321  | 0.0050376  |   0.0 |  0.03
Modify  | 2.5665     | 2.634      | 2.6744     |   1.1 | 14.76
Other   |            | 0.1701     |            |       |  0.95

Nlocal:        287.740 ave         308 max         270 min 
Histogram: 3 5 6 22 16 25 9 9 0 1 
Nghost:        6321.50 ave        6383 max        6272 min 
Histogram: 2 8 13 25 15 11 12 5 2 3 
Neighs:        133070.0 ave      149678 max      120789 min 
Histogram: 9 6 8 24 12 19 9 5 2 2 

Total # of neighbors = 12774687
Ave neighs/atom = 462.46559
Ave special neighs/atom = 2.1904210
Neighbor list builds = 520 
Dangerous builds = 0 
Ran NVT step!
Finding SHAKE clusters ... 
       0 = # of size 2 clusters
       0 = # of size 3 clusters
       0 = # of size 4 clusters
    9017 = # of frozen angles
  find clusters CPU = 0.028 seconds
About to kick off npt...
PPPM initialization ... 
  using 12-bit tables for long-range coulomb (src/kspace.cpp:340)
  G vector (1/distance) = 0.3070986
  grid = 40 40 40
  stencil order = 7 
  estimated absolute RMS force accuracy = 0.0028115308
  estimated relative force accuracy = 8.4668416e-06
  using single precision MKL FFT 
  3d grid and FFT values/proc = 5415 800 
----------------------------------------------------------
Using Intel Package without Coprocessor.
Precision: mixed
----------------------------------------------------------
Setting up Verlet run ... 
  Unit style    : real
  Current step  : 10554
  Time step     : 1 
ERROR: Non-numeric pressure - simulation unstable (src/fix_nh.cpp:1069)
Last command: run   10000
srun: error: stellar-i02n9: tasks 0-95: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=118399.0

The computation is being run on 96 cores.

I made the minimization more stringent and allowed for a longer NVT equilibration, but that does not seem to help with the system breaking down when NPT is started. Again, without -sf intel, everything runs fine.

Is there another piece of information that I am missing?

When using -sf intel you are using a different code path, which uses cached copies of the data to make them properly aligned for vectorization. These transformations are not always well debugged for the case of multi-step runs. There have been several bug reports about similar issues in the past, and - when suitably documented - those bugs eventually get fixed. Please keep in mind that the INTEL package is contributed code and some of its maintainers are rather busy while others have moved on, so bug fixes can take time to make it back into the LAMMPS distribution.

My suggestion for a workaround is to write out data files (or restarts) and split the multi-step run into separate runs, each with its own input file that reads the state from the previous run.
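A rough sketch of that split; file names, thermostat/barostat parameters, and run lengths below are only placeholders:

# in.stage1: setup, minimization and NVT as before, then save the state
minimize        1.0e-4 1.0e-6 1000 10000
fix             1 all nvt temp 300.0 300.0 100.0
run             10000
unfix           1
write_data      after_nvt.data

# in.stage2: a separate input file that starts a fresh run from that state
# (declare units, styles, and include the settings file first, as in the original input)
read_data       after_nvt.data
fix             1 all npt temp 300.0 300.0 100.0 iso 1.0 1.0 1000.0
run             10000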

I did some careful auditing of the INTEL package code used by your input and discovered an off-by-one bug in the dihedral style fourier/intel.

So please try with the following additional modification (after changing the pair style):

angle_style   harmonic
suffix off
dihedral_style  fourier
suffix on
kspace_style   pppm 1.0e-05

Now I am trying to identify the issue in pair style lj/long/coul/long/intel…

Hello @akohlmey, sorry for not responding sooner! I tried your first suggestion of making the NPT step a separate run after the minimization and NVT equilibration. It worked, thanks a ton.

I will also try the version above and let you know how it goes. Thank you again!

I have also identified the reason for the segfault when using pair style lj/long/coul/long.
It turns out to be just a wrapper class without any vectorization or other optimization, so using lj/cut/coul/long should give better performance anyway. The segfault stems from the fact that the other INTEL package styles require the pair style to set up some special buffers in a rather indirect way, but this wrapper was not doing that.


@akohlmey, I ran some more tests. I was seeing a lot of inconsistent behavior with -sf intel even after splitting the runs into energy minimization + NVT and NPT. Inconsistent in the sense that it would run sometimes and not run other times, with NO changes made.

But your fix of suffix off ... suffix on resolved that issue as well. Thank you again.

The corresponding changes were committed to the development branch yesterday.
So if you download the snapshot from https://github.com/lammps/lammps/archive/refs/heads/develop.tar.gz
and compile a new executable, it should work with your original input (though lj/cut/coul/long will still be faster, as noted before).
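For example, a rebuild along the lines of the script you posted above (same modules and install prefix; note that on the develop branch the package is called INTEL, so the flag is PKG_INTEL rather than PKG_USER-INTEL):

wget https://github.com/lammps/lammps/archive/refs/heads/develop.tar.gz
tar zxf develop.tar.gz
cd lammps-develop
mkdir build && cd build

module purge
module load intel/19.1.1.217
module load intel-mpi/intel/2019.7

cmake3 -D CMAKE_INSTALL_PREFIX=$HOME/.local.lammps.extra_molecule \
-D CMAKE_BUILD_TYPE=Release \
-D LAMMPS_MACHINE=user_intel \
-D BUILD_OMP=yes -D BUILD_MPI=yes \
-D CMAKE_C_COMPILER=icc \
-D CMAKE_CXX_COMPILER=icpc \
-D CMAKE_CXX_FLAGS_RELEASE="-Ofast -xHost -DNDEBUG" \
-D PKG_MOLECULE=yes -D PKG_RIGID=yes -D PKG_MISC=yes \
-D PKG_KSPACE=yes -D FFT=MKL -D FFT_SINGLE=yes \
-D PKG_EXTRA-MOLECULE=yes -D PKG_INTEL=yes \
-D INTEL_ARCH=cpu -D INTEL_LRT_MODE=threads ../cmake

make -j 16
make install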