LAMMPS crashing - OpenMPI + gzip'ed files => network errors

Dear All,

After recently moving LAMMPS to a new cluster, I had major problems.
GROMACS, NAMD, Amber, etc. run without problems, which seems
to rule out hardware issues.

After a month of debugging and complaining to the system
administrators, I have traced the problems to a side effect of using
gzip'ed input/output files with LAMMPS.

I'm reporting this since other people might be hit by the same
problem, or might find this report useful when searching the list.

I'm using either gcc/4.8-c7 + OpenMPI 1.8.6 or intel/2015.2 + Intel MPI
as compiler toolchains on a new cluster where each node has two Intel
E5-2680v3 CPUs, i.e. 24 cores/node, and I'm mostly running
4-node jobs using all 24 cores per node. The interconnect is InfiniBand
FDR (56 Gbit/s). The queue system is SLURM, and I'm simulating
variations of Kremer-Grest polymer melts.

Almost all jobs either hung (mostly Intel) or crashed
with a variety of network-related errors (mostly gcc+OpenMPI). The
times and circumstances of the crashes appeared quite random:
submitting the same job 5 times, perhaps 4 would crash and
1 would run. Reducing the number of cores also increased the
chance of a job running.

Common to most runs using OpenMPI was that at some stage,
preceding the crash, I got the following warning; Intel MPI issued
no warnings.

--------------------------------------------------------------------------
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process. Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption. The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

   Local host: s61p32 (PID 31832)
   MPI_COMM_WORLD rank: 0

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------

The simulation would then appear to proceed happily for e.g. 5000-100000
steps, but then crash (gcc+OpenMPI) with errors such as "The OpenFabrics
stack has reported a network error event.", "WORK REQUEST FLUSHED
ERROR", "RECEIVER NOT READY RETRY EXCEEDED ERROR",
or "REMOTE ACCESS ERROR".

The Intel build would hang silently until the queue system killed the
job when the maximal job duration was reached. Logging into the nodes
and using gstack showed that the hung processes were stuck in MPI_Wait
or MPI_Allreduce calls.

What I have realised is that the warning above is generated by the popen()
call LAMMPS uses to read/write gzip'ed files, since popen() internally calls
clone() (i.e. fork()), e.g. when the first gzip'ed dump file is written in a
restarted simulation.
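
For illustration, the pipe-based pattern looks roughly like this (a
minimal sketch of the general approach, not the actual LAMMPS source;
the function name and the dump contents are only illustrative):

  // Writing a gzip'ed dump through a pipe: popen() forks a shell that
  // runs gzip, and that child process is what triggers the OpenMPI
  // fork() warning shown above.
  #include <stdio.h>
  #include <string>

  void write_gz_dump(const std::string &filename)
  {
    // "gzip > file.gz" runs in a child process created via fork()/clone()
    std::string cmd = "gzip -6 > " + filename;
    FILE *fp = popen(cmd.c_str(), "w");
    if (fp == NULL) return;

    fprintf(fp, "ITEM: TIMESTEP\n0\n"); // dump data streams through the pipe
    // ... more dump output ...

    pclose(fp);                         // waits for the gzip child to exit
  }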

I do not know the cause-effect relationship leading to the crashes at much
later times (e.g. 5000-100000 steps later), but what solved it was not to
use gzip'ed files for input/output. With this trivial modification my LAMMPS
simulations have been running completely stably.

please note that OpenMPI version 1.8.6 has some serious memory leak
issues that are exposed by LAMMPS; it is highly recommended to upgrade
to a newer OpenMPI version (1.8.8 is the latest right now).

apart from that, this behavior is a known issue for *all* MPI
implementations that use pinned memory for RDMA communication and
dates back to the times when Myrinet was the prevalent high-speed
interconnect. OpenMPI is indeed one of the few MPI implementations
that gives a meaningful warning. the reason the errors happen much
later is too technical to explain here.

it would also affect the dump movie style or generic shell commands,
since they all use the fork() library call underneath.

axel.

FWIW,

i just sat down and implemented a set of dump styles that use zlib
library calls instead of a pipe to a gzip executable, which avoids
the fork() issue on pinned-memory based high-speed communication. if
somebody is interested in trying them out, i'm including them in the
LAMMPS-ICMS branch. i've implemented new dump styles atom/gz, cfg/gz,
custom/gz, and xyz/gz, which should work like their non-gz versions
except that a .gz suffix is required.
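
the core idea is roughly the following (just a sketch of the zlib
pattern, not the actual LAMMPS-ICMS code; names and details are only
illustrative):

  // Writing the compressed dump directly through zlib: no gzip child
  // process is spawned, so no fork() happens and pinned RDMA buffers
  // are left alone.
  #include <zlib.h>

  void write_gz_dump_zlib(const char *filename)
  {
    gzFile gz = gzopen(filename, "wb6");  // "6" = compression level 6
    if (gz == NULL) return;

    gzprintf(gz, "ITEM: TIMESTEP\n0\n");  // dump data written via zlib calls
    // ... more dump output via gzprintf()/gzwrite() ...

    gzclose(gz);
  }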

cheers,
     axel.

Dear Axel,

Thanks for the explanation of this bug. I've got a copy
of LAMMPS-ICMS and will give it a test. I've also forwarded
your explanation to our system admins to get OpenMPI
updated.