error while running LAMMPS on more than 2 nodes

Dear Sir,

My student is running a job using LAMMPS (the version released 14 July 2016).

His code runs fine on two compute nodes of a cluster with Intel mpirun Version 4.1 Update 3.

But whenever he goes from 2 to 4 or 6 nodes, the job crashes with the error message below.

[11:node6] unexpected disconnect completion event from [10:node5]
Assertion failed in file ../../dapl_conn_rc.c at line 1179: 0
internal ABORT - process 11

This error has been reported on the Internet, but I could not find a solution to it.

Please help.

it is extremely difficult to provide specific help with so little information.

there should be output from *before* that error message. the message
is from the low-level communication library and not LAMMPS.

anyway, this looks a lot like there could be a problem with the
setup/hardware of the cluster you are running on.
so i would recommend testing for that, e.g. by running the examples in
the "bench" or "examples/*" folders under the same conditions.
if bench/in.lj crashes with the same error, then it is *extremely*
likely that the fault is with your cluster; this input runs with any
LAMMPS version even without any packages installed.
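
as a rough sketch of such a test, the python snippet below only builds
the command lines for running bench/in.lj at several node counts, so
they can be pasted into a job script. the executable name "lmp_mpi",
the value of 2 processes per node, and the helper name
build_bench_commands are assumptions for illustration and have to be
adapted to the actual cluster; "-np" and "-in" are the usual mpirun and
LAMMPS options.

```python
import shlex


def build_bench_commands(node_counts, procs_per_node,
                         lammps_exe="lmp_mpi", input_file="in.lj"):
    """Return one mpirun argument list per node count.

    lammps_exe and procs_per_node are placeholders (assumptions);
    use the executable name and core count of your own installation.
    """
    commands = []
    for nodes in node_counts:
        nprocs = nodes * procs_per_node  # total number of MPI ranks
        commands.append(
            ["mpirun", "-np", str(nprocs), lammps_exe, "-in", input_file]
        )
    return commands


if __name__ == "__main__":
    # 2 nodes is the case that works, 4 and 6 are the ones that crash
    for cmd in build_bench_commands([2, 4, 6], procs_per_node=2):
        print(shlex.join(cmd))
```

the commands need to be run from within the "bench" folder and with
the same host list or batch system settings as the failing job,
otherwise the test does not cover the same conditions.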

if those inputs work with > 2 nodes, then you need to check whether
the crash is due to a bug that has already been fixed: update
LAMMPS to the very latest patch level and test again with that.
if this fails as well, we need to see a *complete* input deck of the
*smallest* possible system that can reproduce the crash with 4-8
processors in less than 5 minutes (the shorter the better).

axel.