MPI_ABORT with error code 1

Dear all,

I am experiencing a weird problem when running LAMMPS.

There are about 40k atoms in my system, and the memory usage per processor is about 4.3 MB (as reported by LAMMPS).

When I run a small number of timesteps (about 100) to test the setup of this system, everything is fine. But when I increase the run to about 1M timesteps, the simulation terminates at some timestep between 100k and 900k. The only error message is the one below. I am not sure whether it is a problem with the computer cluster or something is wrong in my system setup. Can someone tell me what is happening here?

"MPI_ABORT was invoked on rank 25 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them."

Aiqun

There is not enough information here. Please include details such as your hardware and LAMMPS version, and attach a simple input deck.

Ray

Hi Ray,

Thanks for replying. The LAMMPS version is 9 Dec 2014 (#define LAMMPS_VERSION "9 Dec 2014"), and it was built with openmpi-1.8.3.

The input script is attached.

Aiqun

input.triblock (2.25 KB)

Please update to the latest version of LAMMPS and try again as the issues might have been fixed.

Also note that your script is not very useful without the data file.

Ray

nobody is going to run it for 100k to 900k timesteps anyway.

regardless of that, there should be some other output *before* the
MPI_abort message signifying what is happening to the system. for
example, the system could have reached a state where the integration
of the equations of motion becomes numerically unstable and/or
particles move too fast.
this often manifests in "lost atoms" or energies suddenly increasing
to unreasonable values.
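
One way to check for such signs is to scan the saved output for sudden jumps and for error or warning lines. The following is only a minimal sketch: the function names and the 5% threshold are assumptions made for illustration, and it expects two-column lines (timestep, value) like the ones posted in this thread, where the meaning of the second column is not stated.

```python
def parse_thermo(lines):
    """Collect (step, value) pairs from two-column output lines."""
    rows = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue
        try:
            rows.append((int(parts[0]), float(parts[1])))
        except ValueError:
            continue  # header text, or a line cut off by the abort
    return rows


def find_jumps(rows, rel_tol=0.05):
    """Return (step, previous, current) where the value changed by more than rel_tol."""
    jumps = []
    for (_, v0), (s1, v1) in zip(rows, rows[1:]):
        if v0 != 0 and abs(v1 - v0) / abs(v0) > rel_tol:
            jumps.append((s1, v0, v1))
    return jumps


def find_messages(lines):
    """Return lines that look like LAMMPS error or warning messages."""
    return [ln for ln in lines if "ERROR" in ln or "WARNING" in ln]


# Sample values copied from the output posted in this thread.
sample = """814300 1.0043277
814400 1.0050779
814500 1.0017614
814600 1.0055619
814700 1.0045624
""".splitlines()

rows = parse_thermo(sample)
jumps = find_jumps(rows)
messages = find_messages(sample)
print(len(rows), "rows parsed;", len(jumps), "jumps;", len(messages), "messages")
```

On the posted lines this reports no jumps, which only says that the printed quantity looked stable at the sampled steps; an instability could still develop between two output steps.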

axel.

I’ve also seen these issues when MPI software has glitched and communication has been terminated. Can you restart the simulation from where it was terminated?

Jim Kress

The simulation just terminates suddenly without generating the restart file, and it indeed doesn't give any other message.
Here is the last part of the output of the simulation (the last 100 lines of the output are attached).

814300 1.0043277
814400 1.0050779
814500 1.0017614
814600 1.0055619
814700 1.0045624
814800 1.005667
--------------------------------------------------------------------------

MPI_ABORT was invoked on rank 25 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

out_last_100.txt (2.48 KB)

Aiqun,
You can write a restart file a couple of hundred timesteps before the simulation crashes, e.g. at 814700. Then restart from it and write a dump file at much shorter intervals, so that you can scrutinize the simulation variables (as Axel suggested). Also, for the output that you've attached, you should specify what those columns of numbers are. The first one should be the timestep number, but what is the second column?
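
As a rough sketch of this two-stage rerun, the small Python helper below picks the restart step and prints LAMMPS input fragments for both stages. The file names (pre_crash.restart, debug.dump), the dump ID and the helper names are made up for illustration, and the 100-step output interval is read off the posted output. It also assumes the original input has a single long run starting at timestep 0. A rerun is not guaranteed to reproduce the crash at the same timestep.

```python
def debug_rerun_plan(crash_step, thermo_every=100, margin_intervals=1, extra_steps=100):
    """Pick a restart step shortly before the crash and the length of the debug run."""
    last_printed = (crash_step // thermo_every) * thermo_every
    restart_step = last_printed - margin_intervals * thermo_every
    debug_steps = crash_step - restart_step + extra_steps
    return restart_step, debug_steps


def render_stage1(restart_step, restart_file="pre_crash.restart"):
    """LAMMPS lines that replace the long run: stop early and save the state."""
    return "\n".join([
        f"run {restart_step} upto",
        f"write_restart {restart_file}",
    ])


def render_stage2(debug_steps, restart_file="pre_crash.restart"):
    """LAMMPS lines for a short rerun with per-step thermo and dump output."""
    return "\n".join([
        f"read_restart {restart_file}",
        "# re-issue the fixes and other settings from the original input here",
        "thermo 1",
        "thermo_modify flush yes",
        "dump dbg all custom 1 debug.dump id type x y z vx vy vz",
        f"run {debug_steps}",
    ])


# 814800 is the last timestep printed before the abort in the posted output.
restart_step, debug_steps = debug_rerun_plan(814800)
stage1 = render_stage1(restart_step)
stage2 = render_stage2(debug_steps)
print(stage1)
print()
print(stage2)
```

With the posted numbers this restarts at 814700 and reruns 200 steps with output on every step, so the atoms or values that misbehave first should be visible in the dump if the crash recurs.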

Best,
Kasra.