I am experiencing a weird problem when running LAMMPS.
There are about 40k atoms in my system, and the memory usage per processor is about 4.3 MB (as reported by LAMMPS).
When I run a small number of timesteps (~100) to test the setup of this system, everything is fine. But when I increase to about 1M timesteps, the simulation terminates at some timestep between 100k and 900k. The only error message is something like the one below. I am not sure whether it's a problem with the computer cluster or something wrong in my system setup. Can someone tell me what's happening here?
"MPI_ABORT was invoked on rank 25 in communicator MPI_COMM_WORLD
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them."
Please update to the latest version of LAMMPS and try again, as the issue may already have been fixed.
Also note that your script is not very useful without the data file; nobody is going to run it for 100k to 900k timesteps anyway.
Regardless of that, there should be some other output *before* the MPI_ABORT message signifying what is happening to the system. For example, the system could have reached a state where the integration of the equations of motion becomes numerically unstable and/or particles move too fast.
This often manifests in "lost atoms" errors or energies suddenly increasing to unreasonable values.
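One way to catch this before the crash is to print thermodynamic output more frequently and have LAMMPS warn (rather than abort) when atoms are lost, so the log shows where things start to go wrong. A minimal sketch of the relevant input-script lines (the 100-step interval is just an illustrative choice):

```
# print thermo info every 100 steps so a blow-up is visible in the log
thermo          100
thermo_style    custom step temp press pe ke etotal

# warn instead of aborting when atoms are lost,
# so the log records when/where the system becomes unstable
thermo_modify   lost warn
```

With this in place, a run headed for instability typically shows temperature or energy diverging in the log well before the MPI_ABORT.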
I’ve also seen these issues when MPI software has glitched and communication has been terminated. Can you restart the simulation from where it was terminated?
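If restart files were written periodically during the run, continuing from the last checkpoint is straightforward. A sketch, assuming restarts were written every 10000 steps with alternating file names (interval and file names here are just examples):

```
# in the original input: write alternating restart files every 10000 steps,
# so at least one valid checkpoint survives even if a write is interrupted
restart         10000 poly.restart.a poly.restart.b

# in a new input script: continue from the most recent valid checkpoint
read_restart    poly.restart.a
# note: fixes, computes, and dumps are not stored in the restart
# and must be re-specified here before the run continues
run             200000
```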
Aiqun,
You can write a restart file a couple of hundred timesteps before the simulation crashes, e.g. at 814700. Then write a dump file more frequently in time to scrutinize the simulation variables (as Axel suggested). Also, in the output you attached, you should specify what the columns of numbers are: the first one should be the timestep number, but what is the second column?
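For example, something along these lines (the step number, filenames, and dump interval are illustrative, assuming the crash was near step 815000):

```
# continue from a restart written shortly before the crash
read_restart    restart.814700

# dump coordinates and velocities every 10 steps
# to inspect which atoms misbehave just before the crash
dump            1 all custom 10 dump.precrash.* id type x y z vx vy vz

# print thermo output every step in this window
thermo          1
run             1000
```

Replaying only the last few hundred steps this way is cheap and usually makes the source of the instability (a bad contact, an overlapping pair of atoms, a runaway velocity) easy to spot in the dump files.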