Error in LAMMPS simulation: MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD

Hello everyone,

I am running a LAMMPS simulation using ReaxFF. After 16,000 steps the simulation stops with the following error:

MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
Proc: [[31507,0],0]
Errorcode: 1

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

srun: error: a001: task 11: Exited with exit code 1

Neither log.lammps nor the Slurm output shows any specific error message. The last line of the log corresponds to the last completed step of the simulation.

Could this be related to the number of processors I am using (16), or could it be a memory error?

I am attaching my input file, my log file, and my error file:

a-relaxation.lammps (3.0 KB)
log.lammps (71.1 KB)
slurm.1356055.err (7.4 KB)

You can debug this by using the LAMMPS command-line flag -nb or -nonbuf to turn off output buffering. This will reduce performance somewhat, but you should no longer get a truncated log.lammps file that is missing its final lines, including the actual error message.

This is likely related to the “quirky” memory management of the plain ReaxFF implementation. By default it assumes that the initial geometry is already equilibrated and that the environment doesn’t change much, i.e. that it is a dense system that does not reconstruct or redistribute much.
This assumption is not met when you use fix deform.

You have multiple options to avoid or postpone this.

  • use the KOKKOS version of ReaxFF (it can be compiled in Serial mode, or with OpenMP only), which manages memory differently
  • break down your “run 15000” command with fix deform active into multiple shorter run commands that use the start/stop keywords to distribute the intended deformation across those segments. See the run command docs for details.
  • increase the “safezone” parameter of the ReaxFF pair style (this can be combined with the previous option)
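
The second and third options could look roughly like the fragment below. This is only a sketch, not taken from the attached input: the fix ID "def", the deformation settings, the safezone value 1.6, and the assumption that the deformation starts at timestep 0 are all placeholders to adapt. Older LAMMPS versions name the pair style reax/c instead of reaxff.

```lammps
# Sketch only: merge into the existing input, keeping pair_coeff,
# the charge equilibration fix, and everything else unchanged.

# Larger safezone than the documented default of 1.2 (1.6 is an illustrative value)
pair_style reaxff NULL safezone 1.6

# Placeholder deformation; keep the settings from your own fix deform line
fix def all deform 1 x scale 1.2 remap x

# One "run 15000" split into three segments that share a single ramp.
# start/stop are absolute timestep numbers: 0 and 15000 assume the
# deformation begins at step 0. If earlier runs already advanced the
# timestep, shift both values by that offset.
run 5000 start 0 stop 15000
run 5000 start 0 stop 15000
run 5000 start 0 stop 15000
```

Because every segment uses the same start and stop values, fix deform interpolates along one overall ramp rather than restarting the deformation in each segment. Each new run command also repeats the run setup, which is the point of splitting the run here.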