[lammps-users] LAMMPS parallel REAXC long time NVT simulation quit abnormally

Hi friends,

I set up my simulation with REAXC in a parallel LAMMPS environment. The thermostat is NVT at 1000 K. The target simulation time is 400 ps; however, the simulation stopped at about 12 ps with the error message below. With the serial Fortran code, my input files run fine.

My system is not large, roughly 400 atoms. I have tried 4, 8, and 24 CPUs for parallel computing, but the error message is the same. Has anyone met a similar case, and do you have any suggestions? It is hard to believe such a small system would have too many bonds to assign during the simulation. I remember someone reported a stability issue for NVE with ReaxFF; perhaps this is another problem with REAX/C in LAMMPS.
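
For context, a minimal reax/c input deck of this kind looks roughly like the following (the force-field file, element mapping, timestep, and damping constants are placeholders, not necessarily my exact settings):

    units           real
    atom_style      charge
    read_data       data.test

    pair_style      reax/c NULL
    pair_coeff      * * ffield.reax C H O     # placeholder force field and elements

    fix             1 all qeq/reax 1 0.0 10.0 1.0e-6 reax/c   # charge equilibration
    fix             2 all nvt temp 1000.0 1000.0 100.0        # NVT at 1000 K

    timestep        0.25      # fs; 400 ps = 1,600,000 steps at this timestep
    thermo          100
    run             1600000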

p2: not enough space for bonds! total=46496 allocated=46452
application called MPI_Abort(MPI_COMM_WORLD, -14) - process 2
MPI process (rank: 2) terminated unexpectedly on …
Exit code -5 signaled from …

Maybe Aidan and Metin can comment. This is an internal error
in the REAX/C code. Is your system well behaved? I.e., do the
output thermo statistics look reasonable?

Steve

Yes, the temperature converges well to the target value (1000 K in my case). In VMD, the trajectory also looks reasonable. On TACC I get a similar error message, so I thought it might be a problem with memory allocation/deallocation in the code.

Thanks,
hengji

Does it always crash at the same timestep? Could you reproduce this running
on a single processor? Can you reproduce it from a restart file, so that it
will crash in just a few timesteps? It is probably something fairly simple
to fix in the dynamic memory allocation, but it would be hard to find
without a fast-failing example.
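
For example, something along these lines (file names and step counts are placeholders) would dump periodic restarts in the failing run and then restart just before the crash; note that pair_style reax/c does not write its settings to restart files, so the pair style, pair coeffs, and fixes must be re-specified:

    # in the failing input: dump a restart file every 100 steps
    restart         100 tmp.restart       # writes tmp.restart.100, tmp.restart.200, ...
    run             1400

    # in a new input: restart from the last file written before the crash
    read_restart    tmp.restart.1200      # 1200 here is a placeholder
    pair_style      reax/c NULL           # re-specify; not stored in the restart
    pair_coeff      * * ffield.reax C H O
    fix             1 all qeq/reax 1 0.0 10.0 1.0e-6 reax/c
    fix             2 all nvt temp 1000.0 1000.0 100.0
    run             200                   # should now fail within a few steps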

Aidan

Hi Aidan,

I did more tests with LAMMPS on TACC. With 36 CPUs (i.e., 3 nodes on TACC) for a 384-atom system, it crashes at the same time step every time; for example, it always stops at ~step 1300. I then let the simulation stop at step 1299, and it finished fine. After that I used a restart file to continue for another 1299 steps with no problem, and the result also looks good. So when I hit an unavoidable stopping point, I can stop just before it and continue from a restart file. This is not very convenient, though, since we first have to find the stopping point. You mentioned that a fast-failing case would help diagnose the dynamic memory allocation. Is 1300 steps small enough?
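
Concretely, my workaround is along these lines (the restart file name is a placeholder, and the reax/c pair style, pair coeffs, and fixes are re-specified in the second script as usual):

    # script 1: stop just before the known crash step
    run             1299
    write_restart   before_crash.restart

    # script 2: continue from the saved state
    read_restart    before_crash.restart
    run             1299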

As for the single-processor case, I also ran a test: it runs with no problem up to 200 ps.

Thanks,
Hengji

Can you get it to fail after about 1000 steps on fewer processors, like 1 or 2?

I have tested with 4 processors; it stops at step 51900. On a single processor, it runs OK at least up to 200 ps.

Dear Hengji,

Thank you for your detailed description of the problem. Would you mind sharing your data.test file so that I can take a look? Since it is only 400 atoms, I guess it does not take much time to run 51900 steps with 4 processors, right?

Thanks,
Metin