ReaxFF, problem with multi-processor runs

Dear LAMMPS users,

My research involves simulating reactions that occur in a SOFC. I have a model using a ReaxFF potential which runs when I use only one processor, and I obtained good results. Now I’m trying to improve the performance, so I want to run it in parallel. For this, I compiled LAMMPS with the KOKKOS package, and I launch the calculations on a cluster with multiple processors.

The model starts to run, but an error quickly appears (“segmentation fault”) and the calculation crashes.

I would like to know whether this could be a problem with bad dynamics in the model (although the same model runs perfectly on one processor), or whether someone knows what kind of problem it could be.

I’m using the 15 Sep 2022 version, and the potential is the “Reactive MD-force field: Cu, Ni, Co and all-carbon with Y/Zr/Ba”, parameterized by Boris V. Merinov et al.

Thank you very much for your attention!

Fran

Dear Fran,

Unfortunately, it is not so easy to provide an explanation and advice based on the limited information in your message.

One common problem with ReaxFF simulations that fits your description is that the internal memory management of the ReaxFF implementation in LAMMPS expects that the system does not change much (i.e. number of bonds, hydrogen bonds etc.). This may, however, differ between a single-processor run and a multi-processor run because of domain decomposition. The risk of a problem grows significantly, and faster than linearly, with the number of MPI processes used. However, this is just a guess. There may be one or more other issues at hand. Quite a few problems can happen only with multiple MPI processes and not with a single process. That is because in performance-critical parts of the code, LAMMPS does not spend the time to check whether an operation is “safe”, but will just assume that it is and then crash when it is not.
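
As a rough illustration of the memory management point above: in the plain (non-KOKKOS) ReaxFF implementation, the pre-allocated array sizes can be enlarged with the safezone, mincap and minhbonds keywords of pair_style reaxff. The line below is only a sketch. The numbers are placeholders and not recommendations for this system, NULL assumes that no separate control file is used, and as far as I know the KOKKOS variant does not use these keywords at all.

```
# Sketch only: enlarge the ReaxFF pre-allocation (plain, non-KOKKOS pair style).
# The values are illustrative assumptions, not tuned for this system.
# The documented defaults should be safezone 1.2, mincap 50, minhbonds 25.
pair_style reaxff NULL safezone 1.6 mincap 100 minhbonds 150
```

If a run that crashes with the default settings survives with larger values, that would point towards the memory management issue rather than towards bad dynamics.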

In order to give more specific advice (and have a better chance of assessing what the problem could be), we need access to your input, need to know the exact command line you use or the submit script (in case of running on a cluster with a batch system) and a log file or screen capture of a successful and a failed run.
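
For the log of the failed run, something along these lines in the input may help, so that the last thermo output before the crash actually reaches the log file. This is a sketch for a debugging run only, and the output interval is an arbitrary choice.

```
# Sketch for a debugging run: print thermo output every step and flush it
# immediately, so the log shows how far the run got before the crash.
thermo 1
thermo_modify flush yes
```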

Thanks,
Axel.

Thank you so much Axel. I’m going to try to optimize my model and then I will try again. If I get the error again, I will contact you.

The KOKKOS version is more robust in its memory management and shouldn’t fail with memory errors the way the original C++ version can. As Axel said, we’d need a minimal reproducer input to track down the segmentation fault. It would also be good to try the latest version of LAMMPS, since a bug may have been fixed.