Segmentation fault at high total number of timesteps

Hello,

I am new to LAMMPS, but I am facing a basic problem with the code that seems quite strange. I am using the ReaxFF potential (USER-REAXC package), and when I run a script in parallel I get a segmentation fault at timesteps above 100,000. The same problem does not occur when I run on one processor.
I reproduced the problem using one of the LAMMPS examples (path: examples/reax/FeOH3/), with the only modification being an increase of the total number of timesteps from 3,000 to 3,000,000. I get a segmentation fault at roughly timestep 300,000. I attach the script below for reference. I tested with both recent versions of LAMMPS, i.e. “lammps-4Feb20” and “lammps-7Aug19”, with different build options (“make mpi” and “make intel_cpu_intelmpi”), on two different machines, and with 4, 16, and 24 processors. Could you please advise me on the matter?

Thank you very much,
Efstratios

# REAX potential for Fe/O/H system

units           real
atom_style      charge

read_data       data.FeOH3

pair_style      reax/c NULL              # lmp_control
pair_coeff      * * ffield.reax.Fe_O_C_H H O Fe

neighbor        2 bin
neigh_modify    every 10 delay 0 check no

fix             1 all nve
fix             2 all qeq/reax 1 0.0 10.0 1e-6 param.qeq
fix             3 all temp/berendsen 500.0 500.0 100.0

timestep        0.25

thermo          1
#dump           1 all atom 30 dump.reax.feoh

run             3000000

Have you visualized the trajectory? Does it show unusual behavior? Does the structure change a lot?
Are there any indications in the thermo output that things don’t work properly?
Does the segfault also happen if you break down the single run into multiple run statements?
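For example, something along these lines (just a sketch; the variable and label names are arbitrary and the segment length is only for illustration):

# split the single long run into 10 segments of 300,000 steps each
variable seg loop 10
label    segloop
run      300000
next     seg
jump     SELF segloop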

Axel.

Dear Axel,

Thank you very much for your quick reply and suggestions. I had checked these points in previous simulations, but to be more specific I am forwarding one case study based on the REAXFF example (attached files: in.FeOH3, in.FeOH3_restart, log.lammps). The case was run on 24 processors, using LAMMPS compiled with “make mpi” and the USER-REAXC package. The segmentation fault happened at timestep 72,900.
Concerning the results:
1) The thermo output values seem correct.
2) The trajectories seem reasonable as well, with a tendency to agglomerate toward the final timesteps (though since the number of atoms is small, I do not see how this could cause a problem, e.g. with the number of atoms per neighbor-list page setting).
3) The structure does not change substantially; apart from the aforementioned agglomeration nothing changes, no reactions are observed, and there is no bizarre bond formation.
4) Finally, I tested restarting the simulation from a restart file written at timestep 70,000 (see file in.FeOH3_restart) and, to my surprise, the segmentation fault did not appear again (tested up to timestep ~650,000).
I should also mention that, when running the script under a memory debugger (valgrind, with the options --leak-check=full --show-leak-kinds=all --track-origins=yes), I get some odd warnings about conditional jumps depending on uninitialised values in the middle of the run, originating from the ReaxFF .cpp files (please see the attached valgrind_output).
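For completeness, the run was launched along these lines (the exact binary name and number of MPI processes varied between tests):

mpirun -np 4 valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes ./lmp_mpi -in in.FeOH3 2> valgrind_output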

If you have any more suggestions please let me know. Thank you again,
Efstratios

in.FeOH3 (716 Bytes)

in.FeOH3_restart (700 Bytes)

log.lammps (73.8 KB)

valgrind_output (7.79 KB)

This is missing the data file, so I cannot really make tests of my own.

If you say that your system is “clustering”, then that could be an explanation. The USER-REAXC code makes some initial memory allocations for dynamic properties based on the needs of the initial structure. If atoms cluster, they will have more neighbors on average and thus require more storage for that kind of information, which can exceed the pre-allocated sizes. Technically, this should be caught by code internal to USER-REAXC, but because of the history of how this code was developed and how it was integrated into LAMMPS, that did not happen.

This is why I asked whether you could break the run down into multiple segments. With each new segment, the heuristic memory allocation is repeated, so larger changes in the environment are less likely to exceed the original estimates. If the “clustering” is the expected behavior of your material, then just doing 1-2 short initial runs and then restarting from an equilibrated geometry is the way to go about it.
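As a sketch (file names are placeholders, and pair style, coefficients, and fixes have to be re-specified after reading a restart, since reax/c does not store its settings in restart files), the equilibration input would end with something like:

run            50000                   # short initial segment(s)
write_restart  feoh3.equil.restart     # save the equilibrated state

and the production input would then pick up from there:

read_restart   feoh3.equil.restart
# re-specify pair_style, pair_coeff, fix qeq/reax, thermostat, etc., then:
run            3000000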

There are two more options that you can consider and try out (examples below):

  • use the KOKKOS version of the reax/c pair style; it has a more robust heuristic memory allocation scheme
  • play with the “mincap” and “safezone” parameters to increase the storage buffers used during the heuristics-based memory allocation
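For the second option, the keywords go directly on the pair_style line, e.g. (illustrative values; the documented defaults are safezone 1.2 and mincap 50):

pair_style reax/c NULL safezone 1.6 mincap 100

For the first option, a KOKKOS-enabled binary (the name below is just a placeholder) can switch to the reax/c/kk variant at run time via the suffix flag:

mpirun -np 4 ./lmp_kokkos -k on -sf kk -in in.FeOH3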

HTH,
Axel.

Dear Axel,

My apologies about the data file; it is, however, the same as the one distributed with the lammps-7Aug19 code.
The suggestion to increase the “mincap” and “safezone” parameters did solve the problem, both in the previously attached case and in the other cases I study. However, those other cases are combustion simulations with small hydrocarbons, which of course do not agglomerate during the run but react into smaller products, so it seems strange that this solution worked even for “non-clustered” systems. Maybe the initial allocation is really tight, and this is a more general issue with the ReaxFF code in LAMMPS. May I suggest that these parameters be highlighted in a note in the LAMMPS documentation, rather than only mentioning in the text that a segmentation fault can occur if they are not included? This would make it clearer for users to keep in mind when using the REAXFF package.

Either way thank you very much for all your help,
Efstratios

Dear Axel,

My apologies about the data file; it is, however, the same as the one distributed with the lammps-7Aug19 code.

my bad. i overlooked the section where you mentioned that.

The suggestion to increase the “mincap” and “safezone” parameters did solve the problem, both in the previously attached case and in the other cases I study. However, those other cases are combustion simulations with small hydrocarbons, which of course do not agglomerate during the run but react into smaller products, so it seems strange that this solution worked even for “non-clustered” systems.

the issue in all cases is that there are significant structural changes, which result in different storage needs. if those changes are too numerous or go too far, the heuristics will fail. they could be made safer, but that would waste memory (and prevent running large calculations, even though they would normally fit) or cost performance. the mincap and safezone parameters were exposed in the user interface for exactly that purpose.

in general, i consider it good practice to break simulations into multiple parts, keep the setup/equilibration separate from the production run, and keep data files (not just binary restarts) for the intermediate steps. that avoids problems with these heuristics (although there are only a few of them in LAMMPS) and makes it easier to go back to an intermediate result and, if needed, continue from there in a different direction instead of having to redo the whole process.
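as a sketch (file names are placeholders), the end of each stage can write out both formats:

write_data     stage1.data         # text data file, readable again with read_data
write_restart  stage1.restart      # binary restart, faster to write but version-dependent

a follow-up input then starts from the data file with read_data and re-specifies the pair style, fixes, etc.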

Maybe the initial allocation is really tight, and this is a more general issue with the ReaxFF code in LAMMPS. May I suggest that these parameters be highlighted in a note in the LAMMPS documentation, rather than only mentioning in the text that a segmentation fault can occur if they are not included? This would make it clearer for users to keep in mind when using the REAXFF package.

a segmentation fault is something that should not happen. especially since this happens in a section of the code that is not time critical, it should be possible to check for it. that is why i suggested using the KOKKOS version (you can compile KOKKOS in serial mode for the CPU together with MPI), which includes some of these more careful checks and reallocations.

having a small test case that reproduces the issue is usually helpful for figuring out a way to at least put in some safeguards, avoid the segmentation faults, and print a more meaningful error message instead.

thanks,
axel.