[lammps-users] seg fault in run with long range bonded interactions

I’m running a somewhat weird system - only bonded pair terms, but in a periodic solid-state system and out to very long range, so very many bonds - 2048 atoms and 93696 bonds. I’m getting intermittent segmentation faults - the details are below. Running systems that are identical except for having fewer bonds (only using shorter range interactions) has been working fine for hundreds of runs, with various particular sets of bonded interactions.

I think I’d call it a bug (inherently, because of the type of crash), whether in LAMMPS or the OS (Cray XC40/50 running SLES 12), but I wouldn’t be surprised if it’s because I’m in some regime that was never planned for. Does anyone have any ideas about figuring out what’s going on, or what I might need to do to fix it?

stdout/stderr has the following messages (output from several MPI ranks, interleaved):

*** Error in `/p/home/noamb/src/work/Skutterudites/GK_MD/LAMMPS/clean_rerun/lmp': double free or corruption (!prev): 0x0000000002294380 ***
*** Error in `/p/home/noamb/src/work/Skutterudites/GK_MD/LAMMPS/clean_rerun/lmp': double free or corruption (!prev): 0x0000000002294340 ***
*** Error in `/p/home/noamb/src/work/Skutterudites/GK_MD/LAMMPS/clean_rerun/lmp': corrupted double-linked list: 0x0000000002294300 ***
*** Error in `/p/home/noamb/src/work/Skutterudites/GK_MD/LAMMPS/clean_rerun/lmp': corrupted double-linked list: 0x0000000002294380 ***
======= Backtrace: =========

The backtrace that follows is not useful, but if I use gdb to view the core dump, I get:

Noam,

The stack traces point to a corruption of communication buffers, possibly due to an overflow. This can happen when an unusual situation gets around the current heuristics; it may be possible to identify a place where a check and a more accurate computation could be added, or there may be a setting in the input that needs to be adjusted.

can you provide a sample input deck that reproduces this?

Thanks,
Axel.

> Noam,
>
> The stack traces point to a corruption of communication buffers, possibly due to an overflow. This can happen when an unusual situation gets around the current heuristics; it may be possible to identify a place where a check and a more accurate computation could be added, or there may be a setting in the input that needs to be adjusted.
>
> can you provide a sample input deck that reproduces this?

I can, but first let me see how small I can make it. It won’t be that small, since the problem only happens with very long-range interactions, and since those are bonded the periodic cell has to be more than twice the longest interaction range in each direction; but maybe I can reduce it somewhat.

Noam

best you put it into a .tar.gz file and send it to me privately (there is a size limit for mailing list emails).

thanks, axel.

putting the conversation back to the mailing list, so that people (hopefully) get to see what the resolution is going to be.
So far I have been unable to reproduce the segmentation fault. I am running on a desktop with (only) 4 CPU cores.

Instead I initially ran into overflow issues when building the special neighbor lists.
Since you don’t use a pair potential, those are not really needed, and because of the many bonds they become substantial and consume a lot of time and memory.
This can be avoided with a special_bonds setting of lj/coul 1.0 1.0 1.0, which reduces what has to be collected to only the 1-2 neighbors needed for building the bond list (see the example below).
By adding this setting I was able to run the simulation, but then ran into the “bond atom missing” issue while the pressure was becoming quite large.
This change also massively reduced the time spent on collecting the 1-2, 1-3, 1-4 and special neighbor information.
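
In the input script that is just:

  special_bonds lj/coul 1.0 1.0 1.0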

The heuristics-based estimator for the communication cutoff inside of LAMMPS suggests a larger communication cutoff of 16.5 Angstrom, but setting that didn’t change anything; nor did using an even larger cutoff (see the comm_modify line below).
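
For anyone trying to reproduce this, the communication cutoff can be set by hand with comm_modify; the value is just the one suggested by the heuristic above:

  comm_modify cutoff 16.5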

The stack trace indicates that this happens in the comm->exchange() call, which happens right before a reneighboring.
Further experimentation indicates that the “bond atom missing” error I get coincides with the first reneighboring at step 8500.
Enforcing regular and more frequent reneighboring doesn’t seem to change the physics of the system: things still go south at around step 8500 (see the commands below).
I observe a continuously increasing (and rather large) pressure, and in the step before the reneighboring fails the bonded energy, after slowly increasing, suddenly becomes negative, which suggests a numerical overflow in the force computation.
Enabling floating-point exceptions confirms this: there is a bond that is over 10**80 in length.
Switching from fix nvt to fix nve/limit with fix langevin shows that bad things happen even before that, and without triggering floating-point exceptions.
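
The two tests above amount to something like the following lines; the displacement cap, target temperature, damping constant, and random seed are placeholder values, not what matters for the diagnosis:

  # rebuild the neighbor lists every step, with no delay and no distance check
  neigh_modify every 1 delay 0 check no

  # instead of fix nvt: cap the per-step displacement and thermostat with Langevin
  fix integrate all nve/limit 0.1
  fix thermostat all langevin 300.0 300.0 100.0 12345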

in summary, it looks like the segfault can be avoided by skipping the determination of 1-3, 1-4, and special neighbors, i.e. by changing the special_bonds setting from the default of 0.0 0.0 0.0 to 1.0 1.0 1.0; the remaining “bond atom missing” issue seems to be due to problems with the physics of the model.

axel.

It didn’t break for me. I did get messages from the checks that test for 32-bit integer overflows. So it seems your case is marginal: it pushes the total communication buffer size just past the 32-bit limit, without the overflow being noticed in the individual (and by far largest) contribution from the special neighbor rebuild.

axel.

please keep the conversation on the mailing list.

the key to avoiding the segfault is not turning off the neighbor list rebuilds (that only avoids the “bond atom missing” errors), but rather the special_bonds setting, which avoids the useless and time-consuming work of building the 1-3, 1-4, and special neighbor lists, and the associated communication and large memory allocations.

Axel.

>> Thanks for the workaround, although I still think it doesn’t seem like a good sign that the memory handling breaks, even in this admittedly unphysical situation.

> It didn’t break for me. I did get messages from the checks that test for 32-bit integer overflows. So it seems your case is marginal: it pushes the total communication buffer size just past the 32-bit limit, without the overflow being noticed in the individual (and by far largest) contribution from the special neighbor rebuild.

If it’s a communication buffer, it could be dependent on the number of MPI processes, I suppose. I went back to the default handling of special_bonds, and with 4 processes I got

ERROR: Overflow input size in rendezvous_a2a (src/work/LAMMPS/lammps/src/comm.cpp:1026)

but with 8 I got the realloc seg fault. If your machine will let you run MPI oversubscribed you might be able to reproduce the seg fault without too much resource demand. The runs are only around ten seconds for me, even with that few cores.

> please keep the conversation on the mailing list.

Sorry - I didn’t realize you wanted the non-substantive follow-ups on the list as well.

> the key to avoiding the segfault is not turning off the neighbor list rebuilds (that only avoids the “bond atom missing” errors), but rather the special_bonds setting, which avoids the useless and time-consuming work of building the 1-3, 1-4, and special neighbor lists, and the associated communication and large memory allocations.

OK - I’ve tried that, and it does indeed prevent the seg fault. Now I’m getting “Bond atoms missing” errors, which, as I said, is consistent with these pair potentials just being unstable. That should be enough for me to figure out what’s going on with the physics in this model.

Thanks for the workaround, although I still think it doesn’t seem like a good sign that the memory handling breaks, even in this admittedly unphysical situation.

I don’t get a realloc segfault, just the “bond atom missing” error. However, running with valgrind shows an off-by-one memory access, and that seems to happen after the system has already gone to a bad state (i.e. E_mol went from about 250 to about -100 and the pressure more than doubled). So it appears that the bad physics comes first, which leads to inconsistent domain decomposition data, which leads to memory corruption, which then shows up as either “bond atom missing” or a segmentation fault. I currently don’t see anything that could be done here on the LAMMPS side. If you want to stop before things go too bad, you could try using fix halt with a threshold on ebond or emol or pressure (see the sketch below).
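
A minimal sketch of that last suggestion, using the bonded energy as the trigger; the check interval and the threshold value are placeholders that would need tuning for this system:

  # stop the run cleanly once the bonded energy exceeds a (placeholder) threshold
  variable eb equal ebond
  fix watchdog all halt 10 v_eb > 1.0e4 error soft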

Axel.

OK - I agree that it's likely the first problem is always a dynamics stability issue. Thanks for the energy threshold suggestions on how to catch it before it goes too far.

              Noam