Doubt in LAMMPS: Asynchronous Checkpointing

Hey,
I am trying to modify the restart module (i.e. the write_restart function in output.cpp) such that
1.) whenever SIGINT/SIGTERM is raised, the signal is caught, and
2.) control goes to the signal handler function, which saves the current state and then quits.
It looks something like this:

WriteRestart *restart; // declared earlier in output.h; shown here for reference

void Output::async_catch(int d)
{
  if ((d == SIGINT) || (d == SIGTERM))
  {
    async_output->async_write_restart();
  }
}

void Output::async_write_restart()
{
  char *file = "some_filename";
  printf("Bkill signal detected.Attempting to restart...\n");
  restart->write(file);
  printf("Success. Restart done! \n");
  delete[] file;
  last_restart = ntimestep;
}

I wrote this in output.cpp. I am getting a segmentation fault at the call to restart->write; execution never enters the function.

Please help.

your approach is wrong on many levels.

- it will not work in parallel, because signals may not be delivered to all
processes
- even if all parallel processes receive the signal, they will not receive it
at the same time, and different tasks may be in different contexts.
- restarts can only be written at a very specific point during execution.

axel.

here is a recipe for how this can be done cleanly.

  • create a new fix style that would be at the end_of_step() point in the MD loop.

  • define a static variable that will contain the caught_signal flag (initialized to 0, of course).

  • define a static signal handler function with C bindings that will set the flag to 1.

  • in the end_of_step() procedure of the new fix, you have to do a MPI_MAX or MPI_SUM reduction of the caught_signal flag into a do_restart flag.

if the resulting flag is > 0, then you can write out a restart and then abort the run.
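the signal-handling part of the recipe above could be sketched roughly as follows. this is only a minimal sketch: the names caught_signal, record_signal, install_handlers and local_signal_flag are made up for illustration and are not part of LAMMPS, and the surrounding fix class is left out entirely.

```cpp
#include <csignal>

// All names below are hypothetical; nothing here comes from the LAMMPS sources.
namespace {
// sig_atomic_t is the only object type that is safe to write from a handler.
volatile std::sig_atomic_t caught_signal = 0;
}

// Handler with C linkage. It only records that a signal arrived and does
// no I/O, no allocation, and no restart writing.
extern "C" void record_signal(int) { caught_signal = 1; }

// Install the handler once, e.g. from the constructor of the new fix.
void install_handlers()
{
  std::signal(SIGINT, record_signal);
  std::signal(SIGTERM, record_signal);
}

// Would be called from end_of_step(); returns this process's flag (0 or 1).
// In a parallel run, each rank would then combine its value with the others,
// for example (assuming a communicator named comm):
//   MPI_Allreduce(&mine, &do_restart, 1, MPI_INT, MPI_MAX, comm);
// so that all ranks agree on whether to write the restart and stop.
int local_signal_flag() { return caught_signal ? 1 : 0; }
```

keeping the handler down to a single flag assignment is the point: the actual restart is written later, from the normal code path, at a step where all processes are in the same state.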

in addition, i would suggest to also check for a file, e.g. lmp.EXIT, whose presence would have the same effect as catching a signal. this would allow a “soft exit” in the middle of a run without losing data, via “touch lmp.EXIT”.
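the file check could look something like the sketch below. the helper names exit_file_present and consume_exit_file are hypothetical, and deleting the file after it has been seen is an assumption added here so that the next run does not stop right away. in a parallel run one would presumably do this test on a single rank only and feed the result into the same reduction as the signal flag.

```cpp
#include <cstdio>
#include <fstream>
#include <string>

// Hypothetical helper: true if a file with this name can be opened for reading.
bool exit_file_present(const std::string &path)
{
  std::ifstream probe(path);
  return probe.good();
}

// Hypothetical helper: report whether the marker file exists and, if it does,
// delete it so that it does not trigger again on the next run.
bool consume_exit_file(const std::string &path)
{
  if (!exit_file_present(path)) return false;
  std::remove(path.c_str());
  return true;
}
```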

that being said, this kind of approach is used in many DFT MD codes, but generally overkill for classical MD.
the normal restart command with two alternating restart files does just fine. with the exception of the most extreme simulations, the cost of writing a restart is typically small, and with a suitable choice of the restart interval very little time is lost.

axel.