[lammps-users] Efficiently restarting an aborted simulation

Alexander_Stukowski1 · July 14, 2008, 1:21pm

Dear LAMMPS developers and users,

in our research group we are using LAMMPS for several long-running
simulations. Since we are running LAMMPS on a public computer cluster which
has a 12 hour wallclock time limit for computing jobs, we have to restart the
simulation several times to reach the desired number of time steps. At the
moment we do this by hand which basically requires the following steps:

1. Look into the log files and determine the last time step at which a restart
file has been written.

2. Write a new LAMMPS script file that loads the last restart file and then
runs for the remaining number of timesteps.

3. Write a new job file for the new LAMMPS script and submit it to the queue.

This procedure is very tedious and almost impossible to automate. I would like
to know from the LAMMPS community whether you also consider this a problem, or
whether you have already found a solution that I have missed.
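
For illustration, steps 1 and 2 of the manual procedure could be sketched in Python roughly as follows. This is only a sketch under assumptions: the restart files are taken to be named `restart.<timestep>` (a hypothetical naming scheme), and the function names are invented for this example.

```python
import re
from pathlib import Path


def latest_restart(directory, prefix="restart."):
    """Return (timestep, path) of the restart file with the largest timestep.

    Assumes files are named <prefix><timestep>, e.g. restart.50000
    (a hypothetical naming scheme). Returns None if no file matches.
    """
    pattern = re.compile(re.escape(prefix) + r"(\d+)$")
    found = []
    for path in Path(directory).iterdir():
        match = pattern.match(path.name)
        if match:
            found.append((int(match.group(1)), path))
    return max(found, key=lambda item: item[0]) if found else None


def remaining_steps(total_steps, last_step):
    """Number of timesteps still to run after restarting from last_step."""
    return max(total_steps - last_step, 0)
```

The timestep is compared as an integer rather than as text, so that `restart.20000` is treated as newer than `restart.5000`.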

I would propose to solve this issue by adding a new function to LAMMPS.

Basically, it would be sufficient to add a new mode of operation to LAMMPS
which makes it skip a defined number of timesteps in the input file. That
means it would process all regular commands like fix/compute etc. but not the
run/dump/write commands up to a given number of timesteps. Then it should
somehow load the specified restart file and continue with normal execution.
This would make it unnecessary to write a new LAMMPS input file to continue
an aborted simulation.

I know that this would require some major changes to the LAMMPS code and maybe
no one is willing to do it. But I wanted to call the developers' attention to
this issue and put my proposal up for discussion.

Regards,
Alex

_Vikas_Varshney2 · July 14, 2008, 1:53pm

Dear Alex,
If you can submit many jobs to the queue at once, then you can write a batch file.
In the batch file you keep restarting the simulation every X timesteps (X is the number of timesteps that finishes within 12 hours).
Just write one in.restart file and one in.firststart file.

akohlmey · July 14, 2008, 2:16pm

dear alex,

it looks as if you have not seen how other people use restarting.
basically all you want is already there, you only have
to look at the problem differently. you basically have to
depend on your job being finished before the queue time is
up. the restart facility in lammps is quite flexible and does
not need any changes (as far as i am concerned).

when using a batch system the first thing you have to do
is to find out how many MD steps you can fit into a "slot"
and then you reduce that with a safety margin (say ~30mins).
when you know how many steps you can do in that time, you
can chop your trajectory into pieces of equal length.
that also makes post-processing easier.
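
The arithmetic behind this can be sketched in Python. The function and parameter names are made up for this example, and the time per MD step is something you would have to measure for your own system.

```python
import math


def steps_per_segment(wall_limit_h, margin_min, seconds_per_step):
    """Largest number of MD steps that fits into one queue slot.

    wall_limit_h     -- wallclock limit of the queue in hours
    margin_min       -- safety margin in minutes
    seconds_per_step -- measured wallclock time per MD step
    """
    usable_s = wall_limit_h * 3600 - margin_min * 60
    if usable_s <= 0 or seconds_per_step <= 0:
        raise ValueError("need a positive usable time and time per step")
    return int(usable_s // seconds_per_step)


def number_of_segments(total_steps, steps_per_seg):
    """Number of equal-length segments needed to cover total_steps."""
    return math.ceil(total_steps / steps_per_seg)
```

With a 12 hour limit, a 30 minute margin, and 0.5 s per step (an invented figure), this gives 82800 steps per segment. In practice one would probably round that down to a convenient number.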

if you put a write_restart at the end of your input script
and a read_restart at the proper place in all but the very
first input, you can continue your run without manual intervention.
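
As a rough sketch of that layout, the Python snippet below builds the text of one input script per segment. The file name `restart.latest` and the function name are hypothetical, and the placeholder comment lines stand for whatever setup, fixes, and output commands your simulation needs.

```python
def segment_input(steps, restart_file="restart.latest", first=False):
    """Return LAMMPS input text for one trajectory segment.

    The very first segment sets the system up from scratch; every later
    segment starts from the restart file the previous segment wrote.
    """
    lines = []
    if first:
        lines.append("# ... system setup for the very first segment goes here ...")
    else:
        lines.append(f"read_restart {restart_file}")
    lines.append("# ... fixes, computes, dumps and other settings go here ...")
    lines.append(f"run {steps}")
    # written last, so it only exists if the segment ran to completion
    lines.append(f"write_restart {restart_file}")
    return "\n".join(lines) + "\n"
```

Because every continuation segment reads and writes the same file name, the same input can be submitted again and again.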

the only thing you have to watch out for is that trajectory
files get overwritten. so either you write a small script
that generates a bunch of input files so that each lammps
run segment has a different output file or you can rename
in the job submit script, after the job has been completed.

i prefer the latter using the queue system job id as number
to have truly unique output names. this way you can reuse
the same input over and over again until your trajectory is
done. i've been running trajectories that were running in
12hour chunks for over two weeks. and to continue you only
need to submit the submit script a few times more. when using
job dependencies from the batch system, this can be mostly run
unattended. using the restart keyword in between is for safety
in case the job crashes unexpectedly (when running across hundreds
of cpus, failures become more likely...).
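
a minimal sketch of such a rename step in Python, to be called from the submit script after lammps has finished. it assumes the batch system exports the job id in the `PBS_JOBID` environment variable, as PBS/Torque does; the function name and the fallback tag are made up for this example.

```python
import os
from pathlib import Path


def tag_with_job_id(path, job_id=None):
    """Rename path to <name>.<job_id> so the next segment cannot overwrite it.

    If job_id is not given, it is read from the PBS_JOBID environment
    variable (an assumption about the batch system in use).
    """
    if job_id is None:
        job_id = os.environ.get("PBS_JOBID", "nojobid")
    source = Path(path)
    target = source.with_name(f"{source.name}.{job_id}")
    if target.exists():
        # refuse to clobber an earlier segment's output
        raise FileExistsError(target)
    source.rename(target)
    return target
```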

feel free to get back to me, if you need example submit scripts
(i should have examples for Load Leveler and PBS/Torque).

cheers,
axel.

p.s.: i just saw vikas' reply. that could actually make the
rename script that i'm using mostly obsolete (i also use it to
archive the files to the hierarchical storage automatically,
and that still needs to be done).

sjplimp · July 14, 2008, 2:26pm

See the "run upto" command and the "read_restart file.*"
command (with an asterisk).

These were designed to address the problem
you describe.

Steve
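
As a sketch of how these two commands could be combined, the Python snippet below builds the text of a continuation input. The restart file pattern `restart.*`, the function name, and the placeholder comment line are assumptions made for this example; the very first run would still need its own input, because no restart file exists yet.

```python
def continuation_input(final_step, restart_every, restart_pattern="restart.*"):
    """Return LAMMPS input text that resumes from the newest restart file.

    read_restart with a '*' picks the matching file with the largest
    timestep, and 'run N upto' runs from the current timestep up to
    timestep N, so the same input works after every interruption.
    """
    if "*" not in restart_pattern:
        raise ValueError("restart_pattern must contain a '*' wildcard")
    lines = [
        f"read_restart {restart_pattern}",
        "# ... fixes, computes, dumps and other settings go here ...",
        # periodic restart files; '*' is replaced by the timestep
        f"restart {restart_every} {restart_pattern}",
        f"run {final_step} upto",
    ]
    return "\n".join(lines) + "\n"
```

With this layout, the number of remaining timesteps never has to be worked out by hand.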