Handling fireworks with checkpoint/restart

stephanmohr1985 · October 18, 2019, 7:57am

Hi there,

I have a question concerning the handling of long runs with checkpoint/restart features:

Imagine I have a workflow that contains long MD runs (in my specific case with Gromacs). Each firework within the workflow executes a long MD run, with linear dependencies between the different fireworks (i.e. MD1 -> MD2 -> MD3 etc.).

Now it might well happen that the firework MD1 stops prior to full completion, e.g. because the wall time limit has been reached. This is detected by Gromacs, and the code writes a checkpoint file and shuts down gracefully. Since no error is issued, the firework MD1 gets the status COMPLETED, and if we continue the execution of the workflow we will go on with the firework MD2, even though we should rather restart the firework MD1!

To circumvent this problem, one should have, at the end of the firework, the possibility to check whether the MD run has finished completely or not (I think I know how to do this), and then manually change the status from COMPLETED to something else (e.g. DEFUSED).

I thought that maybe this could be done with the FWAction object, but if I understand correctly it only allows defusing the children of the firework, not the firework itself.

Is there any alternative way to do this?

I was also looking at this thread, but the solution that is proposed there (dynamically adding new fireworks if a checkpoint is written) is not ideal. I prefer to execute the entire MD run in one firework rather than in several.

Any help is appreciated.

Thanks,

Stephan

alex · October 18, 2019, 5:48pm

Hi Stephan,

FireWorks support has moved to Discourse. I’ve reposted your question there: https://hackingmaterials.discourse.group/t/handling-fireworks-with-checkpoint-restart/41

For issues in the future, we look forward to answering questions on the Discourse forum!

Thanks,

Alex

Anubhav_Jain · November 15, 2019, 10:24pm

Hi Stephan

Sorry for the late reply.

I would just add a custom Firetask (e.g., CheckFinishTask) that follows your GROMACS run Firetask. That Firetask can check whether the run actually finished properly using custom code. If not, just throw any Python error (e.g., raise RuntimeError("MD run didn't finish properly and was checkpointed")). That error will force the Firework to FIZZLE.
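
A rough sketch of what such a check task might look like is below. The parameter names log_file and marker are made up for illustration, and the marker string is a placeholder, not actual GROMACS output; substitute whatever completion test you already have in mind. The fallback in the import is only there so the check logic can be tried on a machine without FireWorks installed.

```python
import os

try:
    from fireworks import FiretaskBase, explicit_serialize
except ImportError:
    # Stand-ins (not part of FireWorks) so the check can be tried without it.
    FiretaskBase = dict

    def explicit_serialize(cls):
        return cls


@explicit_serialize
class CheckFinishTask(FiretaskBase):
    """Raise an error if the MD log lacks a completion marker.

    Both parameters are hypothetical names chosen for this sketch:
    log_file is the path to the MD log, marker is the text that is
    assumed to appear only when the run reached its final step.
    """

    required_params = ["log_file", "marker"]

    def run_task(self, fw_spec):
        log_file = self["log_file"]
        marker = self["marker"]

        if not os.path.exists(log_file):
            raise RuntimeError(f"MD log {log_file} not found")

        # Scan the log for the (assumed) completion marker.
        with open(log_file) as handle:
            finished = any(marker in line for line in handle)

        if not finished:
            # Any uncaught exception makes FireWorks mark the Firework FIZZLED,
            # so the children (MD2, MD3, ...) are not started.
            raise RuntimeError("MD run didn't finish properly and was checkpointed")
```

The task would go last in the Firework's task list, right after the GROMACS task, so that a checkpointed run leaves MD1 FIZZLED instead of COMPLETED. A FIZZLED Firework can then be rerun, for example with lpad rerun_fws -i <fw_id>, provided the GROMACS task is set up to continue from the checkpoint file it finds.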