Recovery rerun does not work in specific cases

I would like to enable recovery rerun (via spec._recovery), which, as far as I can tell, is only possible with the --task-level flag. When the Firework is a root node, recovery works for any number of tasks and regardless of which task failed. When the Firework has parents (see the example below), recovery works only if the first task has completed, and it fails if the first task itself fails. I have found that the error below occurs because Rocket.run does not write a checkpoint to the launch in that specific case.

Question 1: Is this a necessary constraint or a bug?
Question 2: Is it possible to enable recovery upon rerun for one-task Fireworks, or for multi-task Fireworks whose first task failed? The directory of the first task already contains usable data that can shorten the next run of that task.

In [1]: from fireworks.fw_config import LAUNCHPAD_LOC
   ...: from fireworks import LaunchPad
   ...: from fireworks import Firework, Workflow
   ...: from fireworks.user_objects.firetasks.script_task import ScriptTask, PyTask
   ...: fw_1 = Firework([ScriptTask(script='echo Hello 1')])
   ...: fw_2 = Firework([PyTask(func='time.sleep', args=[20]), ScriptTask(script='echo Hello 2')])
   ...: wf = Workflow([fw_1, fw_2], links_dict={fw_1: fw_2})
   ...: lpad = LaunchPad.from_file(LAUNCHPAD_LOC)
   ...: lpad.add_wf(wf)
Out[1]: {-2: 1, -1: 2}
$ rlaunch rapidfire
Hello 1
^CInterrupted by signal 2
Traceback (most recent call last):
  File "/mnt/data/ubuntu/work/python-3.10.12/lib/python3.10/site-packages/fireworks/core/rocket.py", line 261, in run
    m_action = t.run_task(my_spec)
  File "/mnt/data/ubuntu/work/python-3.10.12/lib/python3.10/site-packages/fireworks/user_objects/firetasks/script_task.py", line 187, in run_task
    output = func(*args, **kwargs)
  File "/mnt/data/ubuntu/work/python-3.10.12/lib/python3.10/site-packages/fireworks/scripts/rlaunch_run.py", line 45, in handle_interrupt
    sys.exit(1)
SystemExit: 1
$ lpad rerun_fws --task-level --copy-data -i 1
Traceback (most recent call last):
  File "/mnt/data/ubuntu/work/python-3.10.12/bin/lpad", line 7, in <module>
    sys.exit(lpad())
  File "/mnt/data/ubuntu/work/python-3.10.12/lib/python3.10/site-packages/fireworks/scripts/lpad_run.py", line 1578, in lpad
    args.func(args)
  File "/mnt/data/ubuntu/work/python-3.10.12/lib/python3.10/site-packages/fireworks/scripts/lpad_run.py", line 640, in rerun_fws
    lp.rerun_fw(int(fw_id), recover_launch=l_id, recover_mode=args.recover_mode)
  File "/mnt/data/ubuntu/work/python-3.10.12/lib/python3.10/site-packages/fireworks/core/launchpad.py", line 1725, in rerun_fw
    recovery = self.get_recovery(fw_id, recover_launch)
  File "/mnt/data/ubuntu/work/python-3.10.12/lib/python3.10/site-packages/fireworks/core/launchpad.py", line 1769, in get_recovery
    recovery.update(_prev_dir=launch.launch_dir, _launch_id=launch.launch_id)
AttributeError: 'NoneType' object has no attribute 'update'
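
To illustrate the failure mode: the traceback suggests that the recovery data is None when no checkpoint was written to the launch, so calling .update() on it raises AttributeError. Below is a minimal, self-contained sketch of that pattern with an explicit guard. It does not use FireWorks; the function name get_recovery_sketch, the state_history layout, and the task_index key are assumptions made for illustration only. The keys _prev_dir and _launch_id are taken from the traceback.

```python
def get_recovery_sketch(state_history, launch_dir, launch_id):
    """Hypothetical, simplified stand-in for a recovery lookup (not the FireWorks code)."""
    # Assumption: the checkpoint, if any, sits in the last state-history entry.
    recovery = state_history[-1].get("checkpoint")
    if recovery is None:
        # Without this guard, recovery.update(...) raises
        # AttributeError: 'NoneType' object has no attribute 'update'
        raise ValueError(f"launch {launch_id} has no checkpoint to recover from")
    recovery = dict(recovery)  # copy so the stored history is not mutated
    recovery.update(_prev_dir=launch_dir, _launch_id=launch_id)
    return recovery


# A launch whose first task finished and left a checkpoint (illustrative data).
print(get_recovery_sketch([{"checkpoint": {"task_index": 1}}], "/tmp/launch", 7))
```

With a history entry that lacks a checkpoint, the sketch raises a descriptive ValueError instead of the opaque AttributeError.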

Thanks for reporting this. I have not played around with this much, but it seems to be a bug rather than an intended feature. If you know how to implement the fix, I am happy to merge the PR and release; just let me know if that’s possible.

@Anubhav_Jain Thank you very much for your prompt reply. I fixed the issue in Pull Request #567 of materialsproject/fireworks on GitHub, "[BUG] Recovery rerun" (by ikondov). Please also consider another bugfix from December, Pull Request #563, "[BUG] Fix redirecting traceback from stderr to a file" (by ikondov). Both are ready for review!

Here is just a summary: The LaunchPad.ping_launch() method is called concurrently from the main thread and from the “heartbeat” thread. The call from the main thread passes a checkpoint, while the call from the heartbeat thread does not. This creates a race condition in which the first read from the database and the last write to it win. Whether the bug occurs depends on which of the two call stacks takes longer. Making the heartbeat thread wait first (default 3600 s) before it pings solves the issue.
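
The lost update described in this summary can be replayed deterministically without threads. The following is only a toy model, not FireWorks code: the in-memory store, the helpers read, write, and ping, and the task_index payload are all hypothetical stand-ins for the database document and the two ping calls.

```python
import copy


def simulate(heartbeat_writes_last):
    """Replay two ping calls against a toy store; return the surviving checkpoint."""
    # Hypothetical stand-in for the launch document in the database.
    store = {"state_history": [{"state": "RUNNING"}]}

    def read():
        return copy.deepcopy(store)  # each caller gets a private copy

    def write(doc):
        store.clear()
        store.update(copy.deepcopy(doc))  # full-document overwrite

    def ping(doc, checkpoint=None):
        entry = doc["state_history"][-1]
        entry["pinged"] = True
        if checkpoint is not None:  # only the main thread passes a checkpoint
            entry["checkpoint"] = checkpoint
        return doc

    checkpoint = {"task_index": 0}  # hypothetical checkpoint payload

    if heartbeat_writes_last:
        stale = read()                   # heartbeat reads first
        fresh = read()                   # main thread reads
        write(ping(fresh, checkpoint))   # main thread stores the checkpoint
        write(ping(stale))               # heartbeat overwrites it with a stale copy
    else:
        write(ping(read()))              # heartbeat finishes completely
        write(ping(read(), checkpoint))  # main thread reads and writes afterwards

    return store["state_history"][-1].get("checkpoint")


print(simulate(heartbeat_writes_last=True))   # None
print(simulate(heartbeat_writes_last=False))  # {'task_index': 0}
```

In the first ordering the checkpoint is lost because the heartbeat writes back a copy it read before the checkpoint existed. Delaying the heartbeat's first ping removes that overlap at the start of the run.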

Best regards,
Ivan

Thank you! I have released v2.0.9 which includes your PRs and a few others.

By the way, I have re-enabled my GitHub notifications (which were off for a long while), so hopefully I will see some of your PRs more quickly in the future. Thanks as always for your contributions.