Hi y’all,
I have been using FireWorks with atomate2 for a few months now, and it has been great. However, my jobs often do not complete within the assigned walltime, and increasing the walltime is not an option, so I have been looking for ways to continue these jobs. My current solution is pretty janky: I copy the CONTCAR from the run directory and update the firework spec with the new structure.
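import os
from fireworks import LaunchPad
from pymatgen.core import Structure

# Assumes the default LaunchPad configuration; fw_id is the ID of the timed-out firework
launchpad = LaunchPad.auto_load()
# Get the last structure from the most recent launch directory
fw = launchpad.get_fw_by_id(fw_id)
struct = Structure.from_file(os.path.join(fw.launches[-1].launch_dir, "CONTCAR"))
# Update the firework spec in the LaunchPad with the new structure
launchpad.update_spec([fw_id], {"_tasks.0.job.function_args": [struct.as_dict()]})
# Re-run the firework
launchpad.rerun_fw(fw_id)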
This isn’t ideal: the electron density isn’t copied over, and I have to run a script at regular intervals to check which runs are still marked as “RUNNING” (the database is never updated when a job hits the walltime).
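To make the electron density point concrete, here is a minimal sketch of carrying the restart files along with the structure. The helper name `stage_restart_files` and the directory layout are hypothetical, and it assumes the timed-out run left uncompressed `CONTCAR`, `CHGCAR`, and `WAVECAR` files in its launch directory. It is not part of FireWorks or atomate2.

```python
import shutil
from pathlib import Path

# Standard VASP output files that a follow-up run can restart from.
RESTART_FILES = ("CHGCAR", "WAVECAR")


def stage_restart_files(prev_dir, new_dir):
    """Copy restart inputs from a timed-out run into a new run directory.

    Hypothetical helper: returns the names of the files written to new_dir.
    """
    prev, new = Path(prev_dir), Path(new_dir)
    new.mkdir(parents=True, exist_ok=True)
    copied = []

    # CONTCAR holds the last ionic positions; it becomes the new POSCAR.
    contcar = prev / "CONTCAR"
    if contcar.is_file() and contcar.stat().st_size > 0:
        shutil.copy2(contcar, new / "POSCAR")
        copied.append("POSCAR")

    # Copy the charge density and wavefunctions, skipping missing or empty files.
    for name in RESTART_FILES:
        src = prev / name
        if src.is_file() and src.stat().st_size > 0:
            shutil.copy2(src, new / name)
            copied.append(name)

    return copied
```

The non-empty check only catches files that were never written; a file that was cut off mid-write when the job was killed would still be copied. VASP also only reads these files when the INCAR asks for it (`ICHARG = 1` for CHGCAR, `ISTART = 1` for WAVECAR), so the continuation job's input settings would need to match.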
Is there a better solution? I am vaguely aware that there is a checkpoint system for jobs, but I am not yet sure whether it would solve this problem. Ideally, a continuation job would be submitted immediately for any VASP run that hits the walltime.
Thanks,
Kat Nykiel