Best practice for continuing jobs that hit walltime?

KatNykiel · March 11, 2024, 3:28pm

Hi y’all,

I have been using FireWorks with atomate2 for a few months now, and it has been great. However, I often find that my jobs do not complete within the assigned walltime, and increasing the walltime is not an option. I have been trying to find ways to continue these jobs. My current solution is pretty janky: I copy the CONTCAR from the run directory and update the firework spec with the new structure.

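from pathlib import Path

from fireworks import LaunchPad
from pymatgen.core import Structure

# Assumes the default LaunchPad config file is discoverable
launchpad = LaunchPad.auto_load()

# Get the last structure from the most recent launch directory
# (fw_id is the ID of the firework that hit the walltime)
fw = launchpad.get_fw_by_id(fw_id)
struct = Structure.from_file(Path(fw.launches[-1].launch_dir) / "CONTCAR")

# Update the firework spec in the LaunchPad with the new structure
launchpad.update_spec([fw_id], {"_tasks.0.job.function_args": [struct.as_dict()]})

# Re-run the firework
launchpad.rerun_fw(fw_id)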

This isn’t ideal: the electron density isn’t carried over, and I have to run a script at regular intervals to find runs that are still marked “RUNNING” (the database is never updated when a job hits the walltime).
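For reference, the periodic check described above could probably lean on FireWorks' own lost-run detection rather than a hand-rolled query. The sketch below assumes `LaunchPad.detect_lostruns` accepts `expiration_secs` and `rerun` keyword arguments and returns a `(lost_launch_ids, lost_fw_ids, inconsistent_fw_ids)` tuple, as I believe it does in recent FireWorks releases. The helper name `find_lost_fireworks` and the four-hour threshold are made up for illustration.

```python
def find_lost_fireworks(launchpad, expiration_secs=4 * 60 * 60):
    """Return the sorted fw_ids of fireworks whose launches look lost.

    `launchpad` is expected to be a fireworks.LaunchPad (or any object with a
    compatible `detect_lostruns` method). It is passed in so that this helper
    needs no database connection of its own.
    """
    # rerun=False: only report. The spec should be updated with the latest
    # structure before the firework is rerun, so the rerun is left to the caller.
    _lost_launch_ids, lost_fw_ids, _inconsistent_fw_ids = launchpad.detect_lostruns(
        expiration_secs=expiration_secs,  # illustrative threshold, not a recommendation
        rerun=False,
    )
    # The same firework can appear more than once if it has several lost launches.
    return sorted(set(lost_fw_ids))
```

Each returned ID could then go through the same `update_spec` / `rerun_fw` steps as before. Keeping the rerun out of the helper matters here, because a rerun triggered by the detection step itself would start again from the original structure.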

Is there a better solution? I am vaguely aware that there is a checkpoint system for jobs, but I am not yet sure whether it would solve this problem. Ideally, a continuation job would be submitted immediately for any VASP run that hits the walltime.
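To make the goal concrete, here is a minimal sketch of what staging a continuation might look like at the file level, so that the charge density and wavefunctions come along with the structure. It uses only the standard library. The function name `stage_continuation` and the choice of files are my own assumptions, not anything provided by atomate2 or FireWorks. As far as I understand, VASP picks up a WAVECAR by default when one is present, while reading the CHGCAR needs `ICHARG = 1` in the INCAR.

```python
import shutil
from pathlib import Path

# Restart files to carry over; copied only when present and non-empty.
RESTART_FILES = ("CHGCAR", "WAVECAR")


def stage_continuation(prev_dir, new_dir):
    """Copy restart inputs from a timed-out run into a fresh directory.

    Returns the names of the files written to `new_dir`.
    """
    prev_dir, new_dir = Path(prev_dir), Path(new_dir)
    new_dir.mkdir(parents=True, exist_ok=True)
    copied = []

    # The last written geometry becomes the new starting geometry. An empty
    # or missing CONTCAR means no ionic step finished, so reuse the old POSCAR.
    contcar = prev_dir / "CONTCAR"
    has_contcar = contcar.is_file() and contcar.stat().st_size > 0
    source = contcar if has_contcar else prev_dir / "POSCAR"
    shutil.copy2(source, new_dir / "POSCAR")
    copied.append("POSCAR")

    # Carry over the charge density and wavefunctions when they exist.
    for name in RESTART_FILES:
        path = prev_dir / name
        if path.is_file() and path.stat().st_size > 0:
            shutil.copy2(path, new_dir / name)
            copied.append(name)
    return copied
```

One caveat with this approach: a job killed at the walltime may leave a partially written WAVECAR or CHGCAR, and a non-empty check does not catch that, so the copied files would still need a sanity check before being trusted.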

Thanks,

Kat Nykiel