I’m running many flows of VASP jobs, each followed by a post-processing and cleanup job. The post-processing and cleanup job takes a list of all the vasp_dirs as input.
I’d like to post-process all successful jobs - but if the last job of the VASP chain fails, no post-processing occurs.
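Roughly, the structure looks something like this (a minimal sketch with hypothetical job names standing in for the real atomate2 VASP jobs and my post-processing code):

```python
from jobflow import Flow, job

# Hypothetical stand-ins for the real atomate2 VASP jobs and the cleanup job.
@job
def run_vasp(structure):
    # ... run VASP, return the structure and the directory it ran in
    return {"structure": structure, "dir_name": "/path/to/vasp_dir"}

@job
def postprocess_and_cleanup(vasp_dirs):
    # ... parse the results from all vasp_dirs, then delete the large files
    return None

# A chain of VASP jobs, each depending on the previous one.
relax = run_vasp("initial structure")
static = run_vasp(relax.output["structure"])

# The final job takes the list of all vasp_dirs as input,
# so it depends on every VASP job in the chain.
cleanup = postprocess_and_cleanup([relax.output["dir_name"], static.output["dir_name"]])

flow = Flow([relax, static, cleanup])
```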
Do you have any suggestions for how best to do this?
Is there a way to stop this post processing from being cancelled, whilst still using atomate2 classes?
I see that you already marked this as solved, and I don’t know if this may be the reason you are seeing this issue. But usually, if a VASP job does not complete successfully and is marked as FAILED, the children do not switch to the READY state (both in jobflow-remote and FireWorks).
A way to allow the child job to run even if the parents fail is to set on_missing_references in the JobConfig of the cleanup Job:
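Something along these lines (a minimal sketch, reusing the hypothetical postprocess_and_cleanup job from the sketch above; OnMissing comes from jobflow itself):

```python
from jobflow import OnMissing

# With OnMissing.NONE, references to the outputs of failed parents are
# resolved to None instead of raising an error, so the cleanup job can
# still run once all of its parents have finished.
cleanup.config.on_missing_references = OnMissing.NONE
```

The cleanup function then has to tolerate None entries in vasp_dirs, e.g. by filtering them out with `[d for d in vasp_dirs if d is not None]`.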
However, it is not entirely clear what is happening there, as I am a bit surprised that you need to rerun waiting jobs. I would have expected it to become READY without user intervention, so maybe there is some bug in jobflow-remote. Can you provide some more details about the status of the jobs involved? How many parent jobs does the final cleanup job have? Even if they are all COMPLETED or FAILED, does the last one remain WAITING?
I am also afraid that the WAITING job being switched to READY by “jf job rerun” may be a bug in its own right, if the parent states should not allow that.
I’m doing jf job rerun -f; this doesn’t work without the -f.
I have not set on_missing_references for all child jobs, since I want the DFT calculation chain to stop if one job fails. However, I still want to clean up and process what has already run.
I believe that I need to use jf job rerun -f since the DFT child jobs are still WAITING.
If this is unexpected I can provide more details, maybe in a GitHub issue if that’s better.
Thanks for providing more details. I think I understand the point now. The final job depends on all the previous ones, some of which are COMPLETED, at least one FAILED and some still WAITING. So it is correct that the cleanup job does not switch to READY.
However, I think that, given how it is defined, rerun should not switch the job to READY, as it should only work on jobs that have already been executed. So I will fix this in the next release.
I see that it is still helpful in your case, but you can use jf job set-state instead. With that you can set any arbitrary state. In principle it was developed mostly for debugging, but I think it suits your case well.
Actually, if you think it would be interesting, we can also consider creating a jf flow clean command to delete all the files of a specific flow. Although in that case you would always need to run it manually, even if the workflow completes successfully. So I am not sure if it would be more convenient than your current solution.