I’m running many flows of VASP jobs, each followed by a post-processing and cleanup job. The post-processing and cleanup job takes a list of all the vasp_dirs as input.
I’d like to post-process all successful jobs - but if the last job of the VASP chain fails, no post-processing occurs.
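Roughly, the structure looks something like this (a minimal sketch with hypothetical job names standing in for the real atomate2 VASP jobs and my post-processing code):

```python
from jobflow import Flow, job

# Hypothetical stand-ins for the real atomate2 VASP jobs and the cleanup job.
@job
def run_vasp(structure):
    # ... run VASP, return the structure and the directory it ran in
    return {"structure": structure, "dir_name": "/path/to/vasp_dir"}

@job
def postprocess_and_cleanup(vasp_dirs):
    # ... parse the results from all vasp_dirs, then delete the large files
    return None

# A chain of VASP jobs, each depending on the previous one.
relax = run_vasp("initial structure")
static = run_vasp(relax.output["structure"])

# The final job takes the list of all vasp_dirs as input,
# so it depends on every VASP job in the chain.
cleanup = postprocess_and_cleanup([relax.output["dir_name"], static.output["dir_name"]])

flow = Flow([relax, static, cleanup])
```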
Do you have any suggestions for how best to do this?
Is there a way to stop this post processing from being cancelled, whilst still using atomate2 classes?
I see that you already marked this as solved, and I don’t know if this may be the reason you are seeing this issue. But usually, if a VASP job does not complete successfully and is marked as FAILED, the children do not switch to the READY state (both in jobflow-remote and FireWorks).
A way to allow the child job to run even if the parents fail is to set on_missing_references in the JobConfig of the cleanup Job:
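Something along these lines (a minimal sketch, reusing the hypothetical postprocess_and_cleanup job from the sketch above; OnMissing comes from jobflow itself):

```python
from jobflow import OnMissing

# With OnMissing.NONE, references to the outputs of failed parents are
# resolved to None instead of raising an error, so the cleanup job can
# still run once all of its parents have finished.
cleanup.config.on_missing_references = OnMissing.NONE
```

The cleanup function then has to tolerate None entries in vasp_dirs, e.g. by filtering them out with `[d for d in vasp_dirs if d is not None]`.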
However, it is not entirely clear what is happening there, as I am a bit surprised that you need to rerun waiting jobs. I would have expected it to become READY without user intervention, so maybe there is some bug in jobflow-remote. Can you provide some more details about the status of the jobs involved? How many parent jobs does the final cleanup job have? Even if they are all COMPLETED or FAILED, does the last one remain WAITING?
I am also afraid that the WAITING job being switched to READY by “jf job rerun” may be a bug in its own right, if the parent states should not allow that.
I’m doing jf job rerun -f; this doesn’t work without the -f.
I have not set on_missing_references for all child jobs, since I want the DFT calculation chain to stop if one job fails. However, I still want to clean up and process what has already run.
I believe that I need to use jf job rerun -f since the DFT child jobs are still WAITING.
If this is unexpected I can provide more details, maybe in a GitHub issue if that’s better.
Thanks for providing more details. I think I understand the point now. The final job depends on all the previous ones, some of which are COMPLETED, at least one FAILED and some still WAITING. So it is correct that the cleanup job does not switch to READY.
However, I think that, given how it is defined, rerun should not switch the job to READY, as it should only work on jobs that have already been executed. So I will fix this in the next release.
I see that it is still helpful in your case, but you can use jf job set-state instead. With that you can set any arbitrary state. In principle it was developed mostly for debugging, but I think it suits your case well.
Actually, if you think it would be interesting, we can also consider creating a jf flow clean command to delete all the files of a specific flow. Although in that case you would always need to run it manually, even if the workflow completes successfully. So I am not sure if it would be more convenient than your current solution.