Painful resolution of LockedWorkflowError with large jobs

rkingsbury · January 20, 2022, 5:07am

Thanks for all the extra explanation @Anubhav_Jain

@rkingsbury Does your workflow have a lot of FireWorks going on in parallel (e.g., not a million workflows with a single firework, but a single workflow with a million fireworks)? If so, that can cause difficult locking issues, but I just want to check if that’s the case.

My Slurm job runs 250-500 FireWorks in parallel at a time (but each is in a separate workflow), so it sounds like some locking issues are expected.

I’ll also add that increasing WFLOCK_EXPIRATION_SECS can’t really hurt, but I would only expect it to help if you really have a ton of jobs finishing at the same time OR your database connection is super slow, in which case it might take more than 3 minutes for a FireWork to wait its “turn” to write to the workflow.

Neither of those two parameters will really help if the problem is:

1. FireWork locks the workflow
2. FireWork starts updating the workflow
3. Node crashes or job walltime hits before FireWork finishes writing and unlocks the Workflow
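To illustrate why a longer wait would not help in the crash scenario above, here is a minimal toy sketch. It assumes the wait setting behaves like a plain timeout on acquiring the lock; `ToyWorkflowLock` and its methods are hypothetical stand-ins written for this example, not FireWorks code.

```python
import time


class ToyWorkflowLock:
    """Hypothetical in-memory stand-in for a workflow lock (not the FireWorks implementation)."""

    def __init__(self):
        self.locked = False

    def acquire(self, expiration_secs, poll_secs=0.01):
        """Poll until the lock is free, giving up after expiration_secs."""
        deadline = time.monotonic() + expiration_secs
        while self.locked:
            if time.monotonic() >= deadline:
                return False  # gave up waiting for our turn
            time.sleep(poll_secs)
        self.locked = True
        return True

    def release(self):
        self.locked = False


lock = ToyWorkflowLock()

# Writer A locks the workflow, then "crashes": release() is never called.
first = lock.acquire(expiration_secs=0.05)

# Writer B times out whether it waits briefly or four times as long,
# because nothing is left running that could release the lock.
short_wait = lock.acquire(expiration_secs=0.05)
long_wait = lock.acquire(expiration_secs=0.2)
print(first, short_wait, long_wait)
```

In this toy model the second writer fails for any finite timeout, which matches the point that a stale lock left by a crashed node is a different problem from ordinary contention.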

At the scale I’ve been running, it’s definitely plausible that I had a lot of FWs finishing around the same time. And if NERSC had some kind of transient database connection speed issue, maybe that caused a backlog that resulted in my higher-than-normal number of locking errors?
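As a rough back-of-the-envelope check on the backlog idea, the sketch below assumes that writers contending for one lock are served strictly one at a time, which may not match how my jobs actually contend. The function name and the per-write times are hypothetical; the 180 s limit is just the 3 minutes mentioned in the quote.

```python
def worst_case_wait_secs(n_finishing, write_secs):
    """Wait for the last writer when n_finishing writers queue on one lock,
    each holding it for write_secs (simplified, strictly serial model)."""
    return (n_finishing - 1) * write_secs


LIMIT_SECS = 180  # the "3 minutes" from the quote above

# Hypothetical per-write times: a fast and a slow database connection.
fast = worst_case_wait_secs(500, 0.1)
slow = worst_case_wait_secs(500, 0.5)

print(fast > LIMIT_SECS)  # fast connection: does the last writer exceed the limit?
print(slow > LIMIT_SECS)  # slow connection: does the last writer exceed the limit?
```

Under these made-up numbers, a fivefold slowdown in write time is enough to push the last writer past the limit, so a transient connection problem could plausibly turn normal contention into locking errors.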

At any rate, it seems clear that these locking problems are, at least to some extent, a byproduct of running so many calcs in parallel, and hence hopefully won’t bother too many other users.