Painful resolution of LockedWorkflowError with large jobs

I am experiencing an increasing number of LockedWorkflowError exceptions when running large batches of calculations (in which hundreds or even thousands of FWs might be running simultaneously), to the point that they have become a considerable obstacle to usability.

My workflows comprise 2 fireworks each, and I normally execute them using a job packing script, so that a certain number of computing nodes are busy for a specified wall time (usually 48hr), continually running fireworks until the time limit is exhausted.

After the job finishes, I am left with a large number of Fireworks in RUNNING status that simply ran out of time, and I need to reset them to READY status before I start the next job. To do this, I use lpad rerun_fws -s RUNNING or lpad detect_lostruns.

Either of the above commands will fail midway through if a Workflow is locked, with something like

(prod-r2scan)[email protected]:~> lpad rerun_fws -q '{"spec.tags":"Dec2021"}' -s RUNNING
Are you sure? This will modify 203 entries. (Y/N)y
Traceback (most recent call last):
  File "/global/cscratch1/sd/rsking84/production-scan/prod-r2scan/bin/lpad", line 8, in <module>
    sys.exit(lpad())
  File "/global/cscratch1/sd/rsking84/production-scan/prod-r2scan/lib/python3.8/site-packages/fireworks/scripts/lpad_run.py", line 1538, in lpad
    args.func(args)
  File "/global/cscratch1/sd/rsking84/production-scan/prod-r2scan/lib/python3.8/site-packages/fireworks/scripts/lpad_run.py", line 630, in rerun_fws
    lp.rerun_fw(int(f), recover_launch=l, recover_mode=args.recover_mode)
  File "/global/cscratch1/sd/rsking84/production-scan/prod-r2scan/lib/python3.8/site-packages/fireworks/core/launchpad.py", line 1717, in rerun_fw
    with WFLock(self, fw_id):
  File "/global/cscratch1/sd/rsking84/production-scan/prod-r2scan/lib/python3.8/site-packages/fireworks/core/launchpad.py", line 139, in __enter__
    raise LockedWorkflowError(f"Could not get workflow - LOCKED: {self.fw_id}")
fireworks.core.launchpad.LockedWorkflowError: Could not get workflow - LOCKED: 364415

The only workaround I know is to run lpad admin unlock -i <fw_id> for the FireWork in question. The problem is that LockedWorkflowError is raised for one FireWork at a time, so I have to cycle through lpad rerun_fws and lpad admin unlock commands repeatedly until I get through all the running FireWorks.

This is made more frustrating by the fact that each of the above steps takes minutes or more to complete. So in order to make my launchpad ready for the next job, instead of running one lpad rerun_fws -s RUNNING command and moving on with my day I have to:

lpad rerun_fws -s RUNNING
wait 1-5 minutes
get LockedWorkflowError
lpad admin unlock
repeat however many times it takes

So what should be a simple task winds up consuming a lot of attention over an extended period (hours).

I have encountered this problem for a long time, but typically in a batch of ~250 running Fireworks there are only 1-5 that are locked. In the last ~month, however, the number of LockedWorkflowError has increased substantially. As an example, after a recent job that left 222 FireWorks in the running state, I am finding that every 5 or 10 FireWorks is locked. I have been iterating through the above steps for about 3 hours already today and have still only managed to rerun 20 of them.

Is there a way to bulk unlock locked fireworks? Or would it be possible to add a --force option to rerun_fws so that it will forcibly rerun even the locked ones?

Thanks in advance for any advice on this!

Hi @rkingsbury,

I don’t know of a long-term resolution to this; I think design changes are warranted. There could also be easier ways to bypass the lock.

To bypass the lock manually, the step I would advise is to edit these two lines in fw_config.py:

WFLOCK_EXPIRATION_SECS = 60 * 5  # wait this long for a WFLock before expiring
WFLOCK_EXPIRATION_KILL = False  # kill WFLock on expiration (or give a warning)

e.g. to change the expiration to 0 seconds, and kill the lock on expiration.
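Concretely, the edited lines would look something like this (a sketch of the change described above, not a tested recommendation; only apply it while nothing is running, and revert it afterwards):

```python
# fw_config.py -- temporary override; revert once the stuck FireWorks are rerun
WFLOCK_EXPIRATION_SECS = 0     # don't wait for an existing WFLock at all
WFLOCK_EXPIRATION_KILL = True  # discard the lock on expiration instead of warning
```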

Note: you should only do this if you are confident that you definitely do not have any workflows currently running.

Best,

Matt

Thanks for the suggestion @mkhorton . To clarify, are these changes that I can leave in place permanently, or should I only edit these lines until I rerun my fireworks, and then change them back? (I don’t have a good understanding of what WFLock does or when it’s invoked.) In this context, do you know what “expiration” means?

I would only make these changes while you have no calculations running.

In general terms, the purpose of a lock is to ensure that operations do not leave the database in an inconsistent state. Ideally, this would mean that if you are doing an update to the database (e.g. retrieving a FireWork, marking it as running, etc.) this would all be done in a single operation that would either succeed or fail.

However, as FireWorks is currently architected, there could be a situation whereby one FWorker pulls a FireWork to run, and then another FWorker pulls the same FireWork to run. The WFLock tries to prevent this (and similar situations) from happening. The idea is that the FireWork will only be locked for a very brief period while it is being updated, and then unlocked – if that is not happening, and it’s taking a long time for the database to be updated, such that you have many FireWorks stuck in a “locked” state, this suggests there are larger problems going on.

The “expiration” is how long FireWorks waits for a locked FireWork to be unlocked. If it remains locked after that duration, the “expiration_kill” boolean decides what action is taken: either the lock is discarded, on the assumption that it is no longer necessary (at the risk of putting the FireWork into an inconsistent state), or a warning is given, which requires manual intervention.
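In rough terms, the wait/expire/kill behavior can be sketched like this (a self-contained simplification using a plain dict as a stand-in for the workflow document; the real WFLock acquires the lock with an atomic MongoDB find_one_and_update, not a dict write):

```python
import time

def acquire_lock(wf_doc, timeout_s=300, kill_on_expire=False, poll_s=0.05):
    """Sketch of WFLock-style behavior: poll until the workflow document is
    unlocked; on timeout either steal the lock (kill) or raise an error."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if not wf_doc.get("locked"):
            wf_doc["locked"] = True  # done atomically in the real implementation
            return True
        time.sleep(poll_s)  # someone else holds the lock; keep waiting
    if kill_on_expire:
        wf_doc["locked"] = True  # discard the stale lock and take over (risky)
        return True
    raise RuntimeError("Could not get workflow - LOCKED")
```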

Again, in general terms, it’s better not to have to implement your own locks; that sort of thing should be left purely for the database system to worry about, but I’m sure there was a good reason it was done in this instance.

This is my current understanding!

Best,

Matt

Thanks for elaborating Matt! That helps.

@rkingsbury Does your workflow have a lot of FireWorks going on in parallel (e.g., not a million workflows with a single firework, but a single workflow with a million fireworks)? If so that can cause locking issues that are difficult but I just want to check if that’s the case.

The goal of locking is as follows: let’s say FireWorks A and B are running in parallel as part of the same workflow. If FireWork A completes, it needs to update some keys in the workflow document itself (e.g., workflow state). While it does that update, we don’t want FireWork B to also complete and try to update the workflow document simultaneously before FireWork A completes its write.
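As a self-contained toy illustration of that scenario (a plain dict stands in for the workflow document here; the real code takes the lock with an atomic MongoDB update):

```python
def try_lock(wf_doc):
    """Attempt to take the workflow lock; succeeds only if nobody holds it."""
    if "locked" in wf_doc:
        return False  # another FireWork is mid-update; caller must wait/retry
    wf_doc["locked"] = True
    return True

def complete_firework(wf_doc, fw_id):
    """A FireWork finishing: lock, update the workflow state, unlock."""
    if not try_lock(wf_doc):
        raise RuntimeError(f"Could not get workflow - LOCKED: {fw_id}")
    wf_doc["fw_states"][fw_id] = "COMPLETED"  # the update the lock protects
    del wf_doc["locked"]  # unlock so the other FireWork can proceed

wf = {"fw_states": {"A": "RUNNING", "B": "RUNNING"}}
complete_firework(wf, "A")  # A locks, writes, unlocks
complete_firework(wf, "B")  # B can now do the same without clobbering A's write
```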


I’ll also add that increasing WFLOCK_EXPIRATION_SECS can’t really hurt, but I would only expect it to help if you really have a ton of jobs finishing at the same time OR your database connection is super slow, which is why it might take more than 3 minutes for a FireWork to wait its “turn” to write to the workflow. The only downside to increasing this parameter is that you might spend longer waiting around for nothing.

For WFLOCK_EXPIRATION_KILL, you risk having inconsistent workflows and problems if you turn this on. But IIRC it’s no worse than always manually killing locks (just that it’ll happen without you knowing it, so it might get out of hand). I didn’t have a chance to review the code again, but I think this will just kill locks when the expiration time hits … e.g. if a FireWork can’t get its turn to write to the Workflow, it will just do it anyway. If you do turn this on, you might want to bump up the expiration time first as well.

Also I just re-read the original thread …

Neither of those two parameters will really help if the problem is:

  • FireWork locks the workflow
  • FireWork starts updating the workflow
  • Node crashes or job walltime hits before FireWork finishes writing and unlocks the Workflow

If that leads to locked workflows you do need to manually unlock them … having a more usable way to do this would be nice but was never implemented. One thing you can do is try to make sure you don’t hit walltime, e.g. leave more room than is necessary for your jobs … to the extent this is possible.
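In the meantime, the unlock-and-retry cycle can at least be scripted rather than done by hand. A rough sketch (the regex is based on the traceback earlier in the thread, and the subprocess invocations of lpad are untested assumptions; the `rerun_until_done` helper is hypothetical, not part of FireWorks):

```python
import re
import subprocess

LOCKED_RE = re.compile(r"LOCKED: (\d+)")

def parse_locked_fw_id(output_text):
    """Extract the fw_id from a LockedWorkflowError traceback, or None."""
    m = LOCKED_RE.search(output_text)
    return int(m.group(1)) if m else None

def rerun_until_done(max_rounds=100):
    """Alternate `lpad rerun_fws` and `lpad admin unlock` until no lock error."""
    for _ in range(max_rounds):
        proc = subprocess.run(
            ["lpad", "rerun_fws", "-s", "RUNNING"],
            input="y\n",  # answer the "Are you sure?" confirmation prompt
            capture_output=True, text=True,
        )
        fw_id = parse_locked_fw_id(proc.stderr + proc.stdout)
        if fw_id is None:
            break  # no LockedWorkflowError this round; we're done
        subprocess.run(["lpad", "admin", "unlock", "-i", str(fw_id)])
```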

Thanks for all the extra explanation @Anubhav_Jain :+1:

@rkingsbury Does your workflow have a lot of FireWorks going on in parallel (e.g., not a million workflows with a single firework, but a single workflow with a million fireworks)? If so that can cause locking issues that are difficult but I just want to check if that’s the case.

My slurm job runs 250-500 FireWorks in parallel at a time (but each is in a separate workflow), so it sounds like some locking issues are expected.

I’ll also add that increasing the WFLOCK_EXPIRATION_SECS can’t really hurt but I would only expect it would help if you really have a ton of jobs finishing at the same time OR your database connection is super slow, which is why it might take more than 3 minutes for a FireWork to wait its “turn” to write to the workflow.

Neither of those two parameters will really help if the problem is:

FireWork locks the workflow
FireWork starts updating the workflow
Node crashes or job walltime hits before FireWork finishes writing and unlocks the Workflow

At the scale I’ve been running, it’s definitely plausible that I had a lot of FWs finishing around the same time. And if NERSC had some kind of transient database connection speed issue, maybe that caused a backlog that resulted in my higher-than-normal number of locking errors?

At any rate, it seems clear that these locking problems are at least to some extent a byproduct of running so many calcs in parallel, and hence hopefully won’t bother too many other users.