I am experiencing an increasing number of LockedWorkflowError exceptions when running large batches of calculations (in which hundreds or even 1,000 Fireworks might be running simultaneously), to the point that they have become a considerable obstacle to usability.
My workflows comprise two Fireworks each, and I normally execute them using a job packing script, so that a certain number of compute nodes are kept busy for a specified wall time (usually 48 hours), continually running Fireworks until the time limit is exhausted.
After the job finishes, I am left with a large number of Fireworks in RUNNING status that simply ran out of time, and I need to reset them to READY status before I start the next job. To do this, I use lpad rerun_fws -s RUNNING or lpad detect_lostruns. Either of these commands will fail midway through if a Workflow is locked, with something like
(prod-r2scan)rsking84@cori01:~> lpad rerun_fws -q '{"spec.tags":"Dec2021"}' -s RUNNING
Are you sure? This will modify 203 entries. (Y/N)y
Traceback (most recent call last):
  File "/global/cscratch1/sd/rsking84/production-scan/prod-r2scan/bin/lpad", line 8, in <module>
    sys.exit(lpad())
  File "/global/cscratch1/sd/rsking84/production-scan/prod-r2scan/lib/python3.8/site-packages/fireworks/scripts/lpad_run.py", line 1538, in lpad
    args.func(args)
  File "/global/cscratch1/sd/rsking84/production-scan/prod-r2scan/lib/python3.8/site-packages/fireworks/scripts/lpad_run.py", line 630, in rerun_fws
    lp.rerun_fw(int(f), recover_launch=l, recover_mode=args.recover_mode)
  File "/global/cscratch1/sd/rsking84/production-scan/prod-r2scan/lib/python3.8/site-packages/fireworks/core/launchpad.py", line 1717, in rerun_fw
    with WFLock(self, fw_id):
  File "/global/cscratch1/sd/rsking84/production-scan/prod-r2scan/lib/python3.8/site-packages/fireworks/core/launchpad.py", line 139, in __enter__
    raise LockedWorkflowError(f"Could not get workflow - LOCKED: {self.fw_id}")
fireworks.core.launchpad.LockedWorkflowError: Could not get workflow - LOCKED: 364415
The only workaround I know of is to run lpad admin unlock -i <fw_id> for the Firework in question. The problem is that the LockedWorkflowError is raised for one Firework at a time, so I have to repeatedly cycle through lpad rerun_fws and lpad admin unlock commands until I get through all the running Fireworks.
This is made more frustrating by the fact that each of the above steps takes minutes or more to complete. So in order to make my launchpad ready for the next job, instead of running one lpad rerun_fws -s RUNNING command and moving on with my day, I have to:
lpad rerun_fws -s RUNNING
wait 1-5 minutes
get LockedWorkflowError
lpad admin unlock
repeat however many times it takes
So what should be a simple task winds up consuming a lot of attention over an extended period (hours).
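For illustration, the rerun/unlock cycle listed above could in principle be scripted against the Python API instead of the lpad command line. The following is only a minimal sketch, not a tested solution. The helper name rerun_with_unlock and the functions main and force_unlock are hypothetical names introduced here. LaunchPad.auto_load, get_fw_ids, rerun_fw, WFLock and LockedWorkflowError are existing FireWorks names (the last three appear in the traceback above), but the assumption that entering WFLock with expire_secs=0 and kill=True forcibly releases a stale lock, as lpad admin unlock appears to do, should be checked against the installed FireWorks version.

```python
from typing import Callable, Iterable, List, Type


def rerun_with_unlock(
    fw_ids: Iterable[int],
    rerun: Callable[[int], object],
    unlock: Callable[[int], object],
    locked_error: Type[Exception],
) -> List[int]:
    """Rerun each Firework; on a lock error, unlock once and retry.

    Returns the ids that could not be rerun even after the unlock attempt.
    """
    failed = []
    for fw_id in fw_ids:
        try:
            rerun(fw_id)
        except locked_error:
            try:
                unlock(fw_id)
                rerun(fw_id)
            except locked_error:
                # Still locked after one unlock attempt: record it and move on
                # rather than aborting the whole batch.
                failed.append(fw_id)
    return failed


def main() -> None:
    # Imported here so the helper above can be exercised without FireWorks.
    from fireworks import LaunchPad
    from fireworks.core.launchpad import LockedWorkflowError, WFLock

    lp = LaunchPad.auto_load()

    def force_unlock(fw_id: int) -> None:
        # ASSUMPTION: with expire_secs=0 and kill=True, entering the lock
        # forcibly takes over a stale lock and leaving it releases the lock.
        with WFLock(lp, fw_id, expire_secs=0, kill=True):
            pass

    # Same selection as the lpad command shown in the traceback above.
    fw_ids = lp.get_fw_ids({"state": "RUNNING", "spec.tags": "Dec2021"})
    still_locked = rerun_with_unlock(
        fw_ids, lp.rerun_fw, force_unlock, LockedWorkflowError
    )
    print(f"Could not rerun: {still_locked}")
```

Calling main() would run the loop against the launchpad found by LaunchPad.auto_load. Forcibly releasing a lock is only safe when no job is still writing to that workflow, which should hold here because the packing job has already hit its wall time.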
I have encountered this problem for a long time, but typically in a batch of ~250 running Fireworks there are only 1-5 that are locked. In the last ~month, however, the number of LockedWorkflowErrors has increased substantially. As an example, after a recent job that left 222 Fireworks in the RUNNING state, I am finding that roughly one in every 5 to 10 Fireworks is locked. I have been iterating through the above steps for about 3 hours already today and have still only managed to rerun 20 of them.
Is there a way to bulk unlock locked Fireworks? Or would it be possible to add a --force option to rerun_fws so that it will forcibly rerun even the locked ones?
Thanks in advance for any advice on this!