Hello FireWorks Users,
Background about my workflow
I am using fireworks on LBNL’s NERSC Cori system. I have a workflow which contains ~50,000 fireworks with no interdependencies. Each firework consists of two firetasks:
A ScriptTask which executes a parallel program (currently ~3 min CPU time per execution)
A ScriptTask which runs a small Python script that processes some of the output
I am running on the 68-core KNL nodes, so I use ‘rocket_launch: rlaunch multi 34’ in my qadapter file, with 2 threads per ScriptTask for the parallel program.
Hundreds of thousands of these fireworks will eventually need to be run, with an even more expensive version of the software in firetask #1.
When I have only a few thousand of these fireworks in my workflow, everything runs perfectly well. However, when I scale up to ~50,000, I begin to get many errors of the following type in the FW_job%.out file after ~10 fireworks have successfully completed:
2018-07-05 15:13:52,532 INFO fw_id 63526 locked. Can’t refresh!
I believe the lock attempts are expiring because the fireworks wait longer than the WFLOCK_EXPIRATION_SECS config parameter allows without getting a lock on the DB: too many fireworks finish around the same time, and the database updates apparently take too long. I extended WFLOCK_EXPIRATION_SECS from 5 min to 10 min, but this did not solve the problem. In any case, I can’t afford even 5 minutes of downtime between firework completions.
Setting WFLOCK_EXPIRATION_KILL to True does not solve my problem either, because then too many fireworks force locks on the database.
I have tried the ‘lpad admin maintain --infinite --maintain_interval 60’ command to no avail. I have also poked around with the database profiler but I am not sure what to look at.
I have heard that adding indices to the LaunchPad may improve update speed. I know this is done in the form LaunchPad(…, user_indices=['spec.parameter1'], …), but I am not sure what ‘parameter1’ should be.
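To make the question concrete, here is a minimal sketch of how I understand the call would be put together. The helper names and the ‘parameter1’ key are placeholders of mine, not FireWorks names. Only LaunchPad and its user_indices argument come from FireWorks, and the connection arguments are assumed to be whatever my LaunchPad already uses.

```python
# Sketch only: "parameter1" is a placeholder spec key, not a real field in my workflow.

def launchpad_index_kwargs(spec_keys):
    """Build the user_indices keyword argument for LaunchPad from spec key names."""
    return {"user_indices": ["spec." + key for key in spec_keys]}


def make_launchpad(spec_keys, **connection_kwargs):
    """Create a LaunchPad with the extra indices (needs fireworks and a reachable MongoDB)."""
    # Imported lazily so the helper above can be run without fireworks installed.
    from fireworks import LaunchPad

    return LaunchPad(**connection_kwargs, **launchpad_index_kwargs(spec_keys))


if __name__ == "__main__":
    # Show the keyword argument that would be passed on to LaunchPad.
    print(launchpad_index_kwargs(["parameter1"]))
```

What I cannot tell is which spec keys, if any, are worth listing here for a workflow like mine.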
Is a fix possible or is my workflow already too large?
Please let me know if more information would be helpful.
I would greatly appreciate any advice.