FW locks in large workflow

michael.martin · July 5, 2018, 11:06pm

Hello FireWorks Users,

Background about my workflow

I am using FireWorks on LBNL’s NERSC Cori system. I have a workflow which contains ~50,000 fireworks with no interdependencies. Each firework consists of two firetasks:

1. A ScriptTask which executes a parallel program (currently ~3 min CPU time per execution)
2. A ScriptTask which runs a small Python script that processes some output

I am running on the 68-core KNL nodes, so I am using ‘rocket_launch: rlaunch multi 34’ in my qadapter script, with 2 threads in my ScriptTasks for the parallel program.

Hundreds of thousands of these fireworks will eventually need to be run with an even more expensive version of the firetask #1 software.

Issue

When I only have a few thousand of these fireworks in my workflow, everything runs perfectly well. However, when I scale up to ~50,000, I begin to get many errors of the following type in the FW_job%.out file after ~10 fireworks have successfully completed:

2018-07-05 15:13:52,532 INFO fw_id 63526 locked. Can’t refresh!

I believe the fireworks’ lock attempts are expiring: they wait longer than the WFLOCK_EXPIRATION_SECS limit set in the config file without getting a lock on the DB, because too many fireworks finish at around the same time and the database updates apparently take too long. I have extended WFLOCK_EXPIRATION_SECS from 5 min to 10 min, but this did not solve the problem. In any case, I can’t afford even 5 minutes of downtime between firework completions.
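
As a rough illustration of the contention described above, here is a back-of-the-envelope sketch. It assumes every completion has to hold the single workflow lock while the database is updated, and that each firework takes about 3 minutes of wall time (the post only gives CPU time). The function name and the node counts are made up for the example; they are not FireWorks settings.

```python
def max_lock_hold_seconds(concurrent_rockets, runtime_seconds):
    """Longest average time one completion may hold the lock before
    completions arrive faster than the lock can serve them."""
    completions_per_second = concurrent_rockets / runtime_seconds
    return 1.0 / completions_per_second


# 34 rockets per node comes from 'rlaunch multi 34'; 180 s is an assumed wall time.
for nodes in (1, 4, 16):  # hypothetical node counts
    limit = max_lock_hold_seconds(34 * nodes, 180)
    print(f"{nodes:>2} node(s): each workflow update must finish in under {limit:.2f} s")
```

If a single update of a 50,000-firework workflow takes longer than this limit, waiting rockets pile up faster than the lock is released, which would be consistent with the expirations seen here.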

Setting WFLOCK_EXPIRATION_KILL to True does not solve my problem, because then too many fireworks force locks on the database.

I have tried the ‘lpad admin maintain --infinite --maintain_interval 60’ command to no avail. I have also poked around with the database profiler but I am not sure what to look at.

I have heard that adding indices to the LaunchPad may improve update speed. I know I can do this in the format LaunchPad(…, user_indices=['spec.parameter1'], …) but I am not sure what ‘parameter1’ should be.

Is a fix possible or is my workflow already too large?

Please let me know if more information would be helpful.

I would greatly appreciate any advice.

Thank you.

Anubhav_Jain · July 5, 2018, 11:16pm

Hi Michael

A quick question before getting too detailed - if there are no interdependencies between the Fireworks, then why not use a single workflow? The main reason to put multiple FWs into one workflow is to ensure that dependencies are executed correctly.

I am asking because the lock is specific to a workflow. If you instead used 50,000 workflows, each with a single FW (instead of 1 workflow with 50,000 Fireworks) then you probably wouldn’t run into the locking issue.
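
A minimal sketch of that suggestion, for readers who want to see it spelled out. The builder creates one Firework document per run in FireWorks’ serialized dict form, and the second function adds each one as its own workflow. The script commands, run ids, and function names are placeholders invented for illustration, and the submission step assumes FireWorks is installed and a LaunchPad is configured (it is not called here).

```python
import json


def build_single_fw_docs(run_ids):
    """Return one Firework document (plain dict) per run, each with two ScriptTasks."""
    docs = []
    for run_id in run_ids:
        docs.append({
            "name": f"run_{run_id}",
            "spec": {
                "_tasks": [
                    # Hypothetical commands standing in for the two ScriptTasks.
                    {"_fw_name": "ScriptTask",
                     "script": f"./parallel_program --case {run_id}"},
                    {"_fw_name": "ScriptTask",
                     "script": f"python process_output.py {run_id}"},
                ]
            },
        })
    return docs


def submit_as_separate_workflows(docs):
    """Add each document as its own one-Firework workflow.

    Needs FireWorks and a reachable LaunchPad, so the import is done lazily.
    """
    from fireworks import Firework, LaunchPad

    launchpad = LaunchPad.auto_load()
    for doc in docs:
        # add_wf also accepts a bare Firework and wraps it in its own workflow.
        launchpad.add_wf(Firework.from_dict(doc))


docs = build_single_fw_docs(range(3))
print(json.dumps(docs[0], indent=2))
```

Because every Firework ends up in a separate workflow, a finishing Firework only locks its own small workflow document instead of competing for one shared lock.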

Best,

Anubhav

–

You received this message because you are subscribed to the Google Groups “fireworkflows” group.

To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].

To post to this group, send email to [email protected].

Visit this group at https://groups.google.com/group/fireworkflows.

To view this discussion on the web visit https://groups.google.com/d/msgid/fireworkflows/fb576419-616e-4880-9184-3580b0bd84bd%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Anubhav_Jain · July 5, 2018, 11:18pm

Quick typo correction: “then why not use a single workflow?” should have read “then why use only a single workflow (instead of putting each FW in its own workflow)?”

Best,
Anubhav

michael.martin · July 6, 2018, 8:52pm

Hello Anubhav,

I have implemented this fix and it is working great. I’m not sure why I didn’t try this out!

Thank you for such a fast reply.

Best,

Michael

Anubhav_Jain · July 7, 2018, 12:39am

Hi Michael,

That’s great to hear!

Let us know if you encounter any more issues.

Best,

Anubhav
