FW locks in large workflow

Hello FireWorks Users,

Background about my workflow

I am using FireWorks on LBNL’s NERSC Cori system. I have a workflow that contains ~50,000 Fireworks with no interdependencies. Each Firework consists of two Firetasks:

  1. ScriptTask which executes a parallel program (currently ~3 min CPU time per execution)

  2. ScriptTask which runs a small Python script that processes some output

I am running on the 68-core KNL nodes, so I am using ‘rocket_launch: rlaunch multi 34’ in my qadapter script, with 2 threads in my ScriptTasks for the parallel program.

Hundreds of thousands of these fireworks will eventually need to be run with an even more expensive version of the firetask #1 software.

Issue

When I have only a few thousand of these Fireworks in my workflow, everything runs perfectly well. However, when I scale up to ~50,000, I begin to get many errors of the following type in the FW_job%.out file after ~10 Fireworks have successfully completed:

2018-07-05 15:13:52,532 INFO fw_id 63526 locked. Can’t refresh!

I believe the Fireworks’ lock attempts are expiring: they wait longer than the WFLOCK_EXPIRATION_SECS limit set in the config file without getting a lock on the DB, because too many Fireworks finish around the same time and the database updates apparently take too long. I have extended WFLOCK_EXPIRATION_SECS from 5 min to 10 min, but this did not solve the problem. In any case, I can’t afford even 5 minutes of downtime between Firework completions.
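
As a rough illustration of that hypothesis, here is a toy queueing model in Python. It is not how FireWorks actually schedules lock attempts: it simply assumes that every finishing Firework requests the same lock at the same moment and that requests are served one at a time. The number of waiters and the per-update time are hypothetical values picked for the example; only the 5 min and 10 min expiration values come from the description above.

```python
def count_expired_waiters(n_waiters, update_secs, expiration_secs):
    """Count waiters whose queueing delay exceeds the lock expiration.

    Toy model: all waiters request one shared lock at t=0 and are served
    one at a time, so the k-th waiter (0-based) waits k * update_secs.
    """
    return sum(1 for k in range(n_waiters) if k * update_secs > expiration_secs)


# Hypothetical numbers, chosen only for illustration.
N_FINISHING = 340   # assumed: 10 nodes x 34 concurrent rockets finishing together
UPDATE_SECS = 2.0   # assumed time to update one very large workflow document

# WFLOCK_EXPIRATION_SECS at 5 min and at 10 min, as tried above.
for expiration_secs in (300, 600):
    expired = count_expired_waiters(N_FINISHING, UPDATE_SECS, expiration_secs)
    print(f"expiration={expiration_secs}s -> {expired} waiters give up")
```

Under these assumptions, doubling the expiration reduces the number of expired lock attempts but does not remove them, because the queue behind the single lock is longer than either limit.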

Setting WFLOCK_EXPIRATION_KILL to True does not solve my problem either, because then too many Fireworks force locks on the database.

I have tried the ‘lpad admin maintain --infinite --maintain_interval 60’ command to no avail. I have also poked around with the database profiler but I am not sure what to look at.

I have heard that adding indices to the LaunchPad may improve update speed. I know this is done in the format LaunchPad(…, user_indices=['spec.parameter1'], …), but I am not sure what ‘parameter1’ should be.

Is a fix possible or is my workflow already too large?

Please let me know if more information would be helpful.

I would greatly appreciate any advice.

Thank you.

Hi Michael

A quick question before getting too detailed - if there are no interdependencies between the Fireworks, then why not use a single workflow? The main reason to put multiple FWs into one workflow is to ensure that dependencies are executed correctly.

I am asking because the lock is specific to a workflow. If you instead used 50,000 workflows, each with a single FW (instead of 1 workflow with 50,000 Fireworks) then you probably wouldn’t run into the locking issue.
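
A minimal sketch of what that restructuring could look like in Python, assuming a LaunchPad that is already configured. The executable name, script name, case names, and the two helper functions are hypothetical placeholders, not part of the original workflow; the FireWorks import is done inside the submit function so that the command-building helper can be used on its own.

```python
def build_commands(case_names):
    """Return one (run, postprocess) shell-command pair per case.

    The program and script names are hypothetical placeholders.
    """
    return [
        (f"./parallel_program --case {name}", f"python process_output.py {name}")
        for name in case_names
    ]


def submit_one_workflow_per_firework(launchpad, commands):
    """Add each command pair to the LaunchPad as its own single-Firework workflow."""
    # Imported here so build_commands() stays usable without FireWorks installed.
    from fireworks import Firework, ScriptTask, Workflow

    for index, (run_cmd, post_cmd) in enumerate(commands):
        # Same two-task layout as before: run the program, then post-process.
        fw = Firework(
            [ScriptTask.from_str(run_cmd), ScriptTask.from_str(post_cmd)],
            name=f"case_{index}",
        )
        # One Workflow per Firework, so no two Fireworks share a workflow lock.
        launchpad.add_wf(Workflow([fw], name=f"case_{index}"))
```

With a configured LaunchPad object, for example one obtained from `LaunchPad.auto_load()`, this would be called as `submit_one_workflow_per_firework(lpad, build_commands(["a", "b", "c"]))`.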

Best,

Anubhav

You received this message because you are subscribed to the Google Groups “fireworkflows” group.

To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].

To post to this group, send email to [email protected].

Visit this group at https://groups.google.com/group/fireworkflows.

To view this discussion on the web visit https://groups.google.com/d/msgid/fireworkflows/fb576419-616e-4880-9184-3580b0bd84bd%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.


Quick typo correction: “then why not use a single workflow?” should have read “then why use only a single workflow (instead of putting each FW in its own workflow)?”

Best,
Anubhav

Hello Anubhav,

I have implemented this fix and it is working great. I’m not sure why I didn’t try this earlier!

Thank you for such a fast reply.

Best,

Michael

Hi Michael,

That’s great to hear!

Let us know if you encounter any more issues.

Best,

Anubhav
