queue launcher crashes with empty queue

Hi,

I’m trying to run a long workflow that dynamically creates new fireworks at every iteration. I’m running the workflow with a queue launcher in infinite mode. Usually after around 5 iterations (50-100 fireworks) the queue launcher crashes as follows:

2017-02-07 22:56:21,500 INFO Sleeping for 5 seconds…zzz…

2017-02-07 22:56:26,592 INFO Launching a rocket!

2017-02-07 22:56:26,616 INFO No jobs exist in the LaunchPad for submission to queue!

2017-02-07 22:56:26,616 ERROR ----|vvv|----

2017-02-07 22:56:26,616 ERROR Error with queue launcher rapid fire!

2017-02-07 22:56:26,618 ERROR Traceback (most recent call last):

File "/atlas/u/jkuck/software/anaconda2/envs/anaconda_venv/lib/python2.7/site-packages/fireworks/queue/queue_launcher.py", line 216, in rapidfire

raise RuntimeError("Launch unsuccessful!")

RuntimeError: Launch unsuccessful!

2017-02-07 22:56:26,619 ERROR ----|^^^|----

It looks like the queue launcher thinks a firework is ready to launch, but then finds the queue is empty after calling launch_rocket_to_queue(). Any tips would be appreciated!

Thanks,
Jonathan

Just to get a bit more info, does the issue persist when you restart the queue launcher? Also, are you using fill mode?

Best,

Joey

You received this message because you are subscribed to the Google Groups “fireworkflows” group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/fireworkflows.
To view this discussion on the web visit https://groups.google.com/d/msgid/fireworkflows/848dd390-ba00-4ad9-8daf-815882c89347%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Yes, the queue launcher crashes again after being restarted. I’m calling the queue launcher with fill_mode=False:

rapidfire(launchpad, FWorker(), qadapter, launch_dir='.', nlaunches='infinite', njobs_queue=20,
          njobs_block=500, sleep_time=None, reserve=False, strm_lvl='INFO', timeout=None,
          fill_mode=False)

Thanks,

Jonathan

Hi Jonathan

Two things:

  1. Can you paste the output of “lpad get_fws -s READY -d count” after the script crashes?

  2. Would you mind running the script again with strm_lvl="DEBUG" and pasting the output again?

I haven’t seen or heard of this error before so it might take a little back and forth to figure out what’s happening.

Best,

Anubhav
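
For anyone who wants to repeat the READY check from Python instead of the command line, here is a minimal sketch. It assumes the LaunchPad object exposes get_fw_ids(query, count_only=True); the helper names count_ready and sample_ready_counts are made up for illustration and are not part of FireWorks. Sampling the count a few times shows whether it fluctuates around the moment of the crash.

```python
import time


def count_ready(launchpad):
    """Return how many fireworks the given LaunchPad reports as READY."""
    # Assumption: `launchpad` offers get_fw_ids(query, count_only=True),
    # as I believe fireworks.LaunchPad does. Check this against your version.
    return launchpad.get_fw_ids({"state": "READY"}, count_only=True)


def sample_ready_counts(launchpad, samples=5, pause=1.0, sleep=time.sleep):
    """Poll the READY count several times and return the list of counts."""
    counts = []
    for i in range(samples):
        counts.append(count_ready(launchpad))
        if i < samples - 1:
            sleep(pause)  # wait between samples, but not after the last one
    return counts
```

With a real database this would presumably be called as sample_ready_counts(LaunchPad.auto_load()), assuming auto_load() picks up your usual LaunchPad configuration.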

Hi Anubhav,

Thanks a lot for the help. Here’s the info:

  1. Can you paste the output of “lpad get_fws -s READY -d count” after the script crashes?

I’ve tried this after two crashes now. The first was ‘1’ and the second ‘2’.

  2. Would you mind running the script again with strm_lvl="DEBUG" and pasting the output again?

Here is the output; I’ve included a successful submission as well:

2017-02-08 10:53:34,428 INFO Job submission was successful and job_id is 1176878

2017-02-08 10:53:34,428 INFO Sleeping for 5 seconds…zzz…

2017-02-08 10:53:39,455 INFO Finished a round of launches, sleeping for 60 secs

2017-02-08 10:54:39,516 INFO Checking for Rockets to run…

2017-02-08 10:54:39,555 INFO The number of jobs currently in the queue is: 0

2017-02-08 10:54:39,555 INFO 0 jobs in queue. Maximum allowed by user: 20

2017-02-08 10:54:39,640 INFO Launching a rocket!

2017-02-08 10:54:39,647 DEBUG getting queue adapter

2017-02-08 10:54:39,733 INFO Created new dir /atlas/u/jkuck/rbpf_fireworks/block_2017-02-08-17-35-21-007249/launcher_2017-02-08-18-54-39-731710

2017-02-08 10:54:39,733 INFO moving to launch_dir /atlas/u/jkuck/rbpf_fireworks/block_2017-02-08-17-35-21-007249/launcher_2017-02-08-18-54-39-731710

2017-02-08 10:54:39,734 DEBUG writing queue script

2017-02-08 10:54:39,740 INFO submitting queue script

2017-02-08 10:54:41,842 INFO Job submission was successful and job_id is 1176879

2017-02-08 10:54:41,843 INFO Sleeping for 5 seconds…zzz…

2017-02-08 10:54:46,933 INFO Launching a rocket!

2017-02-08 10:54:46,940 DEBUG getting queue adapter

2017-02-08 10:54:46,961 INFO No jobs exist in the LaunchPad for submission to queue!

2017-02-08 10:54:46,961 ERROR ----|vvv|----

2017-02-08 10:54:46,962 ERROR Error with queue launcher rapid fire!

2017-02-08 10:54:46,965 ERROR Traceback (most recent call last):

File "/atlas/u/jkuck/software/anaconda2/envs/anaconda_venv/lib/python2.7/site-packages/fireworks/queue/queue_launcher.py", line 216, in rapidfire

raise RuntimeError("Launch unsuccessful!")

RuntimeError: Launch unsuccessful!

2017-02-08 10:54:46,965 ERROR ----|^^^|----

Best,

Jonathan
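
A small sketch for reading logs like the one above: it groups the timestamped lines into launch attempts (each starting at "Launching a rocket!") and reports which attempts got as far as creating a launch directory. The function names and the regular expression are my own and only assume the "date time,ms LEVEL message" layout shown in this thread.

```python
import re

# Matches lines such as "2017-02-08 10:54:39,640 INFO Launching a rocket!"
LOG_LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) (\w+) (.*)$")


def split_launch_attempts(log_text):
    """Group log messages into attempts, each starting at 'Launching a rocket!'."""
    attempts = []
    current = None
    for raw in log_text.splitlines():
        match = LOG_LINE.match(raw.strip())
        if not match:
            continue  # blank lines and traceback lines carry no timestamp
        _, level, message = match.groups()
        if message.startswith("Launching a rocket!"):
            current = []
            attempts.append(current)
        if current is not None:
            current.append((level, message))
    return attempts


def summarize(attempt):
    """Report which stages a single launch attempt reached."""
    messages = [message for _, message in attempt]
    return {
        "created_dir": any(m.startswith("Created new dir") for m in messages),
        "submitted": any(m.startswith("Job submission was successful") for m in messages),
        "no_jobs": any(m.startswith("No jobs exist in the LaunchPad") for m in messages),
    }
```

On the log above, the first attempt creates a directory and submits, while the second reports no jobs without ever creating a directory.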

Hi Jonathan

Ok, unfortunately that doesn’t provide much additional information, although it does seem like there are READY jobs to run in the queue.

Can you try something else? Immediately after the crash, “cd” to the directory listed by the qlauncher and try to submit the script manually. For a PBS queue system, for example, this would mean typing “qsub FW_submit.script”. The directory is listed in the debug output you printed; for your previous attempt it was /atlas/u/jkuck/rbpf_fireworks/block_2017-02-08-17-35-21-007249/launcher_2017-02-08-18-54-39-731710.

Sometimes, manually submitting the script can help clarify what errors (if any) are being thrown by the queuing system.

Best,

Anubhav
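
If typing the command by hand is inconvenient, the manual submission can also be scripted so that the queue system's stderr is captured. This is only a sketch: submit_manually is a hypothetical helper, the default command is the PBS example from the message above, and it needs Python 3.5+ for subprocess.run (the environment in this thread is Python 2.7).

```python
import subprocess


def submit_manually(launch_dir, command=("qsub", "FW_submit.script")):
    """Run the submit command inside launch_dir.

    Returns (returncode, stdout, stderr) so that any error text printed by
    the queuing system can be inspected.
    """
    result = subprocess.run(
        list(command),
        cwd=launch_dir,              # same effect as "cd launch_dir" first
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,     # decode output to str
    )
    return result.returncode, result.stdout, result.stderr
```

For a SLURM cluster the command would presumably be ("sbatch", "FW_submit.script") instead.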

Hi Anubhav,

Correct me if I’m wrong, but I think the queue launcher is crashing before creating the launch directory. It looks like '/atlas/u/jkuck/rbpf_fireworks/block_2017-02-08-17-35-21-007249/launcher_2017-02-08-18-54-39-731710' is the directory created by the successful submission.

The problem seems to be that somehow launchpad.run_exists(fworker) is evaluating to True in the while loop in rapidfire() in queue_launcher.py, but then False in launch_rocket_to_queue().

Best,
Jonathan
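
To make the suspected pattern concrete, here is a toy check-then-act sketch. It is not the FireWorks code: ToyLaunchPad, claim, launch_strict and launch_tolerant are invented for illustration, and the idea that something changes the job store between the two checks is an assumption that nothing in this thread confirms. It only shows how a check that passes, followed by a second lookup that comes back empty, leads to the RuntimeError seen in the log, and how treating the empty lookup as "nothing to do" would avoid the crash.

```python
class ToyLaunchPad:
    """In-memory stand-in for a job store; NOT the FireWorks LaunchPad."""

    def __init__(self, ready):
        self.ready = list(ready)

    def run_exists(self):
        return bool(self.ready)

    def claim(self):
        """Take the next ready job, or return None if there is none."""
        return self.ready.pop(0) if self.ready else None


def launch_strict(pad, between_check_and_claim=lambda: None):
    """Check, then claim; an empty claim is treated as a fatal error."""
    if not pad.run_exists():
        return None
    between_check_and_claim()  # window in which the store may change
    job = pad.claim()
    if job is None:
        raise RuntimeError("Launch unsuccessful!")
    return job


def launch_tolerant(pad, between_check_and_claim=lambda: None):
    """Same sequence, but an empty claim just means 'skip this round'."""
    if not pad.run_exists():
        return None
    between_check_and_claim()
    return pad.claim()
```

Passing pad.claim as between_check_and_claim simulates the store being emptied inside the window: launch_strict then raises, while launch_tolerant returns None.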

A bit more info: I’ve replicated the problem on a second cluster with the latest version of fireworks installed (1.4.0). The second cluster uses SLURM instead of PBS. The only observable difference is the line number where the error occurs, because I’m running the newer version of fireworks:

2017-02-08 17:11:05,635 INFO Launching a rocket!

2017-02-08 17:11:05,637 DEBUG getting queue adapter

2017-02-08 17:11:05,673 INFO No jobs exist in the LaunchPad for submission to queue!

2017-02-08 17:11:05,673 ERROR ----|vvv|----

2017-02-08 17:11:05,673 ERROR Error with queue launcher rapid fire!

2017-02-08 17:11:05,674 ERROR Traceback (most recent call last):

File "/home/kuck/.local/lib/python2.7/site-packages/fireworks/queue/queue_launcher.py", line 221, in rapidfire

raise RuntimeError("Launch unsuccessful!")

RuntimeError: Launch unsuccessful!

2017-02-08 17:11:05,675 ERROR ----|^^^|----

Best,

Jonathan

Hi Jonathan

Thanks for the update. I’ll take a closer look tomorrow.

Best

Anubhav


Hi Jonathan,

I am looking over this issue again.

  • I agree: if launchpad.run_exists() evaluates to True inside rapidfire() but then evaluates to False inside launch_rocket_to_queue() a short while later, you would see the error trace you posted.

  • This could certainly happen if, for example, the following sequence of events occurred:

      1. There is a READY job in the LaunchPad.

      2. That job gets submitted to the queue successfully, but it is still READY since we are not in reservation mode (that’s fine).

      3. The rapidfire() code loops again, sees the same READY job in the LaunchPad, and calls launch_rocket_to_queue() to submit another queue job.

      4. However, before launch_rocket_to_queue() reaches its own launchpad.run_exists() check, the job already queued has started RUNNING. Thus, between the two calls to launchpad.run_exists(), the FW went from READY to RUNNING, leaving no jobs to run when the second call happened.

Do you think this is the sequence of steps that is occurring?
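The sequence above is a check-then-act race. Here is a minimal sketch of it, using a hypothetical in-memory stand-in for the LaunchPad. ToyLaunchPad, one_pass() and the between_checks hook are made up for illustration; this is not the FireWorks API or its source, only the shape of the two checks:

```python
class ToyLaunchPad:
    """Hypothetical in-memory stand-in for the LaunchPad; not the FireWorks API."""

    def __init__(self, states):
        self.states = dict(states)  # fw_id -> state string

    def run_exists(self):
        # True if at least one firework is still READY.
        return any(state == "READY" for state in self.states.values())

    def start_running(self, fw_id):
        # Simulates a previously queued job starting on a compute node.
        self.states[fw_id] = "RUNNING"


def launch_rocket_to_queue(launchpad):
    # Second check, made inside the launch call.
    if not launchpad.run_exists():
        return False  # corresponds to "No jobs exist in the LaunchPad ..."
    return True  # the queue script would be written and submitted here


def one_pass(launchpad, between_checks=None):
    # First check, made in the outer loop.
    if not launchpad.run_exists():
        return "nothing to launch"
    if between_checks is not None:
        between_checks()  # anything that happens between the two checks
    if not launch_rocket_to_queue(launchpad):
        raise RuntimeError("Launch unsuccessful!")
    return "submitted"


# No interleaving: both checks agree.
quiet = ToyLaunchPad({1: "READY"})
print(one_pass(quiet))  # submitted

# The already-queued job for fw 1 starts between the two checks.
racy = ToyLaunchPad({1: "READY"})
try:
    one_pass(racy, between_checks=lambda: racy.start_running(1))
except RuntimeError as exc:
    print("crashed:", exc)  # crashed: Launch unsuccessful!
```

The second case only fails because the state changes in the small window between the two checks, which would explain why the crash shows up intermittently after many launches rather than on every pass.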

I see two ways forward:

  1. If a READY job “disappears” by the time the queue job is about to be submitted, simply consider the current iteration of rapidfire() to be finished.

  2. Try to count jobs so that the same READY job doesn’t lead to 2+ queue submissions. This would potentially have some benefits in creating 1:1 mappings of jobs to queue submissions, although it would be very difficult to prevent two simultaneous qlaunch processes (e.g. on different machines/workers) from colliding.

Solution (1) is certainly easier to do and so I implemented it.
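A minimal sketch of what option (1) amounts to, again with a hypothetical in-memory stand-in (ToyLaunchPad and tolerant_pass() are illustrative names, and this is not the actual v1.4.1 change): if the READY job is gone by the time of the second check, the pass ends quietly instead of raising.

```python
class ToyLaunchPad:
    """Hypothetical stand-in for the LaunchPad; not the FireWorks API."""

    def __init__(self, states):
        self.states = dict(states)  # fw_id -> state string

    def run_exists(self):
        # True if at least one firework is still READY.
        return any(state == "READY" for state in self.states.values())


def tolerant_pass(launchpad, between_checks=None):
    """One pass of a launcher loop that tolerates a vanished READY job."""
    if not launchpad.run_exists():
        return "nothing to launch"
    if between_checks is not None:
        between_checks()  # e.g. a queued job starts and claims the firework
    if not launchpad.run_exists():
        # The READY job disappeared between the checks: treat this
        # iteration as finished rather than as a failed launch.
        return "job vanished, iteration finished"
    return "submitted"


lp = ToyLaunchPad({1: "READY"})
result = tolerant_pass(lp, between_checks=lambda: lp.states.update({1: "RUNNING"}))
print(result)  # job vanished, iteration finished
```

This does nothing to stop the same READY job from triggering a second queue submission; it only keeps the launcher alive when that happens, which is the trade-off against option (2).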

Please try FWS v1.4.1 (just released) and let me know if this fixes it.

Anubhav


Hi Anubhav,

That sequence of events sounds like the problem to me. I’ve tried again with FWS v1.4.1 and, tentatively, the problem seems to be fixed. Thanks a lot! I do have a couple of additional questions:

- I’m using an Anaconda virtual environment on one of the clusters I have access to, but the latest version of fireworks I found is 1.3.9. Is there a way to get access to the latest version?

- When I run a workflow, a folder named something like “block_2017-02-13-22-14-42-132705” is created. Inside it is a folder for each firework, with names like “launcher_2017-02-13-22-15-50-842035”. When debugging, I’d like to inspect the error file in the folder that corresponds to a particular firework, identified by something I can look up in my database, such as the fw_id. Is there a way to rename the “launcher…” folders with fw_ids?

Thanks,

Jonathan


Hi Jonathan,

Great to hear the queue launcher issue seems fixed (and thanks for pointing out the issue).

For the other two items, can you submit separate tickets? This will help keep things organized for people looking for answers to common questions.

Best

Anubhav


Good point, done.

Jonathan
