Error writing/submitting queue script!

Hello, all

I ran into an error when launching a workflow into the queue, and after several tries I still have no idea how to solve it. I would much appreciate any suggestions. Here is the specific information:

  • error message

`(atomate) [umjzhh@psn004 wf_test]$ slaunch
Successfully loaded your FW_config.yaml!
2018-03-11 22:16:25,403 INFO getting queue adapter
2018-03-11 22:16:25,406 INFO Found previous block, using /home/export/online1/umjzhh/calculations/block_2018-03-06-14-33-45-310798
2018-03-11 22:16:25,428 INFO 0 jobs in queue. Maximum allowed by user: 0
2018-03-11 22:16:25,461 INFO Launching a rocket!
2018-03-11 22:16:25,615 INFO reserved FW with fw_id: 4
2018-03-11 22:16:25,627 INFO Created new dir /home/export/online1/umjzhh/calculations/block_2018-03-06-14-33-45-310798/launcher_2018-03-11-14-16-25-615544
2018-03-11 22:16:25,631 INFO moving to launch_dir /home/export/online1/umjzhh/calculations/block_2018-03-06-14-33-45-310798/launcher_2018-03-11-14-16-25-615544
2018-03-11 22:16:25,648 INFO submitting queue script
2018-03-11 22:16:25,674 ERROR ----|vvv|----
2018-03-11 22:16:25,675 ERROR Error writing/submitting queue script!
2018-03-11 22:16:25,678 ERROR Traceback (most recent call last):
  File "/home/export/online1/umjzhh/opt/anaconda3/lib/python3.6/site-packages/fireworks/queue/queue_launcher.py", line 136, in launch_rocket_to_queue
    raise RuntimeError('queue script could not be submitted, check queue '
RuntimeError: queue script could not be submitted, check queue script/queue adapter/queue server status!

2018-03-11 22:16:25,679 ERROR ----|^^^|----
2018-03-11 22:16:25,680 INFO Un-reserving FW with fw_id, launch_id: 4, 2
2018-03-11 22:16:25,721 ERROR ----|vvv|----
2018-03-11 22:16:25,722 ERROR Error with queue launcher rapid fire!
2018-03-11 22:16:25,723 ERROR Traceback (most recent call last):
  File "/home/export/online1/umjzhh/opt/anaconda3/lib/python3.6/site-packages/fireworks/queue/queue_launcher.py", line 236, in rapidfire
    raise RuntimeError("Launch unsuccessful!")
RuntimeError: Launch unsuccessful!

2018-03-11 22:16:25,723 ERROR ----|^^^|----`

  • Queue management system: custom LSF

This system was developed to imitate standard LSF, so some of its commands differ.

  • Submitting a job script
    `bsub -f FW_submit.script`. The `-f` flag is indispensable here; it seems that standard LSF does not need `-f`.

  • Format of FW_submit.script
    I changed the template fireworks/user_objects/queue_adapters/LoadSharingFacility_template.txt to suit this custom LSF. The final FW_submit.script produced by atomate looks like this:
    `batch=1
    nodes=1
    ntasks-per-node=16
    queue=q_x86_expr
    job-name=Si-structure_optimiz
    output=Si-structure_optimiz.out
    command=rlaunch -c /home/export/online1/umjzhh/atomate/config singleshot --fw_id 4`
    Also, when I submit it manually with `bsub -f FW_submit.script`, the job is submitted, but the Firework soon becomes FIZZLED.

  • Query command
    The options of `bjobs` seem to differ from standard LSF. The usage of the custom LSF `bjobs` is:
    `(atomate) [umjzhh@psn004 launcher_2018-03-11-14-16-25-615544]$ bjobs -h
    Usage: bjobs [-w] [-l] [-J jobname] [-q queue] [-u user] [-a|-d|-e|-p|-r] [jobid]`

    I found that if I don't change fireworks/user_objects/queue_adapters/common_adapter.py as follows, `slaunch` gets stuck and nothing is printed. However, I don't know whether my change is correct or whether it could cause other problems.
    `elif self.q_type == "LoadSharingFacility":
        # use no header and the wide format so that there is one line per job, and display only running and pending jobs
        # status_cmd.extend(['-p', '-r', '-o', 'jobID user queue', '-noheader', '-u', username])
        status_cmd.extend(['-u', username])`

  • Other information

  • About `slaunch`
    I added the following function to ~/.bash-mpenvrc:
    `function slaunch() {
      cd ${MP_LAUNCH_ROOT:=$HOME/online1/launcher}
      n_launcher=${1:-1}
      qlaunch -r rapidfire --nlaunches $n_launcher -b 10000
      cd $OLDPWD
    }`

  • Platform: Linux version 2.6.32-431.29.2.lustre.el6.x86_64 (root@gio017) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-11) (GCC) ) #1 SMP Fri Jul 31 09:39:58 CST 2015
    Python version: Python 3.6.4 :: Anaconda, Inc.
    atomate: downloaded from GitHub last week
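The custom FW_submit.script format above is plain key=value lines. Below is a minimal sketch of how a modified LoadSharingFacility_template.txt could produce it, assuming FireWorks fills queue templates with a `string.Template` subclass whose delimiter is `$$` and then drops every line whose placeholder was left unset (this mirrors, as far as we can tell, `QScriptTemplate` and `get_script_str` in fireworks/queue/queue_adapter.py; check your installed version). The placeholder names `nnodes`, `ppnode`, `queue`, `job_name` and `rocket_launch` are illustrative and must match the keys in your queue adapter YAML.

```python
import string

# Hypothetical template for the custom LSF key=value script format.
CUSTOM_LSF_TEMPLATE = """batch=1
nodes=$${nnodes}
ntasks-per-node=$${ppnode}
queue=$${queue}
job-name=$${job_name}
output=$${job_name}.out
command=$${rocket_launch}"""


class QScriptTemplate(string.Template):
    # Placeholders look like $${name}, as in the FireWorks templates.
    delimiter = "$$"


def render_script(template: str, params: dict) -> str:
    """Fill the template, then drop lines whose placeholders stayed unset."""
    filled = QScriptTemplate(template).safe_substitute(params)
    return "\n".join(line for line in filled.split("\n") if "$$" not in line)
```

Rendering with `nnodes=1`, `ppnode=16`, `queue="q_x86_expr"`, `job_name="Si-structure_optimiz"` and the `rlaunch ... --fw_id 4` command reproduces the script shown above; omitting a key (say `ppnode`) removes that line instead of leaving `$${ppnode}` in the file.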

Thank you very much for your help!

Best Regards!

Hi,

Unfortunately, I have never worked with LSF, so I am unsure of some of its quirks and of the differences between its various implementations.

(i) To get the script submitted in the first place, can you check whether this "patch" works for now:

fireworks/user_objects/queue_adapters/common_adapter.py:211

change:

p = subprocess.Popen([submit_cmd], stdin=inputFile, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

to:

p = subprocess.Popen([submit_cmd, "-f"], stdin=inputFile, stdout=subprocess.PIPE, stderr=subprocess.PIPE)


If that works, it at least means that submission to the queue is being handled correctly. We can deal with the FIZZLED state in a separate thread with more details about it.
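The patch above adds `-f` but still feeds the script on stdin, whereas the manual command that works is `bsub -f FW_submit.script`, with the file name after the flag. If the patch alone does not help, the custom `bsub` may need the path as an argument. Here is a minimal sketch of both call styles; `build_submit_argv` and `submit_script` are hypothetical helpers for illustration, not FireWorks functions, and the assumption that `-f` takes the script path comes only from the manual command above.

```python
import os
import subprocess
from typing import List, Optional, Tuple


def build_submit_argv(submit_cmd: str, script_file: str,
                      file_flag: Optional[str] = None) -> Tuple[List[str], bool]:
    """Return (argv, use_stdin) for submitting script_file.

    file_flag=None: stock style, run `bsub` and feed the script on stdin.
    file_flag="-f": assumed custom-LSF style, `bsub -f FW_submit.script`.
    """
    if file_flag is None:
        return [submit_cmd], True
    return [submit_cmd, file_flag, script_file], False


def submit_script(submit_cmd: str, script_file: str,
                  file_flag: Optional[str] = None) -> Tuple[int, str, str]:
    """Run the submit command and return (returncode, stdout, stderr)."""
    if not os.path.exists(script_file):
        raise ValueError("Cannot find script file: {}".format(script_file))
    argv, use_stdin = build_submit_argv(submit_cmd, script_file, file_flag)
    if use_stdin:
        # Script contents go to the command's standard input.
        with open(script_file) as input_file:
            p = subprocess.Popen(argv, stdin=input_file,
                                 stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                                 universal_newlines=True)
            out, err = p.communicate()
    else:
        # Script path is passed on the command line after the flag.
        p = subprocess.Popen(argv, stdout=subprocess.PIPE,
                             stderr=subprocess.PIPE, universal_newlines=True)
        out, err = p.communicate()
    return p.returncode, out, err
```

For example, `submit_script("bsub", "FW_submit.script", file_flag="-f")` would run `bsub -f FW_submit.script`. The job ID still has to be parsed from stdout, and the output format of this custom `bsub` is not known here.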

(ii) For the status / query command, your modification is correct given your custom LSF.
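Since `-o` and `-noheader` are not available, the custom `bjobs -u <user>` presumably prints a header row plus its default columns, and its usage line lists `-a|-d|-e|-p|-r` as alternatives, which suggests `-p` and `-r` cannot be combined as in the stock command. Whatever counts jobs from this output then has to skip the header. Here is a minimal sketch assuming a standard-LSF-like layout (`JOBID USER STAT QUEUE ...`) with a first row beginning with `JOBID`; `build_status_cmd` and `count_active_jobs` are hypothetical helpers, not FireWorks' own parser, so compare them against what `_parse_njobs` in your common_adapter.py actually does.

```python
from typing import List

# Job states treated as "in the queue". These are the standard LSF names;
# assumed (not verified) to be the same on the custom LSF.
ACTIVE_STATES = {"PEND", "RUN"}


def build_status_cmd(username: str) -> List[str]:
    # The custom bjobs accepts -u but not -o or -noheader.
    return ["bjobs", "-u", username]


def count_active_jobs(output: str) -> int:
    """Count pending/running jobs in `bjobs -u <user>` output."""
    count = 0
    for line in output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row (assumed to start with JOBID).
        if not fields or fields[0] == "JOBID":
            continue
        # Assumes the standard column order: JOBID USER STAT QUEUE ...
        if len(fields) >= 3 and fields[2] in ACTIVE_STATES:
            count += 1
    return count
```

Jobs in other states (for example DONE or EXIT, if the custom LSF uses the standard names) and message lines that do not fit the column layout are ignored.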

It is up to you whether to leave these "hacks" in place in your FireWorks installation or to submit a revised LSF implementation specific to your cluster to the main repo, e.g., one that adds a new queue option such as "LoadSharingFacility_<name>".

Best,

Anubhav

The original queue adapter was contributed by Zachary Ulissi with some further contributions by Jacob Russell Boes; you might be able to track them down.

<details class='elided'>
<summary title='Show trimmed content'>&#183;&#183;&#183;</summary>

On Monday, March 12, 2018 at 10:04:07 PM UTC-7, [email protected] wrote:

</details>