How to use two queue adapters for FireWorks with dependencies?

Dear FireWorks developers and community,

I’m putting together a FireWorks demo for an upcoming workshop at NERSC (“Workflows/FireWorks training at NERSC”, April 12, 2023) and I have a question for you all.

I’d like to demonstrate a 3-step workflow with multiple FireWorks with dependencies:

fws:
- fw_id: 1
  spec:
    _tasks:
    - _fw_name: step1
      script: srun step_1_diabetes_preprocessing.py
- fw_id: 2
  spec:
    _tasks:
    - _fw_name: step2
      script: srun step_2_diabetes_correlation.py
- fw_id: 3
  spec:
    _tasks:
    - _fw_name: step3
      script: srun step_3_diabetes_postprocessing.py
links:
  1:
  - 2
  2:
  - 3
metadata: {}

Additionally, I’d like to use two queue adapters so that Step 2 can use more resources than Step 1 and Step 3:

  • Step 1, 1 node using my_queueadapter1.yaml
  • Step 2, 2 nodes using my_queueadapter2.yaml
  • Step 3, 1 node using my_queueadapter1.yaml

I am struggling with how to accomplish both of these things in a single FireWorks workflow. I see that this situation is described in step 3 of the documentation, but I don’t understand how I should set up the workflow to accomplish it. Adding the _queueadapter parameter to each entry in the spec would be a nice option, but I see that only works in reservation mode. My other idea was to write a separate fw_task.yaml file for each task and add them to the queue like

qlaunch singleshot -q my_queueadapter1.yaml step_1_fw.yaml
qlaunch singleshot -q my_queueadapter2.yaml step_2_fw.yaml
qlaunch singleshot -q my_queueadapter1.yaml step_3_fw.yaml

but the issue is that I don’t know how to specify dependencies between tasks if I do it like this.

Can you please offer some advice? I’d really appreciate any help or examples you could point me towards.

Thank you very much,
Laurie

Hi Laurie,

You should be able to set up each of the queue adapter files to use different fireworkers. The fireworkers in turn determine which jobs to pull.

First, you want to set up two my_fworker.yaml files. The first, my_fworker1.yaml, is configured to only pull FWs like Step 1 and Step 3. The second, my_fworker2.yaml, is configured to only pull FWs like Step 2. You can set up these two files using the guidance in “Controlling the directory and Worker of execution” in the FireWorks documentation. See the section “Controlling the Worker that executes a Firework” for options on how to connect my_fworker1.yaml to FWs 1 and 3, and my_fworker2.yaml to FW 2.
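For example, using the “category” method from that page (the worker names and category values here are just placeholders), the two worker files might look like:

# my_fworker1.yaml
name: one node fireworker
category: onenode
query: '{}'

# my_fworker2.yaml
name: two node fireworker
category: twonode
query: '{}'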

Next, you want your qadapter files to be linked to the appropriate fireworkers. That is, my_qadapter1.yaml should use my_fworker1.yaml so it only pulls jobs like FWs 1 and 3, and my_qadapter2.yaml should use my_fworker2.yaml so it only pulls jobs like FW 2. To do this you should do two things:

  1. When executing qlaunch, use both the -w and -q options together with the right correspondence, e.g.:

     qlaunch -q my_qadapter1.yaml -w my_fworker1.yaml ...
     qlaunch -q my_qadapter2.yaml -w my_fworker2.yaml ...

  2. Within each my_qadapter.yaml file, make sure that when the job “wakes up” it only pulls jobs from the correct FireWorker. So the rlaunch command inside my_qadapter1.yaml should contain the -w my_fworker1.yaml option, and the rlaunch command inside my_qadapter2.yaml should contain the -w my_fworker2.yaml option.
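Concretely, the two rocket_launch lines might look like this (assuming a my_launchpad.yaml as in the standard tutorials):

# inside my_qadapter1.yaml
rocket_launch: rlaunch -w my_fworker1.yaml -l my_launchpad.yaml singleshot

# inside my_qadapter2.yaml
rocket_launch: rlaunch -w my_fworker2.yaml -l my_launchpad.yaml singleshot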

Hopefully this all makes sense; let us know how it goes!


Btw, the other option (if you just want to execute a single qlaunch command rather than two of them) is to use reservation mode, but you mentioned that wasn’t the solution you wanted, so I didn’t cover it here.

Dear Anubhav,

Thank you very much for your fast and detailed reply, especially on a Friday afternoon!

I think I understood the second part of what you mentioned better than the first. I created a my_qadapter1.yaml and a my_qadapter2.yaml and made sure to include the name of the appropriate my_fworker.yaml file in each:

low-training/FireWorks/NERSC> cat my_qadapter1.yaml 
_fw_name: CommonAdapter
_fw_q_type: SLURM
rocket_launch: rlaunch -w my_fworker1.yaml -l my_launchpad.yaml singleshot
constraint: cpu
nodes: 1
ntasks: 1
account: nstaff
walltime: '00:05:00'
queue: debug
job_name: null
logdir: null
pre_rocket: null
post_rocket: null

I think this part makes sense, but please let me know if I got it wrong.

Next, I have created both a my_fworker1.yaml and a my_fworker2.yaml. I am a bit unsure of whether I have set _fworker and name correctly. I am trying to follow Method 1 on the “Controlling the worker” page:

low-training/FireWorks/NERSC> cat my_fworker1.yaml 
fws:
- fw_id: 1
  name: one_node
  spec:
    _fworker: one_node
    _tasks:
    - _fw_name: ScriptTask
      script: srun step_1_diabetes_preprocessing.py
- fw_id: 3
  name: one_node
  spec:
    _fworker: one_node
    _tasks:
    - _fw_name: ScriptTask
      script: srun step_3_diabetes_postprocessing.py
links:
  1:
  - 2
  2:
  - 3
metadata: {}

I wasn’t sure if the name/_fworker combination should be the same or different for tasks 1 and 3. Right now I am getting a KeyError with this setup, so I think what I’ve done is wrong:

low-training/FireWorks/NERSC> qlaunch -q my_qadapter1.yaml -w my_fworker1.yaml

# cutting out most of the traceback

  File "/global/common/software/das/stephey/conda/conda_envs/fireworks/lib/python3.9/site-packages/fireworks/core/fworker.py", line 55, in from_dict
    return FWorker(m_dict["name"], m_dict["category"], json.loads(m_dict["query"]), m_dict.get("env"))
KeyError: 'name'

Finally for completeness, here is my my_fworker2.yaml

low-training/FireWorks/NERSC> cat my_fworker2.yaml
fws:
- fw_id: 2
  name: step2
  spec:
    _fworker: step2
    _tasks:
    - _fw_name: ScriptTask
      script: srun step_2_diabetes_correlation.py
links:
  1:
  - 2
  2:
  - 3
metadata: {}

Final question: have I handled the dependencies correctly? Now that I’m splitting the tasks across two files, I wasn’t sure if I needed to modify the links part of each file.

Thank you again for your help,
Laurie

Hi Laurie,

The way you have set up the my_fworker1.yaml file, I believe there may be some confusion between FireWorks and FireWorkers (they sound really similar, but they are actually very different…)

A FireWorker does two things:

  • controls what jobs should be run on a machine
  • gives some machine-specific settings that may be needed for a job

It does not contain scripts to run, tasks, or anything else like that. So the first thing you need to do is start your my_fworker1.yaml file from scratch so that it doesn’t have a workflow in it. You should be able to look at:

https://materialsproject.github.io/fireworks/worker_tutorial.html

to see how to set up a basic my_fworker.yaml file. Then pick one of the methods I mentioned for connecting the specific FireWorks with the associated FireWorker. That link again is:

https://materialsproject.github.io/fireworks/controlworker.html

There are a few different ways to accomplish what you mentioned; to keep things simple, I suggest the “category” method, where there are two categories of jobs.
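With that method (using a hypothetical category name here), the worker file and the workflow are paired through a category value and a _category spec key:

# in my_fworker1.yaml
category: onenode

# in the workflow file, in the spec of each FW that should run on that worker
spec:
  _category: onenode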

Hopefully those links, plus the knowledge that the my_fworker.yaml file should be completely different from the workflow definitions, will get you on the right track…


Dear Anubhav,

Thank you very much for your quick reply and helpful information!!! I really appreciate it.

You were right: I was fundamentally misunderstanding FireWork vs. FireWorker. Thank you for helping me fix this and arrive at the correct understanding that the FireWorker is a kind of job-configuration file, while the FireWork describes the workflow itself.

With this missing piece, I was finally able to run my desired workflow, so thank you!

I do have one final question, though. With my current setup, I noticed that I have to issue 3 qlaunch commands to get all of my FireWorks to run.

qlaunch -q my_qadapter1.yaml -w my_fworker1.yaml singleshot #launches step 1
qlaunch -q my_qadapter2.yaml -w my_fworker2.yaml singleshot #launches step 2
qlaunch -q my_qadapter1.yaml -w my_fworker1.yaml singleshot #needed to launch step 3, which is in state WAITING

I was naively expecting that I would only have to issue two qlaunches, one for each queue adapter configuration. Is this a consequence of the workflow structure I have chosen, or is there some additional configuration setting/strategy I should use?

For completeness, here is my whole workflow:

stephey@perlmutter:login02:~> module load python
stephey@perlmutter:login02:~> conda activate fireworks
(/global/common/software/das/stephey/conda/conda_envs/fireworks) stephey@perlmutter:login02:~> cd /pscratch/sd/s/stephey/DOE-HPC-workflow-training/FireWorks/NERSC
(/global/common/software/das/stephey/conda/conda_envs/fireworks) stephey@perlmutter:login02:/pscratch/sd/s/stephey/DOE-HPC-workflow-training/FireWorks/NERSC> lpad reset
Are you sure? This will RESET 1 workflows and all data. (Y/N)y
2023-04-02 20:52:07,940 INFO Performing db tune-up
2023-04-02 20:52:08,001 INFO LaunchPad was RESET.
(/global/common/software/das/stephey/conda/conda_envs/fireworks) stephey@perlmutter:login02:/pscratch/sd/s/stephey/DOE-HPC-workflow-training/FireWorks/NERSC> cat fw_diabetes_wf.yaml 
fws:
- fw_id: 1
  spec:
    _category: onenode
    _tasks:
    - _fw_name: ScriptTask
      script: srun python step_1_diabetes_preprocessing.py
- fw_id: 2
  spec:
    _category: twonode
    _tasks:
    - _fw_name: ScriptTask
      script: srun -n 10 python step_2_diabetes_correlation.py
- fw_id: 3
  spec:
    _category: onenode
    _tasks:
    - _fw_name: ScriptTask
      script: srun python step_3_diabetes_postprocessing.py
links:
  1:
  - 2
  2:
  - 3
metadata: {}
(/global/common/software/das/stephey/conda/conda_envs/fireworks) stephey@perlmutter:login02:/pscratch/sd/s/stephey/DOE-HPC-workflow-training/FireWorks/NERSC> lpad add fw_diabetes_wf.yaml
2023-04-02 20:59:16,444 INFO Added a workflow. id_map: {1: 1, 2: 2, 3: 3}
(/global/common/software/das/stephey/conda/conda_envs/fireworks) stephey@perlmutter:login02:/pscratch/sd/s/stephey/DOE-HPC-workflow-training/FireWorks/NERSC> cat my_fworker1.yaml
name: one node fireworker
category: onenode
query: '{}'
(/global/common/software/das/stephey/conda/conda_envs/fireworks) stephey@perlmutter:login02:/pscratch/sd/s/stephey/DOE-HPC-workflow-training/FireWorks/NERSC> cat my_qadapter1.yaml 
_fw_name: CommonAdapter
_fw_q_type: SLURM
rocket_launch: rlaunch -w my_fworker1.yaml -l my_launchpad.yaml singleshot
constraint: cpu
nodes: 1
ntasks: 1
account: nstaff
walltime: '00:05:00'
queue: debug
job_name: null
logdir: null
pre_rocket: null
post_rocket: null
(/global/common/software/das/stephey/conda/conda_envs/fireworks) stephey@perlmutter:login02:/pscratch/sd/s/stephey/DOE-HPC-workflow-training/FireWorks/NERSC> qlaunch -q my_qadapter1.yaml -w my_fworker1.yaml singleshot
2023-04-02 21:00:33,387 INFO moving to launch_dir /pscratch/sd/s/stephey/DOE-HPC-workflow-training/FireWorks/NERSC
2023-04-02 21:00:33,390 INFO submitting queue script
2023-04-02 21:00:34,003 INFO Job submission was successful and job_id is 6895033
(/global/common/software/das/stephey/conda/conda_envs/fireworks) stephey@perlmutter:login02:/pscratch/sd/s/stephey/DOE-HPC-workflow-training/FireWorks/NERSC> lpad get_fws
[
    {
        "fw_id": 1,
        "created_on": "2023-04-03T03:59:16.437426",
        "updated_on": "2023-04-03T04:00:43.350942",
        "state": "COMPLETED",
        "name": "Unnamed FW"
    },
    {
        "fw_id": 2,
        "created_on": "2023-04-03T03:59:16.437581",
        "updated_on": "2023-04-03T04:00:43.352286",
        "state": "READY",
        "name": "Unnamed FW"
    },
    {
        "fw_id": 3,
        "created_on": "2023-04-03T03:59:16.437671",
        "updated_on": "2023-04-03T03:59:16.437671",
        "name": "Unnamed FW",
        "state": "WAITING"
    }
]
(/global/common/software/das/stephey/conda/conda_envs/fireworks) stephey@perlmutter:login02:/pscratch/sd/s/stephey/DOE-HPC-workflow-training/FireWorks/NERSC> qlaunch -q my_qadapter2.yaml -w my_fworker2.yaml singleshot
2023-04-02 21:01:03,996 INFO moving to launch_dir /pscratch/sd/s/stephey/DOE-HPC-workflow-training/FireWorks/NERSC
2023-04-02 21:01:03,997 INFO submitting queue script
2023-04-02 21:01:04,101 INFO Job submission was successful and job_id is 6895035
(/global/common/software/das/stephey/conda/conda_envs/fireworks) stephey@perlmutter:login02:/pscratch/sd/s/stephey/DOE-HPC-workflow-training/FireWorks/NERSC> lpad get_fws
[
    {
        "fw_id": 1,
        "created_on": "2023-04-03T03:59:16.437426",
        "updated_on": "2023-04-03T04:00:43.350942",
        "state": "COMPLETED",
        "name": "Unnamed FW"
    },
    {
        "fw_id": 2,
        "created_on": "2023-04-03T03:59:16.437581",
        "updated_on": "2023-04-03T04:01:09.537061",
        "state": "RUNNING",
        "name": "Unnamed FW"
    },
    {
        "fw_id": 3,
        "created_on": "2023-04-03T03:59:16.437671",
        "updated_on": "2023-04-03T03:59:16.437671",
        "name": "Unnamed FW",
        "state": "WAITING"
    }
]
(/global/common/software/das/stephey/conda/conda_envs/fireworks) stephey@perlmutter:login02:/pscratch/sd/s/stephey/DOE-HPC-workflow-training/FireWorks/NERSC> qlaunch -q my_qadapter1.yaml -w my_fworker1.yaml singleshot
2023-04-02 21:01:22,716 INFO moving to launch_dir /pscratch/sd/s/stephey/DOE-HPC-workflow-training/FireWorks/NERSC
2023-04-02 21:01:22,718 INFO submitting queue script
2023-04-02 21:01:22,810 INFO Job submission was successful and job_id is 6895040
(/global/common/software/das/stephey/conda/conda_envs/fireworks) stephey@perlmutter:login02:/pscratch/sd/s/stephey/DOE-HPC-workflow-training/FireWorks/NERSC> lpad get_fws
[
    {
        "fw_id": 1,
        "created_on": "2023-04-03T03:59:16.437426",
        "updated_on": "2023-04-03T04:00:43.350942",
        "state": "COMPLETED",
        "name": "Unnamed FW"
    },
    {
        "fw_id": 2,
        "created_on": "2023-04-03T03:59:16.437581",
        "updated_on": "2023-04-03T04:01:11.994916",
        "state": "COMPLETED",
        "name": "Unnamed FW"
    },
    {
        "fw_id": 3,
        "created_on": "2023-04-03T03:59:16.437671",
        "updated_on": "2023-04-03T04:01:30.222269",
        "state": "COMPLETED",
        "name": "Unnamed FW"
    }
]
(/global/common/software/das/stephey/conda/conda_envs/fireworks) stephey@perlmutter:login02:/pscratch/sd/s/stephey/DOE-HPC-workflow-training/FireWorks/NERSC> 

Thank you again,
Laurie

Hi Laurie,

As always, there are a few different ways to do things depending on what exactly you want to do. The two ways to minimize qlaunch commands are:

(1) Have the qlaunch command submit more than one job and/or
(2) run multiple Fireworks per queue submission, so that a single queued job runs multiple things

The simplest solution to understand is (1), so let’s start with that. If you replace qlaunch singleshot with qlaunch rapidfire, it will submit more than one job to the queue. The rapidfire command can also do things like loop, e.g., submit a set number of jobs to the queue, wait a while, and then submit more, or maintain a certain number of jobs in the queue. See the help docs via qlaunch rapidfire -h for more details on the various options.

Unfortunately, when you don’t run in reservation mode, there is not really a 1:1 correspondence between job submissions and items to run in the database. So you may end up submitting many more queued jobs than can actually be run, and those jobs will wake up at NERSC, find nothing to run, and then quit. That is something you just have to live with if you are not running in reservation mode. Some people actually prefer this behavior because they can submit a bunch of jobs to NERSC and, while those jobs are aging in the queue, populate and/or modify their database of jobs. Anyway, solution #1 is to use qlaunch rapidfire.
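Concretely, each of your two qlaunch commands could become something like this (flag names per qlaunch rapidfire -h; the values here are arbitrary):

qlaunch -q my_qadapter1.yaml -w my_fworker1.yaml rapidfire -m 2 --nlaunches infinite

where -m caps the number of jobs kept in the queue and --nlaunches infinite keeps looping until you stop it.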

Solution #2 is to run multiple jobs per queued submission. You can do this by replacing rlaunch singleshot in your my_qadapter.yaml file with rlaunch rapidfire. However, in your case this will be a bit trickier to get right because you are alternating nodes. I.e., your onenode job will wake up, launch FW1 on that node, and then technically it needs to wait around for FW2 to complete in the twonode job before it can proceed to running FW3 on that node. Depending on the rlaunch rapidfire parameters you use, it might instead quit after running FW1 because there are no eligible FWs to run (FW3 is ineligible because it depends on FW2 completing).
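If you do try it, the rocket_launch line in my_qadapter1.yaml would become something like this (flags per rlaunch rapidfire -h; the timeout value is arbitrary):

rocket_launch: rlaunch -w my_fworker1.yaml -l my_launchpad.yaml rapidfire --timeout 240

so that the queued job keeps pulling eligible FWs until the timeout instead of exiting after one.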

Anyway, hopefully this gets you in the right direction. A lot of this is complicated to explain as a set of instructions, but easier if you can understand the fundamentals of what is going on under the hood with qlaunch and rlaunch.


Dear Anubhav,

Thanks for this information and apologies for the long delay in my reply. I think I have a better understanding of the relationship between qlaunch and rlaunch and also singleshot/rapidfire.

If you are interested, you can see the first draft of the FireWorks NERSC tutorial materials we have prepared for our training event on April 12.

I have a few hopefully final questions. I would like to demonstrate launching this workflow with the first option you described in your previous post (qlaunch rapidfire). I think this option will be appealing to most NERSC users since it makes more efficient use of their node hours. My idea was to launch it with something like

lpad reset
lpad add fw_diabetes_wf.yaml
qlaunch -q my_qadapter1.yaml -w my_fworker1.yaml rapidfire & qlaunch -q my_qadapter2.yaml -w my_fworker2.yaml rapidfire

When I do it like this, though, the first task is COMPLETED but the second task never makes it past READY. Is something conceptually wrong with issuing two rapidfire commands like this?

I have also experimented with running the second command with qlaunch singleshot, and this does work. However, I can’t say I like this option very much since we have to wait to make sure the first task has completed. Is mixing qlaunch rapidfire with qlaunch singleshot the best option for this case, or is there a better approach?

Thank you very much,
Laurie

Hi Anubhav,

Actually, I think I answered my own question. The issue seems to be that I was specifying the my_fworker.yaml file both on the command line and inside the queue adapter, and that was maybe causing a conflict.

Doing this seems to do exactly what I wanted:

lpad reset
lpad add fw_diabetes_wf.yaml 
qlaunch -q my_qadapter1.yaml rapidfire & qlaunch -q my_qadapter2.yaml rapidfire

All of my FireWorks launched and completed successfully with this set of commands.

Further advice is welcome, but thank you so much! I really appreciate all your help in putting together this demo for our tutorial.

Best regards,
Laurie

I’m delighted to see @lastephey here – mere hours before I was going to start googling the same question :smiley: Thank you so much; you and @Anubhav_Jain have helped me figure this out.

In case this is of interest to you or any other readers, I am doing basically the same thing as Laurie, but in Python. The application I’m working on has job parameters that are generated dynamically (e.g., if a run is big, we might want a longer walltime or more resources), so building and submitting these jobs from Python seems the natural thing to do.

Here is a basic example which launches 2 MPI programs (in a linear DAG), with the second program needing twice as many nodes as the first: alcc-recipes/fireworks/test_3_python at jpb/fireworks · JBlaschke/alcc-recipes · GitHub
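The core of that pattern, stripped down (the script names and category values are hypothetical placeholders; the full recipe is in the repo linked above), looks roughly like:

from fireworks import Firework, LaunchPad, ScriptTask, Workflow

# two MPI steps in a linear DAG; the second needs twice the resources
fw1 = Firework(ScriptTask.from_str("srun -n 2 ./mpi_program_1"),
               name="step_1", spec={"_category": "n2"})
fw2 = Firework(ScriptTask.from_str("srun -n 4 ./mpi_program_2"),
               name="step_2", spec={"_category": "n4"})

# the links dict makes fw2 run only after fw1 completes
wf = Workflow([fw1, fw2], {fw1: [fw2]})
LaunchPad.auto_load().add_wf(wf)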

One more thing I want to build is a generator for the my_fworker_$n.yaml files – I just spent 30 minutes not seeing a typo there. Is there a way to specify the FireWorker on the command line instead of in a specific file? E.g., instead of:

rocket_launch="rlaunch -l my_launchpad.yaml -w my_fworker_1.yaml singleshot"

have something like:

rocket_launch=f"rlaunch -l my_launchpad.yaml -w {fworker_str} singleshot"

where fworker_str contains a string representation of the FWorker class?
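(In the meantime, one way to at least generate those files programmatically, assuming FWorker’s standard FWSerializable helpers, might be:)

from fireworks import FWorker

# hypothetical generator for the my_fworker_$n.yaml files;
# to_file() should infer the YAML format from the extension
for n in (1, 4):
    FWorker(name=f"mpi_{n}_fworker", category=f"n{n}").to_file(f"my_fworker_{n}.yaml")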

I think I spoke too soon – my second launch_rocket_to_queue here:

launch_rocket_to_queue(launchpad, FWorker(name="mpi_4_fworker", category="n4"), qadapter_2)

raises the error:

2023-04-09 23:24:19,789 INFO No jobs exist in the LaunchPad for submission to queue!

unless there is already a FireWork that satisfies its requirements. Is it possible that, when the workflow is initialized, the second FireWork is in the WAITING state (as it should be, since it depends on the first one), but for some reason the FireWorker doesn’t stick around until that second FireWork becomes ready?

Any advice would be appreciated.

One more observation: if I add multiple workflows to the LaunchPad, I noticed that the Slurm jobs quit after the first workflow is done. So this made me re-evaluate whether launch_rocket_to_queue is the right thing to do here.

So I switched to rapidfire instead. Works like a charm :slight_smile: – I noticed that it has quite a greedy strategy, though: it “launches” way more rockets than it needs, so while jobs are waiting in the queue, new jobs (which will never be needed) keep being added. I guess this is nice, because those jobs can wait in line for you while you prepare new ones. But ideally I would want to dial this back a bit (think of the situation where a large job takes hours to start, resulting in thousands of “dud” jobs in the queue).

Hi,

I’m trying to figure out what’s resolved and what’s still in question, but here’s my best attempt:

  1. Sorry, I don’t know a way in the current code to dynamically set the FireWorker via a string on the command line. But instead of running a bash command to run qlaunch, you could write a Python script that drives the qlaunch rapidfire machinery, and in that script you can do whatever you want, including setting up different FireWorkers dynamically (see the sketch after this list). Note that the bottom of most tutorials contains Python examples.

  2. If there are too many jobs being launched, you have two options. The first is to use the nlaunches or maxjobs_queue parameters to control the number of things being submitted; see qlaunch rapidfire -h for details. The second is to use reservation mode; see the docs for details.
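Here is a minimal sketch of option (1), assuming the helper names in fireworks.queue.queue_launcher and fireworks.utilities.fw_serializers behave as in the current release:

from fireworks import FWorker, LaunchPad
from fireworks.queue.queue_launcher import rapidfire
from fireworks.utilities.fw_serializers import load_object_from_file

launchpad = LaunchPad.from_file("my_launchpad.yaml")
qadapter = load_object_from_file("my_qadapter1.yaml")  # e.g. a SLURM CommonAdapter

# the FWorker is built dynamically in code rather than read from a file
fworker = FWorker(name="one node fireworker", category="onenode")

# submit queued jobs for this worker/adapter pair
rapidfire(launchpad, fworker, qadapter, nlaunches=1, sleep_time=60)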

Laurie, thank you for putting together the tutorial. Regrettably, I don’t know if I’ll be able to review it in time, but I’ll pass it along in case someone else is able to take a look.

Btw, one more note: if you find yourself dynamically creating FireWorkers, it’s highly likely that the better solution is to run in reservation mode and have the fireworks themselves include the queue parameters they need. Then a single qlaunch/FireWorker can handle heterogeneous queue parameters. More info is in the docs; however, for this thread I believe the OP specified they didn’t want to use reservation mode.
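(In reservation mode, each Firework can carry its own queue overrides via the reserved _queueadapter spec key; a hypothetical example:)

spec:
  _queueadapter:
    nodes: 2
    walltime: '00:10:00'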

Hi Anubhav,

Right: since we are teaching FireWorks to what we expect will be a novice audience, I didn’t want to cover reservation mode.

No need to provide any feedback; we just wanted to share with you what we’ve developed. We’re continuing to update our repo, but you’ll eventually be able to find all our slides and training materials here: DOE-HPC-workflow-training/FireWorks at main · CrossFacilityWorkflows/DOE-HPC-workflow-training · GitHub

For now they are still in this branch: GitHub - CrossFacilityWorkflows/DOE-HPC-workflow-training at nersc-fireworks

Thank you again,
Laurie

Hi Anubhav,

Yes! Setting sleep_time=60 does the trick for me.

Also I have opened a PR to include a --json flag which lets the user pass the LaunchPad and FWorker objects to rlaunch via a json-formatted string: Enable launchpad and fireworker config be passed via json string by JBlaschke · Pull Request #499 · materialsproject/fireworks · GitHub

FireWorks already has all the machinery to format its internals as JSON strings, so it wasn’t complicated.