How to use two queue adapters for FireWorks with dependencies?

Dear FireWorks developers and community,

I’m putting together a FireWorks demo for an upcoming workshop at NERSC (“Workflows/FireWorks training at NERSC”, April 12, 2023) and I have a question for you all.

I’d like to demonstrate a 3-step workflow with multiple FireWorks with dependencies:

fws:
- fw_id: 1
  spec:
    _tasks:
    - _fw_name: step1
      script: srun step_1_diabetes_preprocessing.py
- fw_id: 2
  spec:
    _tasks:
    - _fw_name: step2
      script: srun step_2_diabetes_correlation.py
- fw_id: 3
  spec:
    _tasks:
    - _fw_name: step3
      script: srun step_3_diabetes_postprocessing.py
links:
  1:
  - 2
  2:
  - 3
metadata: {}

Additionally, I’d like to use two queue adapters so that Step 2 can use more resources than Step 1 and Step 3:

  • Step 1, 1 node using my_queueadapter1.yaml
  • Step 2, 2 nodes using my_queueadapter2.yaml
  • Step 3, 1 node using my_queueadapter1.yaml

I am struggling with how to accomplish both of these things in a single FireWorks workflow. I see that this situation is described in step 3 of the documentation, but I don’t understand how I should set up the workflow to accomplish it. Adding the _queueadapter parameter to each entry in the spec would be a nice option, but I see that only works in reservation mode. My other idea was to write a separate fw_task.yaml file for each task and add them to the queue like

qlaunch singleshot -q my_queueadapter1.yaml step_1_fw.yaml
qlaunch singleshot -q my_queueadapter2.yaml step_2_fw.yaml
qlaunch singleshot -q my_queueadapter1.yaml step_3_fw.yaml

but the issue is that I don’t know how to specify dependencies between tasks if I do it like this.

Can you please offer some advice? I’d really appreciate any help or examples you could point me towards.

Thank you very much,
Laurie

Hi Laurie,

You should be able to set up each of the queue adapter files to use different fireworkers. The fireworkers in turn determine which jobs to pull.

First, you want to set up two my_fworker.yaml files. The first, my_fworker1.yaml, is configured to only pull FWs like Step 1 and Step 3. The second, my_fworker2.yaml, is configured to only pull FWs like Step 2. You can set up these two files using the guidance in “Controlling the directory and Worker of execution” in the FireWorks documentation. See the section “Controlling the Worker that executes a Firework” for options on how to connect my_fworker1.yaml to FWs 1 and 3, and my_fworker2.yaml to FW 2.
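For example, using the “category” method from that page (the worker names and category values here are just placeholders), the two worker files might look like:

# my_fworker1.yaml
name: one node fireworker
category: onenode
query: '{}'

# my_fworker2.yaml
name: two node fireworker
category: twonode
query: '{}'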

Next, you want your qadapter files to be linked to the appropriate fireworkers. That is, my_qadapter1.yaml should use my_fworker1.yaml so it only pulls jobs like FWs 1 and 3, and my_qadapter2.yaml should use my_fworker2.yaml so it only pulls jobs like FW 2. To do this you should do two things:

  1. When executing qlaunch, use both the -w and -q options together with the right correspondence, e.g.:

     qlaunch -q my_qadapter1.yaml -w my_fworker1.yaml ...
     qlaunch -q my_qadapter2.yaml -w my_fworker2.yaml ...

  2. Within each my_qadapter.yaml file, make sure that when the job “wakes up” it only pulls jobs from the correct FireWorker. So the rlaunch command inside my_qadapter1.yaml should contain the -w my_fworker1.yaml option, and the rlaunch command inside my_qadapter2.yaml should contain the -w my_fworker2.yaml option.
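Concretely, the two rocket_launch lines might look like this (assuming a my_launchpad.yaml as in the standard tutorials):

# inside my_qadapter1.yaml
rocket_launch: rlaunch -w my_fworker1.yaml -l my_launchpad.yaml singleshot

# inside my_qadapter2.yaml
rocket_launch: rlaunch -w my_fworker2.yaml -l my_launchpad.yaml singleshot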

Hopefully this all makes sense; let us know how it goes!


Btw, the other option (if you just want to execute a single qlaunch command rather than two of them) is to use reservation mode, but you mentioned that wasn’t the solution you wanted, so I didn’t cover it here.

Dear Anubhav,

Thank you very much for your fast and detailed reply, especially on a Friday afternoon!

I think I understood the second part of what you mentioned better than the first. I created a my_qadapter1.yaml and a my_qadapter2.yaml and made sure to include the name of the appropriate my_fworker.yaml file in each:

low-training/FireWorks/NERSC> cat my_qadapter1.yaml 
_fw_name: CommonAdapter
_fw_q_type: SLURM
rocket_launch: rlaunch -w my_fworker1.yaml -l my_launchpad.yaml singleshot
constraint: cpu
nodes: 1
ntasks: 1
account: nstaff
walltime: '00:05:00'
queue: debug
job_name: null
logdir: null
pre_rocket: null
post_rocket: null

I think this part makes sense, but please let me know if I got it wrong.

Next, I have created both a my_fworker1.yaml and a my_fworker2.yaml. I am a bit unsure of whether I have set _fworker and name correctly. I am trying to follow Method 1 on the “Controlling the worker” page:

low-training/FireWorks/NERSC> cat my_fworker1.yaml 
fws:
- fw_id: 1
  name: one_node
  spec:
    _fworker: one_node
    _tasks:
    - _fw_name: ScriptTask
      script: srun step_1_diabetes_preprocessing.py
- fw_id: 3
  name: one_node
  spec:
    _fworker: one_node
    _tasks:
    - _fw_name: ScriptTask
      script: srun step_3_diabetes_postprocessing.py
links:
  1:
  - 2
  2:
  - 3
metadata: {}

I wasn’t sure if the name/_fworker combination should be the same or different for tasks 1 and 3. Right now I am getting a KeyError with this setup, so I think what I’ve done is wrong:

low-training/FireWorks/NERSC> qlaunch -q my_qadapter1.yaml -w my_fworker1.yaml

# cutting out most of the traceback

  File "/global/common/software/das/stephey/conda/conda_envs/fireworks/lib/python3.9/site-packages/fireworks/core/fworker.py", line 55, in from_dict
    return FWorker(m_dict["name"], m_dict["category"], json.loads(m_dict["query"]), m_dict.get("env"))
KeyError: 'name'

Finally for completeness, here is my my_fworker2.yaml

low-training/FireWorks/NERSC> cat my_fworker2.yaml
fws:
- fw_id: 2
  name: step2
  spec:
    _fworker: step2
    _tasks:
    - _fw_name: ScriptTask
      script: srun step_2_diabetes_correlation.py
links:
  1:
  - 2
  2:
  - 3
metadata: {}

Final question: have I handled the dependencies correctly? Now that I’m splitting the tasks across two files, I wasn’t sure if I needed to modify the links part of each file.

Thank you again for your help,
Laurie

Hi Laurie,

The way you have set up the my_fworker1.yaml file, I believe there may be some confusion between FireWorks and FireWorkers (they sound really similar, but they are actually very different…)

A FireWorker does two things:

  • controls what jobs should be run on a machine
  • gives some machine-specific settings that may be needed for a job

It does not contain scripts to run, tasks, or anything else like that. So the first thing you need to do is start your my_fworker1.yaml file from scratch so that it doesn’t have a workflow in it. You should be able to look at:

https://materialsproject.github.io/fireworks/worker_tutorial.html

to see how to set up a basic my_fworker.yaml file. Then pick one of the methods I mentioned for connecting the specific FireWorks with the associated FireWorker. That link again is:

https://materialsproject.github.io/fireworks/controlworker.html

There are a few different ways to accomplish what you mentioned; to keep things simple, I suggest the “category” method, where there are two categories of jobs.
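With that method (using a hypothetical category name here), the worker file and the workflow are paired through a category value and a _category spec key:

# in my_fworker1.yaml
category: onenode

# in the workflow file, in the spec of each FW that should run on that worker
spec:
  _category: onenode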

Hopefully those links, plus the knowledge that the my_fworker.yaml file should be completely different from the workflow definitions, will get you on the right track…


Dear Anubhav,

Thank you very much for your quick reply and helpful information!!! I really appreciate it.

You were right: I was fundamentally misunderstanding FireWork vs. FireWorker. Thank you for helping me fix this and arrive at the correct understanding that the FireWorker is a kind of job-configuration file, while the FireWork describes the workflow itself.

With this missing piece, I was finally able to run my desired workflow, so thank you!

I do have one final question, though. With my current setup, I noticed that I have to issue 3 qlaunch commands to get all of my FireWorks to run.

qlaunch -q my_qadapter1.yaml -w my_fworker1.yaml singleshot #launches step 1
qlaunch -q my_qadapter2.yaml -w my_fworker2.yaml singleshot #launches step 2
qlaunch -q my_qadapter1.yaml -w my_fworker1.yaml singleshot #needed to launch step 3, which is in state WAITING

I was naively expecting that I would only have to issue two qlaunches, one for each queue adapter configuration. Is this a consequence of the workflow structure I have chosen, or is there some additional configuration setting/strategy I should use?

For completeness, here is my whole workflow:

stephey@perlmutter:login02:~> module load python
stephey@perlmutter:login02:~> conda activate fireworks
(/global/common/software/das/stephey/conda/conda_envs/fireworks) stephey@perlmutter:login02:~> cd /pscratch/sd/s/stephey/DOE-HPC-workflow-training/FireWorks/NERSC
(/global/common/software/das/stephey/conda/conda_envs/fireworks) stephey@perlmutter:login02:/pscratch/sd/s/stephey/DOE-HPC-workflow-training/FireWorks/NERSC> lpad reset
Are you sure? This will RESET 1 workflows and all data. (Y/N)y
2023-04-02 20:52:07,940 INFO Performing db tune-up
2023-04-02 20:52:08,001 INFO LaunchPad was RESET.
(/global/common/software/das/stephey/conda/conda_envs/fireworks) stephey@perlmutter:login02:/pscratch/sd/s/stephey/DOE-HPC-workflow-training/FireWorks/NERSC> cat fw_diabetes_wf.yaml 
fws:
- fw_id: 1
  spec:
    _category: onenode
    _tasks:
    - _fw_name: ScriptTask
      script: srun python step_1_diabetes_preprocessing.py
- fw_id: 2
  spec:
    _category: twonode
    _tasks:
    - _fw_name: ScriptTask
      script: srun -n 10 python step_2_diabetes_correlation.py
- fw_id: 3
  spec:
    _category: onenode
    _tasks:
    - _fw_name: ScriptTask
      script: srun python step_3_diabetes_postprocessing.py
links:
  1:
  - 2
  2:
  - 3
metadata: {}
(/global/common/software/das/stephey/conda/conda_envs/fireworks) stephey@perlmutter:login02:/pscratch/sd/s/stephey/DOE-HPC-workflow-training/FireWorks/NERSC> lpad add fw_diabetes_wf.yaml
2023-04-02 20:59:16,444 INFO Added a workflow. id_map: {1: 1, 2: 2, 3: 3}
(/global/common/software/das/stephey/conda/conda_envs/fireworks) stephey@perlmutter:login02:/pscratch/sd/s/stephey/DOE-HPC-workflow-training/FireWorks/NERSC> cat my_fworker1.yaml
name: one node fireworker
category: onenode
query: '{}'
(/global/common/software/das/stephey/conda/conda_envs/fireworks) stephey@perlmutter:login02:/pscratch/sd/s/stephey/DOE-HPC-workflow-training/FireWorks/NERSC> cat my_qadapter1.yaml 
_fw_name: CommonAdapter
_fw_q_type: SLURM
rocket_launch: rlaunch -w my_fworker1.yaml -l my_launchpad.yaml singleshot
constraint: cpu
nodes: 1
ntasks: 1
account: nstaff
walltime: '00:05:00'
queue: debug
job_name: null
logdir: null
pre_rocket: null
post_rocket: null
(/global/common/software/das/stephey/conda/conda_envs/fireworks) stephey@perlmutter:login02:/pscratch/sd/s/stephey/DOE-HPC-workflow-training/FireWorks/NERSC> qlaunch -q my_qadapter1.yaml -w my_fworker1.yaml singleshot
2023-04-02 21:00:33,387 INFO moving to launch_dir /pscratch/sd/s/stephey/DOE-HPC-workflow-training/FireWorks/NERSC
2023-04-02 21:00:33,390 INFO submitting queue script
2023-04-02 21:00:34,003 INFO Job submission was successful and job_id is 6895033
(/global/common/software/das/stephey/conda/conda_envs/fireworks) stephey@perlmutter:login02:/pscratch/sd/s/stephey/DOE-HPC-workflow-training/FireWorks/NERSC> lpad get_fws
[
    {
        "fw_id": 1,
        "created_on": "2023-04-03T03:59:16.437426",
        "updated_on": "2023-04-03T04:00:43.350942",
        "state": "COMPLETED",
        "name": "Unnamed FW"
    },
    {
        "fw_id": 2,
        "created_on": "2023-04-03T03:59:16.437581",
        "updated_on": "2023-04-03T04:00:43.352286",
        "state": "READY",
        "name": "Unnamed FW"
    },
    {
        "fw_id": 3,
        "created_on": "2023-04-03T03:59:16.437671",
        "updated_on": "2023-04-03T03:59:16.437671",
        "name": "Unnamed FW",
        "state": "WAITING"
    }
]
(/global/common/software/das/stephey/conda/conda_envs/fireworks) stephey@perlmutter:login02:/pscratch/sd/s/stephey/DOE-HPC-workflow-training/FireWorks/NERSC> qlaunch -q my_qadapter2.yaml -w my_fworker2.yaml singleshot
2023-04-02 21:01:03,996 INFO moving to launch_dir /pscratch/sd/s/stephey/DOE-HPC-workflow-training/FireWorks/NERSC
2023-04-02 21:01:03,997 INFO submitting queue script
2023-04-02 21:01:04,101 INFO Job submission was successful and job_id is 6895035
(/global/common/software/das/stephey/conda/conda_envs/fireworks) stephey@perlmutter:login02:/pscratch/sd/s/stephey/DOE-HPC-workflow-training/FireWorks/NERSC> lpad get_fws
[
    {
        "fw_id": 1,
        "created_on": "2023-04-03T03:59:16.437426",
        "updated_on": "2023-04-03T04:00:43.350942",
        "state": "COMPLETED",
        "name": "Unnamed FW"
    },
    {
        "fw_id": 2,
        "created_on": "2023-04-03T03:59:16.437581",
        "updated_on": "2023-04-03T04:01:09.537061",
        "state": "RUNNING",
        "name": "Unnamed FW"
    },
    {
        "fw_id": 3,
        "created_on": "2023-04-03T03:59:16.437671",
        "updated_on": "2023-04-03T03:59:16.437671",
        "name": "Unnamed FW",
        "state": "WAITING"
    }
]
(/global/common/software/das/stephey/conda/conda_envs/fireworks) stephey@perlmutter:login02:/pscratch/sd/s/stephey/DOE-HPC-workflow-training/FireWorks/NERSC> qlaunch -q my_qadapter1.yaml -w my_fworker1.yaml singleshot
2023-04-02 21:01:22,716 INFO moving to launch_dir /pscratch/sd/s/stephey/DOE-HPC-workflow-training/FireWorks/NERSC
2023-04-02 21:01:22,718 INFO submitting queue script
2023-04-02 21:01:22,810 INFO Job submission was successful and job_id is 6895040
(/global/common/software/das/stephey/conda/conda_envs/fireworks) stephey@perlmutter:login02:/pscratch/sd/s/stephey/DOE-HPC-workflow-training/FireWorks/NERSC> lpad get_fws
[
    {
        "fw_id": 1,
        "created_on": "2023-04-03T03:59:16.437426",
        "updated_on": "2023-04-03T04:00:43.350942",
        "state": "COMPLETED",
        "name": "Unnamed FW"
    },
    {
        "fw_id": 2,
        "created_on": "2023-04-03T03:59:16.437581",
        "updated_on": "2023-04-03T04:01:11.994916",
        "state": "COMPLETED",
        "name": "Unnamed FW"
    },
    {
        "fw_id": 3,
        "created_on": "2023-04-03T03:59:16.437671",
        "updated_on": "2023-04-03T04:01:30.222269",
        "state": "COMPLETED",
        "name": "Unnamed FW"
    }
]
(/global/common/software/das/stephey/conda/conda_envs/fireworks) stephey@perlmutter:login02:/pscratch/sd/s/stephey/DOE-HPC-workflow-training/FireWorks/NERSC> 

Thank you again,
Laurie

Hi Laurie,

As always, there are a few different ways to do things depending on what exactly you want to do. The two ways to minimize qlaunch commands are:

(1) Have the qlaunch command submit more than one job and/or
(2) run multiple Fireworks per queue submission, so that a single queued job runs multiple things

The simplest solution to understand is (1), so let’s start with that. If you replace qlaunch singleshot with qlaunch rapidfire, it will submit more than one job to the queue. The rapidfire command can also do things like loop, e.g., submit a set number of jobs to the queue, wait a while, and then submit more, or maintain a certain number of jobs in the queue. See the help docs via qlaunch rapidfire -h for more details on the various options.

Unfortunately, when you don’t run in reservation mode, there is not really a 1:1 correspondence between job submissions and items to run in the database. So you may end up submitting many more queued jobs than can actually be run, and those jobs will wake up at NERSC, find nothing to run, and then quit. That is something you just have to live with if you are not running in reservation mode. Some people actually prefer this behavior because they can submit a bunch of jobs to NERSC and, while those jobs are aging in the queue, populate and/or modify their database of jobs. Anyway, solution #1 is to use qlaunch rapidfire.
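Concretely, each of your two qlaunch commands could become something like this (flag names per qlaunch rapidfire -h; the values here are arbitrary):

qlaunch -q my_qadapter1.yaml -w my_fworker1.yaml rapidfire -m 2 --nlaunches infinite

where -m caps the number of jobs kept in the queue and --nlaunches infinite keeps looping until you stop it.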

Solution #2 is to run multiple jobs per queued submission. You can do this by replacing rlaunch singleshot in your my_qadapter.yaml file with rlaunch rapidfire. However, in your case this will be a bit trickier to get right because you are alternating nodes. I.e., your onenode job will wake up, launch FW1 on that node, and then technically it needs to wait around for FW2 to complete in the twonode job before it can proceed to running FW3 on that node. Depending on the rlaunch rapidfire parameters you use, it might instead quit after running FW1 because there are no eligible FWs to run (FW3 is ineligible because it depends on FW2 completing).
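If you do try it, the rocket_launch line in my_qadapter1.yaml would become something like this (flags per rlaunch rapidfire -h; the timeout value is arbitrary):

rocket_launch: rlaunch -w my_fworker1.yaml -l my_launchpad.yaml rapidfire --timeout 240

so that the queued job keeps pulling eligible FWs until the timeout instead of exiting after one.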

Anyway, hopefully this gets you in the right direction. A lot of this is complicated to explain as a set of instructions, but easier if you can understand the fundamentals of what is going on under the hood with qlaunch and rlaunch.


Dear Anubhav,

Thanks for this information and apologies for the long delay in my reply. I think I have a better understanding of the relationship between qlaunch and rlaunch and also singleshot/rapidfire.

If you are interested, you can see the first draft of the FireWorks NERSC tutorial materials we have prepared for our training event on April 12.

I have a few hopefully final questions. I would like to demonstrate launching this workflow with the first option you described in your previous post (qlaunch rapidfire). I think this option will be appealing to most NERSC users since it makes more efficient use of their node hours. My idea was to launch it with something like

lpad reset
lpad add fw_diabetes_wf.yaml
qlaunch -q my_qadapter1.yaml -w my_fworker1.yaml rapidfire & qlaunch -q my_qadapter2.yaml -w my_fworker2.yaml rapidfire

When I do it like this, though, the first task is COMPLETED but the second task never makes it past READY. Is something conceptually wrong with issuing two rapidfire commands like this?

I have also experimented with running the second command with qlaunch singleshot, and this does work. However, I can’t say I like this option very much since we have to wait to make sure the first task has completed. Is mixing qlaunch rapidfire with qlaunch singleshot the best option for this case, or is there a better approach?

Thank you very much,
Laurie

Hi Anubhav,

Actually, I think I answered my own question. The issue seems to be that I was specifying the my_fworker.yaml file both on the command line and inside the queue adapter, and that was maybe causing a conflict.

Doing this seems to do exactly what I wanted:

lpad reset
lpad add fw_diabetes_wf.yaml 
qlaunch -q my_qadapter1.yaml rapidfire & qlaunch -q my_qadapter2.yaml rapidfire

All of my FireWorks launched and completed successfully with this set of commands.

Further advice is welcome, but thank you so much! I really appreciate all your help in putting together this demo for our tutorial.

Best regards,
Laurie

I’m delighted to see @lastephey here – mere hours before I was going to start googling the same question :smiley: Thank you so much; you and @Anubhav_Jain have helped me figure this out.

In case this is of interest to you or any other readers, I am doing basically the same thing as Laurie, but in Python. The application I’m working on has job parameters that are generated dynamically (e.g., if a run is big, we might want a longer walltime or more resources), so building and submitting these jobs from Python seems the natural thing to do.

Here is a basic example which launches 2 MPI programs (in a linear DAG), with the second program needing twice as many nodes as the first: alcc-recipes/fireworks/test_3_python at jpb/fireworks · JBlaschke/alcc-recipes · GitHub
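The core of that pattern, stripped down (the script names and category values are hypothetical placeholders; the full recipe is in the repo linked above), looks roughly like:

from fireworks import Firework, LaunchPad, ScriptTask, Workflow

# two MPI steps in a linear DAG; the second needs twice the resources
fw1 = Firework(ScriptTask.from_str("srun -n 2 ./mpi_program_1"),
               name="step_1", spec={"_category": "n2"})
fw2 = Firework(ScriptTask.from_str("srun -n 4 ./mpi_program_2"),
               name="step_2", spec={"_category": "n4"})

# the links dict makes fw2 run only after fw1 completes
wf = Workflow([fw1, fw2], {fw1: [fw2]})
LaunchPad.auto_load().add_wf(wf)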

One more thing I want to build is a generator for the my_fworker_$n.yaml files – I just spent 30 minutes not seeing a typo there. Is there a way to specify the FireWorker on the command line instead of in a specific file? E.g., instead of:

rocket_launch="rlaunch -l my_launchpad.yaml -w my_fworker_1.yaml singleshot"

have something like:

rocket_launch=f"rlaunch -l my_launchpad.yaml -w {fworker_str} singleshot"

where fworker_str contains a string representation of the FWorker class?
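(In the meantime, one way to at least generate those files programmatically, assuming FWorker’s standard FWSerializable helpers, might be:)

from fireworks import FWorker

# hypothetical generator for the my_fworker_$n.yaml files;
# to_file() should infer the YAML format from the extension
for n in (1, 4):
    FWorker(name=f"mpi_{n}_fworker", category=f"n{n}").to_file(f"my_fworker_{n}.yaml")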

I think I spoke too soon – my second launch_rocket_to_queue here:

launch_rocket_to_queue(launchpad, FWorker(name="mpi_4_fworker", category="n4"), qadapter_2)

raises the error:

2023-04-09 23:24:19,789 INFO No jobs exist in the LaunchPad for submission to queue!

unless there is already a FireWork that satisfies its requirements. Is it possible that, when the workflow is initialized, the second FireWork is in the WAITING state (as it should be, since it depends on the first one), but for some reason the FireWorker doesn’t stick around until that second FireWork becomes ready?

Any advice would be appreciated.

One more observation: if I add multiple workflows to the LaunchPad, I noticed that the Slurm jobs quit after the first workflow is done. So this made me re-evaluate whether launch_rocket_to_queue is the right thing to do here.

So I switched to rapidfire instead. Works like a charm :slight_smile: – I noticed that it has quite a greedy strategy, though: it “launches” way more rockets than it needs, so while jobs are waiting in the queue, new jobs (which will never be needed) keep being added. I guess this is nice, because those jobs can wait in line for you while you prepare new ones. But ideally I would want to dial this back a bit (think of the situation where a large job takes hours to start, resulting in thousands of “dud” jobs in the queue).

Hi,

I’m trying to figure out what’s resolved and what’s still in question, but here’s my best attempt:

  1. Sorry, I don’t know a way in the current code to dynamically set the FireWorker via a string on the command line. But instead of running a bash command to run qlaunch, you could write a Python script that drives the qlaunch rapidfire machinery, and in that script you can do whatever you want, including setting up different FireWorkers dynamically (see the sketch after this list). Note that the bottom of most tutorials contains Python examples.

  2. If there are too many jobs being launched, you have two options. The first is to use the nlaunches or maxjobs_queue parameters to control the number of things being submitted; see qlaunch rapidfire -h for details. The second is to use reservation mode; see the docs for details.
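Here is a minimal sketch of option (1), assuming the helper names in fireworks.queue.queue_launcher and fireworks.utilities.fw_serializers behave as in the current release:

from fireworks import FWorker, LaunchPad
from fireworks.queue.queue_launcher import rapidfire
from fireworks.utilities.fw_serializers import load_object_from_file

launchpad = LaunchPad.from_file("my_launchpad.yaml")
qadapter = load_object_from_file("my_qadapter1.yaml")  # e.g. a SLURM CommonAdapter

# the FWorker is built dynamically in code rather than read from a file
fworker = FWorker(name="one node fireworker", category="onenode")

# submit queued jobs for this worker/adapter pair
rapidfire(launchpad, fworker, qadapter, nlaunches=1, sleep_time=60)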

Laurie, thank you for putting together the tutorial. Regrettably, I don’t know if I’ll be able to review it in time, but I’ll pass it along in case someone else is able to take a look.

Btw, one more note: if you find yourself dynamically creating FireWorkers, it’s highly likely that the better solution is to run in reservation mode and have the fireworks themselves include the queue parameters they need. Then a single qlaunch/FireWorker can handle heterogeneous queue parameters. More info is in the docs; however, for this thread I believe the OP specified they didn’t want to use reservation mode.
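(In reservation mode, each Firework can carry its own queue overrides via the reserved _queueadapter spec key; a hypothetical example:)

spec:
  _queueadapter:
    nodes: 2
    walltime: '00:10:00'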

Hi Anubhav,

Right: since we are teaching FireWorks to what we expect will be a novice audience, I didn’t want to cover reservation mode.

No need to provide any feedback; we just wanted to share with you what we’ve developed. We’re continuing to update our repo, but you’ll eventually be able to find all our slides and training materials here: DOE-HPC-workflow-training/FireWorks at main · CrossFacilityWorkflows/DOE-HPC-workflow-training · GitHub

For now they are still in this branch: GitHub - CrossFacilityWorkflows/DOE-HPC-workflow-training at nersc-fireworks

Thank you again,
Laurie

Hi Anubhav,

Yes! Setting sleep_time=60 does the trick for me.

Also I have opened a PR to include a --json flag which lets the user pass the LaunchPad and FWorker objects to rlaunch via a json-formatted string: Enable launchpad and fireworker config be passed via json string by JBlaschke · Pull Request #499 · materialsproject/fireworks · GitHub

FireWorks already has all the machinery to format its internals as JSON strings, so it wasn’t complicated.