Queue mode always submitting jobs to Slurm

Hi,

Why is an example like this in Python always submitting jobs to Slurm?

fw1 = Firework(ScriptTask.from_str('echo hello'), name="hello")
fw2 = Firework(ScriptTask.from_str('echo stuff'), name="stuff")
fw3 = Firework(ScriptTask.from_str('echo goodbye'), name="goodbye")
wf = Workflow([fw1, fw2, fw3], {fw1: fw2, fw2: fw3}, name="Basic pipeline workflow")

launchpad.add_wf(wf)
rapidfire(launchpad, FWorker(), CommonAdapter("SLURM", "fireworks_queue", rocket_launch="rlaunch singleshot"), reserve=False)

Maybe I don't understand queue mode well, but isn't rapidfire supposed to listen for a job to appear in Mongo and only then insert it into Slurm? Why does a new job show up in squeue every few seconds?

Thanks,

Hi @ixdi,

You are right that rapidfire through a queue pulls calculations that are ready from the LaunchPad until the job finishes, but this is not the whole story.
I am not completely sure how it works from the Python shell, but if you do this from the command line, qlaunch rapidfire will keep submitting jobs as well unless you limit their number. This is done with the -m flag, e.g. qlaunch rapidfire -m 4 will launch jobs until 4 are in the queue. The command will then continue to run and monitor the queue, filling it up again whenever a job finishes. This is of course pretty useful in high-throughput, because you will not need to re-submit jobs continuously.
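I have not tested this from Python, but my guess is that the njobs_queue argument of the queue launcher's rapidfire function plays the same role as the -m flag; something like:

# Untested sketch: cap the queue at 4 jobs instead of submitting without limit.
# `launchpad` is the LaunchPad object from your snippet above.
from fireworks.queue.queue_launcher import rapidfire
from fireworks.user_objects.queue_adapters.common_adapter import CommonAdapter
from fireworks.core.fworker import FWorker

rapidfire(launchpad, FWorker(),
          CommonAdapter("SLURM", "fireworks_queue", rocket_launch="rlaunch singleshot"),
          reserve=False, njobs_queue=4)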

If you post a fully working example with the necessary imports, I can also look at the behaviour of the python script more closely if you want. You might also want to read the relevant section in the FireWorks tutorial.

Cheers, Michael

Hi @mwo,

I attach a full example. I have created a test database only for this example in Mongo Atlas and it's empty now (you can do lpad reset if you want). The example runs as I copy it here (I will remove this database after some time). When the script is executed, Slurm starts to accept jobs, but why are there more jobs than the 3 scripts that make up the workflow? In my case, Slurm accepts 20 jobs and I have no idea what they represent. Maybe I'm doing something wrong.

test.py

from fireworks import Firework, Workflow, LaunchPad, ScriptTask, PyTask
from fireworks.queue.queue_launcher import rapidfire
from fireworks.user_objects.queue_adapters.common_adapter import CommonAdapter
from fireworks.core.fworker import FWorker

sc1 = ScriptTask.from_str("echo hello")
sc2 = ScriptTask.from_str("echo stuff")
sc3 = ScriptTask.from_str("echo goodbye")

fw1 = Firework(tasks=sc1)
fw2 = Firework(tasks=sc2)
fw3 = Firework(tasks=sc3)

wf = Workflow([fw1, fw2, fw3], links_dict={fw1: fw2, fw2: fw3})

MONGO_HOST="mongodb+srv://dbUser:[email protected]/test?retryWrites=true&w=majority"
launchpad = LaunchPad(uri_mode=True, host=MONGO_HOST, logdir="/shared/testing/logs", strm_lvl="INFO", name="test", port=27017)
launchpad.add_wf(wf)

rapidfire(launchpad, FWorker(), CommonAdapter("SLURM", "fireworks_queue", rocket_launch="rlaunch singleshot"), reserve=False)

FW_config.yaml

LAUNCHPAD_LOC: '/shared/my_launchpad.yaml'
FWORKER_LOC: '/shared/my_fworker.yaml'
QUEUEADAPTER_LOC: '/shared/my_qadapter.yaml'
SORT_FWS: 'FIFO'
RAPIDFIRE_SLEEP_SECS: 5

my_launchpad.yaml

authsource: admin
host: mongodb+srv://dbUser:[email protected]/test?retryWrites=true&w=majority
logdir: null
mongoclient_kwargs: {}
name: nextmol
password: null
port: null

my_qadapter.yaml

_fw_name: CommonAdapter
_fw_q_type: SLURM
rocket_launch: rlaunch singleshot
ntasks: 1
cpus_per_task: 1
ntasks_per_node: 1
walltime: '00:02:00'
queue: null
account: null
job_name: null
logdir: /shared/logs
pre_rocket: null
post_rocket: null

# You can override commands by uncommenting and changing the following lines:
# _q_commands_override:
#    submit_cmd: my_qsubmit
#    status_cmd: my_qstatus

#You can also supply your own template by uncommenting and changing the following line:
#template_file: /full/path/to/template

I’m running from a cluster in AWS.

Thank you!

Sergi

Hi again,

Thanks for all the info. I have not tried it myself, but I think I now know what goes "wrong" (even though it is actually doing the correct thing). Now that I can see the source of the rapidfire call at the end of your script, it is easier to answer.

Look at the function definition and its docstring:

def rapidfire(launchpad, fworker, qadapter, launch_dir='.', nlaunches=0, njobs_queue=0,
              njobs_block=500, sleep_time=None, reserve=False, strm_lvl='INFO', timeout=None,
              fill_mode=False):
    """
    Submit many jobs to the queue.

    Args:
        launchpad (LaunchPad)
        fworker (FWorker)
        qadapter (QueueAdapterBase)
        launch_dir (str): directory where we want to write the blocks
        nlaunches (int): total number of launches desired; "infinite" for loop, 0 for one round
        njobs_queue (int): stops submitting jobs when njobs_queue jobs are in the queue, 0 for no limit.
            If 0 skips the check on the number of jobs in the queue.
        njobs_block (int): automatically write a new block when njobs_block jobs are in a single block
        sleep_time (int): secs to sleep between rapidfire loop iterations
        reserve (bool): Whether to queue in reservation mode
        strm_lvl (str): level at which to stream log messages
        timeout (int): # of seconds after which to stop the rapidfire process
        fill_mode (bool): whether to submit jobs even when there is nothing to run (only in
            non-reservation mode)
    """

The default for njobs_queue is 0, which means that an unlimited number of jobs will be submitted, even if nlaunches = 0, which should limit the submission to 'one round', whatever that means. (I suspect that with e.g. nlaunches = 4 and njobs_queue = 2, it will launch jobs until it has 2 in the queue, then wait for sleep_time, and do the same thing 3 more times.) If you only want one job to run (which will then execute all of your fireworks), just replace your last line with:

rapidfire(launchpad, FWorker(), CommonAdapter("SLURM", "fireworks_queue", rocket_launch="rlaunch rapidfire"), reserve=False, njobs_queue=1)

I think the main misunderstanding here is the distinction between a job on the cluster and a Firework on the LaunchPad. At the time of submission, the job has no idea what calculations are to be run. Once it starts, it simply executes an rlaunch rapidfire ... command, which starts pulling calculations in the READY state from the LaunchPad.
There is also the difference between qlaunch rapidfire (launch many jobs) and rlaunch rapidfire (execute many fireworks). I think you wanted qlaunch singleshot, which submits one job that then executes rlaunch rapidfire on the compute node and thus deals with all 3 FWs that you have.
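If you prefer to stay in Python rather than the command line, I believe the qlaunch singleshot analogue is launch_rocket_to_queue from the same queue_launcher module (please double-check the signature against your FireWorks version). A minimal sketch, reusing the launchpad object from your test.py:

from fireworks.queue.queue_launcher import launch_rocket_to_queue
from fireworks.user_objects.queue_adapters.common_adapter import CommonAdapter
from fireworks.core.fworker import FWorker

# Submit exactly ONE Slurm job; on the compute node it runs "rlaunch rapidfire",
# which then works through all READY fireworks on the LaunchPad (your 3 FWs in order).
qadapter = CommonAdapter("SLURM", "fireworks_queue", rocket_launch="rlaunch rapidfire")
launch_rocket_to_queue(launchpad, FWorker(), qadapter, reserve=False)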

If you want each job to only execute a single FW (not very practical), you can also get the work done by launching 3 jobs with qlaunch rapidfire -m 3 (or the equivalent Python call with njobs_queue=3) and setting rocket_launch="rlaunch singleshot", as sketched below.
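A rough, untested Python version of that, reusing the imports and launchpad from your test.py (I am assuming njobs_queue behaves like the -m flag and caps the number of queued jobs):

rapidfire(launchpad, FWorker(),
          CommonAdapter("SLURM", "fireworks_queue", rocket_launch="rlaunch singleshot"),
          reserve=False, njobs_queue=3)
# Each of the 3 submitted jobs runs "rlaunch singleshot" and therefore executes exactly one Firework.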

So it is totally fine for FireWorks to launch 20 jobs with only 3 FWs on the LaunchPad. Maybe once they start running you have already added 100 workflows with 1000 Fireworks each, and now all of these jobs have something to do! I think this is actually one of the best features of FireWorks, because you can significantly reduce total queuing time if you use it correctly.

Thank you very much for the clarifications. I understand it better now.

In AWS the idea is that nodes are started only when there is a job to run, but with this FireWorks way of working they will always be started even if they have nothing to do. I will look for another way of doing this. Thanks again.

As I said, it is really easy to only start a single job, but if you are not really running high-throughput calculations with a lot of tasks to complete, FireWorks is probably not ideal.