Queue mode always submitting jobs to Slurm

Hi,

Why does an example like this in Python keep submitting jobs to Slurm?

fw1 = Firework(ScriptTask.from_str('echo hello'), name="hello")
fw2 = Firework(ScriptTask.from_str('echo stuff'), name="stuff")
fw3 = Firework(ScriptTask.from_str('echo goodbye'), name="goodbye")
wf = Workflow([fw1, fw2, fw3], {fw1: fw2, fw2: fw3}, name="Basic pipeline workflow")

launchpad.add_wf(wf)
rapidfire(launchpad, FWorker(), CommonAdapter("SLURM", "fireworks_queue", rocket_launch="rlaunch singleshot"), reserve=False)

Maybe I don't understand queue mode well, but isn't rapidfire supposed to watch MongoDB for a ready job and only then submit it to Slurm? Why does a new job appear in squeue every few seconds?

Thanks,

Hi @ixdi,

You are right that rapidfire through a queue pulls calculations that are ready from the launchpad until the job finishes, but this is not the whole story.
I am not completely sure how it works in the Python shell, but if you do this from the command line, qlaunch rapidfire will also continue to submit jobs unless you limit their number. On the command line this is done with the -m flag, e.g. qlaunch rapidfire -m 4 will launch jobs until 4 are in the queue. The command will then continue to run and monitor the queue, filling it up again whenever a job finishes. This is of course pretty useful for high-throughput work, because you do not need to keep re-submitting jobs yourself.
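In Python, the rough equivalent of the -m flag should be the njobs_queue argument of the queue_launcher rapidfire function. A minimal sketch, assuming a launchpad object and a queue name like the ones in your snippet (both are placeholders here):

from fireworks.core.fworker import FWorker
from fireworks.queue.queue_launcher import rapidfire
from fireworks.user_objects.queue_adapters.common_adapter import CommonAdapter

# launchpad is assumed to be the LaunchPad object you already created;
# "fireworks_queue" is the queue name taken from your snippet
qadapter = CommonAdapter("SLURM", "fireworks_queue", rocket_launch="rlaunch singleshot")

# keep looping indefinitely, but never keep more than 4 jobs in the SLURM queue at once
rapidfire(launchpad, FWorker(), qadapter, nlaunches="infinite", njobs_queue=4, reserve=False)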

If you post a fully working example with the necessary imports, I can also look at the behaviour of the python script more closely if you want. You might also want to read the relevant section in the FireWorks tutorial.

Cheers, Michael

Hi @mwo,

I attach a full example. I have created a test database in MongoDB Atlas just for this example and it is empty now (you can do lpad reset if you want). The example runs exactly as I copy it here (I will remove this database after some time). When the script is executed, Slurm starts to accept jobs, but why are there more jobs than the 3 scripts that make up the workflow? In my case, Slurm accepts 20 jobs and I have no idea what they represent. Maybe I'm doing something wrong.

test.py

from fireworks import Firework, Workflow, LaunchPad, ScriptTask, PyTask
from fireworks.queue.queue_launcher import rapidfire
from fireworks.user_objects.queue_adapters.common_adapter import CommonAdapter
from fireworks.core.fworker import FWorker

sc1 = ScriptTask.from_str("echo hello")
sc2 = ScriptTask.from_str("echo stuff")
sc3 = ScriptTask.from_str("echo goodbye")

fw1 = Firework(tasks=sc1)
fw2 = Firework(tasks=sc2)
fw3 = Firework(tasks=sc3)

wf = Workflow([fw1, fw2, fw3], links_dict={fw1: fw2, fw2: fw3})

MONGO_HOST="mongodb+srv://dbUser:[email protected]/test?retryWrites=true&w=majority"
launchpad = LaunchPad(uri_mode=True, host=MONGO_HOST, logdir="/shared/testing/logs", strm_lvl="INFO", name="test", port=27017)
launchpad.add_wf(wf)

rapidfire(launchpad, FWorker(), CommonAdapter("SLURM", "fireworks_queue", rocket_launch="rlaunch singleshot"), reserve=False)

FW_config.yaml

LAUNCHPAD_LOC: '/shared/my_launchpad.yaml'
FWORKER_LOC: '/shared/my_fworker.yaml'
QUEUEADAPTER_LOC: '/shared/my_qadapter.yaml'
SORT_FWS: 'FIFO'
RAPIDFIRE_SLEEP_SECS: 5

my_launchpad.yaml

authsource: admin
host: mongodb+srv://dbUser:[email protected]/test?retryWrites=true&w=majority
logdir: null
mongoclient_kwargs: {}
name: nextmol
password: null
port: null

my_qadapter.yaml

_fw_name: CommonAdapter
_fw_q_type: SLURM
rocket_launch: rlaunch singleshot
ntasks: 1
cpus_per_task: 1
ntasks_per_node: 1
walltime: '00:02:00'
queue: null
account: null
job_name: null
logdir: /shared/logs
pre_rocket: null
post_rocket: null

# You can override commands by uncommenting and changing the following lines:
# _q_commands_override:
#    submit_cmd: my_qsubmit
#    status_cmd: my_qstatus

#You can also supply your own template by uncommenting and changing the following line:
#template_file: /full/path/to/template

I’m running from a cluster in AWS.

Thank you!

Sergi

Hi again,

Thanks for all the info. I have not tried it myself, but I think I now know what goes "wrong" (even though it is actually doing the correct thing). Now that I can see the source of the rapidfire line at the end of your script, it is easier to answer.

Look at the function definition and its docstring:

def rapidfire(launchpad, fworker, qadapter, launch_dir='.', nlaunches=0, njobs_queue=0,
              njobs_block=500, sleep_time=None, reserve=False, strm_lvl='INFO', timeout=None,
              fill_mode=False):
    """
    Submit many jobs to the queue.

    Args:
        launchpad (LaunchPad)
        fworker (FWorker)
        qadapter (QueueAdapterBase)
        launch_dir (str): directory where we want to write the blocks
        nlaunches (int): total number of launches desired; "infinite" for loop, 0 for one round
        njobs_queue (int): stops submitting jobs when njobs_queue jobs are in the queue, 0 for no limit.
            If 0 skips the check on the number of jobs in the queue.
        njobs_block (int): automatically write a new block when njobs_block jobs are in a single block
        sleep_time (int): secs to sleep between rapidfire loop iterations
        reserve (bool): Whether to queue in reservation mode
        strm_lvl (str): level at which to stream log messages
        timeout (int): # of seconds after which to stop the rapidfire process
        fill_mode (bool): whether to submit jobs even when there is nothing to run (only in
            non-reservation mode)
    """

The default for njobs_queue is 0, which means that an unlimited number of jobs will be submitted, even if nlaunches = 0, which should limit the submission to 'one round', whatever that means. (I suspect that with e.g. nlaunches = 4 and njobs_queue = 2, it will launch jobs until it has 2 in the queue, then wait for sleep_time, and do the same thing 3 more times.) If you only want a single job to run (which will then execute all of your fireworks), just replace your last line with:

rapidfire(launchpad, FWorker(), CommonAdapter("SLURM", "fireworks_queue", rocket_launch="rlaunch rapidfire"), reserve=False, njobs_queue=1)

I think the main misunderstanding here is the difference between a job on the cluster and a Firework on the LaunchPad. At submission time the job has no idea which calculations are to be run. Once it starts, it simply executes an rlaunch rapidfire ... command, which starts pulling calculations from the launchpad that are in the READY state.
There is also the difference between qlaunch rapidfire (launch many queue jobs) and rlaunch rapidfire (execute many fireworks). I think you wanted to do qlaunch singleshot, which submits a single job that then executes rlaunch rapidfire on the compute node and thus deals with all 3 FWs that you have.
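In Python, the rough equivalent of qlaunch singleshot should be launch_rocket_to_queue from the same queue_launcher module. A minimal sketch, assuming the launchpad object and queue name from your script:

from fireworks.core.fworker import FWorker
from fireworks.queue.queue_launcher import launch_rocket_to_queue
from fireworks.user_objects.queue_adapters.common_adapter import CommonAdapter

# submit exactly one SLURM job; once it starts on the compute node it runs
# "rlaunch rapidfire" and works through all READY fireworks on the launchpad
qadapter = CommonAdapter("SLURM", "fireworks_queue", rocket_launch="rlaunch rapidfire")
launch_rocket_to_queue(launchpad, FWorker(), qadapter, reserve=False)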

If you want each job to execute only a single FW (not very practical), you can also get the work done by launching 3 jobs with qlaunch rapidfire -m 3 (or the equivalent Python call with njobs_queue=3, see below) and setting rocket_launch="rlaunch singleshot".
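With the same imports as above, that Python call would look roughly like this (nlaunches=3 is my addition to stop the loop after the three submissions):

# 3 SLURM jobs, each executing exactly one firework via "rlaunch singleshot"
rapidfire(launchpad, FWorker(),
          CommonAdapter("SLURM", "fireworks_queue", rocket_launch="rlaunch singleshot"),
          nlaunches=3, njobs_queue=3, reserve=False)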

So it is totally fine for FireWorks to launch 20 jobs with only 3 FWs on the LaunchPad. Maybe by the time they start running, you have already added 100 workflows with 1000 Fireworks each, and now all of these jobs have something to do! I think this is actually one of the best features of FireWorks, because you can significantly reduce the total queuing time if you use it correctly.

Thank you very much for the clarifications. I understand it better now.

In AWS the idea is that nodes are started only when there is a job to run, but with this FireWorks way of working they will always be started even if they have nothing to do. I will look for another way of doing this. Thanks again.

As I said, it is really easy to start only a single job, but if you are not running high-throughput calculations with a lot of tasks to complete, FireWorks is probably not ideal.

Hello. I am trying to run the FireWorks commands via Python. My code is similar to the one mentioned above, and I have imported all the necessary libraries. When I launch the job through Python, I get the job ID back from the cluster, but the error file generated after launching contains the error 'bash: rlaunch command not found'. Any idea what the problem could be here?

Hi, I would suspect that the FireWorks installation is not correct on the cluster if the rlaunch command is not available. Maybe you have to load your virtual Python environment on the compute node? Did you use a virtual environment to install FireWorks? Did you configure everything correctly (e.g. my_qadapter.yaml)? Can you maybe try to start an interactive session on your compute node and check whether your rlaunch and lpad commands work there?

Generally it would help if you could provide more context about what you are trying to do, and how you are doing it.

Yes, I used a virtual environment to install FireWorks. And yes, all the commands (lpad add, rlaunch, qlaunch, etc.) work fine when I run them from the Linux command line; I have completed tasks that way. What I am trying to do now is run the same steps from Python, for example adding a firework with launchpad.add_wf(firework) and launching it with launch_rocket(launchpad); both of these work fine. Now I need to perform the qlaunch step, i.e. submit the jobs to the supercomputer through Slurm using the qadapter. I managed to submit the jobs to the cluster via Python and my code returns the job ID from the cluster, but the FW error file that is generated shows bash: rlaunch command not found. I guess the rlaunch command in my qadapter file is not being found.

This is my qadapter file; it is the same file I use to run qlaunch from the command line, where it works fine.

Hi, thanks for the additional info.

Just so I am 100% positive I understand:

  • You can add workflows to the launchpad through python.
  • You successfully used rlaunch to start executing jobs from the command line.
  • You were also able to use a Python script with a qlaunch-equivalent command, and the job gets submitted correctly.
  • The job ID is returned by your code, but the job fails with:
    bash: rlaunch command not found

To me this still looks like your virtual environment is not working on the compute node, but it is working on the login node of the cluster. Are you running your python scripts and rlaunch commands on a login node?

I would try to modify the pre_rocket (and post_rocket) fields in your qadapter file to activate the virtual environment and load any other modules you might need to run your code. I, for example, load some modules and activate the conda environment (TriboFlow) where I have FireWorks installed, and I deactivate the environment and unload the modules again after the rlaunch command has finished:

pre_rocket: module purge; module load intel/19.0.4 intel-mpi/2019.4 intel-mkl/2019.4; conda activate TriboFlow
post_rocket: conda deactivate; module purge

This leads to a job file containing this after some other #SBATCH lines:

:
:
#SBATCH --output=TriboFlow.out
#SBATCH --error=TriboFlow.error

module purge; module load intel/19.0.4 intel-mpi/2019.4 intel-mkl/2019.4; conda activate TriboFlow
cd /gpfs/data/fs71411/mwo3/WORK/block_2021-07-02-15-04-31-814900/launcher_2021-07-19-12-13-15-820625
rlaunch -c /home/fs71411/mwo3/FireWorks/config rapidfire --timeout 172800
conda deactivate; module purge

# CommonAdapter (SLURM) completed writing Template
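Since you are submitting from Python, you should also be able to pass the same fields as keyword arguments to CommonAdapter instead of editing the YAML file. A minimal sketch, where the queue name, the activation commands and the environment path are only placeholders for whatever your own setup needs:

from fireworks.core.fworker import FWorker
from fireworks.queue.queue_launcher import launch_rocket_to_queue
from fireworks.user_objects.queue_adapters.common_adapter import CommonAdapter

qadapter = CommonAdapter(
    "SLURM",
    "fireworks_queue",
    rocket_launch="rlaunch singleshot",
    # placeholder: replace with the command that activates your own environment
    pre_rocket="source /path/to/your/venv/bin/activate",
    post_rocket="deactivate",
)
launch_rocket_to_queue(launchpad, FWorker(), qadapter, reserve=False)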

I hope that this is the issue, and it really looks like it, since your cluster cannot find the rlaunch command.

Cheers, Michael

Yes, you understood the issue correctly. Thank you so much for the solution and for your time. I will try it out. :slight_smile:

Thank me once you tried it and it worked! :wink:

Thank you so so much. It worked :slight_smile: :+1:
