Trouble with qlaunch tutorial on remote server

Hi,

I am trying to get FireWorks qlaunch running on a remote server and am having issues. The remote server uses Slurm to manage jobs. I am working through the queue tutorial ("Launch Rockets through a queue", FireWorks 1.9.7 documentation).

I have set up the directory as specified in the tutorial, but when I try to run qlaunch singleshot I get the following error.

$ qlaunch singleshot
Found many potential paths for LAUNCHPAD_LOC: ['/home/dda/fw_things/fun/queue/my_launchpad.yaml', '/home/dda/fw_things/fireworks/my_launchpad.yaml']
Choosing as default: /home/dda/fw_things/fun/queue/my_launchpad.yaml
Traceback (most recent call last):
  File "/home/dda/miniconda3/envs/fw37/bin/qlaunch", line 33, in <module>
    sys.exit(load_entry_point('FireWorks', 'console_scripts', 'qlaunch')())
  File "/home/dda/fw_things/fireworks/fireworks/scripts/qlaunch_run.py", line 224, in qlaunch
    do_launch(args)
  File "/home/dda/fw_things/fireworks/fireworks/scripts/qlaunch_run.py", line 62, in do_launch
    queueadapter = load_object_from_file(args.queueadapter_file)
  File "/home/dda/fw_things/fireworks/fireworks/utilities/fw_serializers.py", line 391, in load_object_from_file
    f_format = filename.split('.')[-1]
AttributeError: 'NoneType' object has no attribute 'split'

I have previously had success running simple, single-core jobs on the remote server by connecting to a MongoDB Atlas database. I have verified that my configurations are valid, and I am able to add and launch fireworks using the lpad and launch commands.
For example:

(fw37) dda at hpc → [~/fw_things/fun/queue]
$ lpad get_fws
Found many potential paths for LAUNCHPAD_LOC: ['/home/dda/fw_things/fun/queue/my_launchpad.yaml', '/home/dda/fw_things/fireworks/my_launchpad.yaml']
Choosing as default: /home/dda/fw_things/fun/queue/my_launchpad.yaml
{
    "fw_id": 1,
    "created_on": "2021-04-16T21:37:31.502195",
    "updated_on": "2021-04-16T21:37:31.502442",
    "state": "READY",
    "name": "Unnamed FW"
}

I am currently running Python 3.7 because 3.8 and 3.9 had some issues with FireWorks.

$ python
Python 3.7.9 (default, Aug 31 2020, 12:42:55) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 

I have also tried running a simple FireWorks command through Slurm. That also runs without error and connects to the remote database.

$ cat slurm_test.sh 
#!/bin/bash
#SBATCH -p high,med,low
#SBATCH --job-name=XX
#SBATCH --output=job.out
#SBATCH --error=job.err
#SBATCH --time=00:01:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1 
lpad get_fws
(fw37) dda at hpc1 → [~/fw_things/fun/queue]
$ cat job.out 
==========================================
SLURM_JOB_ID = 5273485
SLURM_NODELIST = c7-31
CUDA_VISIBLE_DEVICES = 
==========================================
Found many potential paths for LAUNCHPAD_LOC: ['/home/dda/fw_things/fun/queue/my_launchpad.yaml', '/home/dda/fw_things/fireworks/my_launchpad.yaml']
Choosing as default: /home/dda/fw_things/fun/queue/my_launchpad.yaml
{
    "fw_id": 1,
    "created_on": "2021-04-16T21:24:53.824024",
    "updated_on": "2021-04-16T21:24:53.824383",
    "state": "READY",
    "name": "Unnamed FW"
}

===========================================================================
Job Finished

Name                : XX
User                : dda
Partition           : high
Nodes               : c7-31
Cores               : 1
State               : COMPLETED
Submit              : 2021-04-16T14:33:02
Start               : 2021-04-16T14:33:02
End                 : 2021-04-16T14:33:09
Reserved walltime   : 00:01:00
Used walltime       : 00:00:07
Used CPU time       : 00:00:01
% User (Computation): 73.21%
% System (I/O)      : 26.72%
Mem reserved        : 0/node
Max Mem used        : 0.00  (c7-31)
Max Disk Write      : 0.00  (c7-31)
Max Disk Read       : 0.00  (c7-31)

I believe the issue is related to qlaunch not recognizing that I have a remote database, but I am not sure how to correct the error. Does anyone have any idea how to fix this issue?

I was able to solve the error by reviewing the help information for the qlaunch command (i.e., I read through the output of qlaunch -h).

In the tutorial I linked to above, it states that a job can be submitted to the queue using the following command.

Submit a job

Try submitting a job using the command:

`qlaunch singleshot`

This command produces an error because no qadapter file is found, so the file name passed to the loader is None, and Python then fails when it tries to call split on it.

To solve this issue you simply need to specify the location of the qadapter file with the -q flag when running qlaunch.

The following command works for me

qlaunch -q qadapter_slurm.yaml singleshot

I believe that the tutorial should be updated so that the location of the qadapter file is specified explicitly with the -q flag. It might also be useful to throw a descriptive error when this file is not found.
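To illustrate the kind of descriptive error I mean, here is a minimal sketch in Python. The function name detect_format and the wording of the message are my own and are hypothetical; this is not the actual FireWorks code, only a stand-in for the line shown in the traceback (filename.split('.')[-1]).

```python
def detect_format(filename):
    """Return the file extension used to pick a parser (hypothetical helper)."""
    if filename is None:
        # Guard added for illustration: fail early with an actionable message.
        raise ValueError(
            "No queue adapter file was given or found; pass one explicitly, "
            "e.g. qlaunch -q qadapter_slurm.yaml singleshot"
        )
    return filename.split(".")[-1]


# Without a guard, a missing file name fails the same way as in the traceback.
missing = None
try:
    missing.split(".")
except AttributeError as exc:
    unguarded_error = str(exc)

# With the guard, the message points at the fix instead.
try:
    detect_format(missing)
except ValueError as exc:
    guarded_error = str(exc)

print(unguarded_error)
print(guarded_error)
print(detect_format("qadapter_slurm.yaml"))
```

The only point of the sketch is that checking for None before splitting lets the message name the missing file and the -q flag, rather than surfacing an AttributeError from deep inside the loader.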

I finally got a rocket to launch using qlaunch!

Another issue that I ran into was that the conda environment was not activated automatically when running a slurm job through qlaunch. Typically when I launch a slurm job in a conda environment it uses the current active conda environment to run the job. When I use qlaunch this does not happen. The workaround I found was to add source activate fireworksEnv to my .bashrc so that the environment is activated by default when slurm runs.

Typically when I launch a slurm job in a conda environment it uses the current active conda environment to run the job. When I use qlaunch this does not happen. The workaround I found was to add source activate fireworksEnv to my .bashrc so that the environment is activated by default when slurm runs.

In your my_qadapter.yaml there is a field called pre_rocket which can be a list of commands to run before the rocket itself is launched. This is a good place to put, e.g. your conda activate command.

Thanks mkhorton. I used the source activate fw37 command in the pre_rocket field and it runs without putting the command in my .bashrc.

This is what my working my_qadapter.yaml looks like.

_fw_name: CommonAdapter
_fw_q_type: SLURM
ntasks: 1
rocket_launch: rlaunch -w /home/dda/fw_things/fun/q2/my_fworker.yaml -l /home/dda/fw_things/fun/q2/my_launchpad.yaml singleshot
cpus_per_task: 1
ntasks_per_node: 1
walltime: '00:02:00'
queue: null
account: null
job_name: null
logdir: /home/dda/fw_things/fun/q2/logging
pre_rocket: source activate fw37
post_rocket: null
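
For anyone wondering why pre_rocket helps, here is a rough Python sketch of how I picture the queue script being put together. The template below is a simplified stand-in that I wrote for illustration, not the actual FireWorks SLURM template, and render_script is a hypothetical helper. It assumes that pre_rocket is written into the script before rocket_launch and that null fields come out empty.

```python
from string import Template

# Simplified stand-in for a queue script template (NOT the real FireWorks one).
SCRIPT_TEMPLATE = Template(
    "#!/bin/bash\n"
    "#SBATCH --ntasks=$ntasks\n"
    "#SBATCH --time=$walltime\n"
    "$pre_rocket\n"
    "$rocket_launch\n"
    "$post_rocket\n"
)


def render_script(settings):
    """Fill the template; fields set to null (None) become empty lines."""
    filled = {key: "" if value is None else value for key, value in settings.items()}
    return SCRIPT_TEMPLATE.substitute(filled)


# Values copied from the my_qadapter.yaml above.
settings = {
    "ntasks": 1,
    "walltime": "00:02:00",
    "pre_rocket": "source activate fw37",
    "rocket_launch": (
        "rlaunch -w /home/dda/fw_things/fun/q2/my_fworker.yaml "
        "-l /home/dda/fw_things/fun/q2/my_launchpad.yaml singleshot"
    ),
    "post_rocket": None,
}

script = render_script(settings)
print(script)
```

In this sketch the source activate fw37 line lands directly above the rlaunch line, so the environment is active by the time the rocket starts, without anything in .bashrc.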