Running FireWorks on NERSC's Hopper compute nodes

I will be testing FireWorks by running very simple binaries on Hopper's compute nodes in various workflow structures (sequential, parallel, etc.). My methodology is to create and upload a FireWork which (via ScriptTasks) changes to the desired directory and runs the binary. To retrieve a FireWork from the Worker (here, Hopper), the FireWorks documentation recommends submitting a qsub script (via qlaunch) which calls rlaunch from the MOM node. Here is the script I am using, based on qlaunch's template:


#PBS -l mppwidth=24
#PBS -l walltime=00:01:00
#PBS -q debug
#PBS -N FW_job
#PBS -d /global/homes/s/smandala/fireworks_code/test_code/queue_scripts/hopper_results
#PBS -o FW_job.out
#PBS -e FW_job.error

cd $HOME/fireworks_code/test_code/queue_scripts/hopper_results
rlaunch -w $HOME/fireworks_code/config_files/my_fworker.yaml -l $HOME/fireworks_code/config_files/my_launchpad.yaml rapidshot

# CommonAdapter (PBS) completed writing Template


To my knowledge, though, to use Hopper's compute nodes, jobs need to be run through the "aprun" command; calling rlaunch from the MOM node (seemingly) only runs the FireWorks on the MOM node itself and does not push the job to the compute nodes.

Overall, I am curious about how to run my FireTasks on the compute nodes. Ideally, I would like to have multiple compute nodes running in "rapid-fire" mode to retrieve and complete the pending FireWorks (for workflows involving parallelizable jobs). I can think of a couple of ways to get the FireWorks pushed to the compute nodes, but I am not sure whether my methodology or any of these solutions would be optimal. Any information or tips on how to go about this would be a big help.


As you mentioned, by default a regular (non-MPI) script on Hopper runs on the MOM node and not on the compute nodes. This is the case for rlaunch, but it is also independent of FireWorks and is simply how Hopper operates: any non-MPI script run on the MOM node will not go to the compute nodes.

If you want to run on the compute nodes of Hopper, you must use the aprun command at some point in your execution. There are a few ways in which you can do this:

i) If you already have a parallelized code, simply program your ScriptTask (or PyTask, or a custom task) to call "aprun -n <NP> <my_code>" as its command. In this case, FireWorks will still pull the job on the MOM node, but the work will be transferred to the compute nodes as soon as aprun is called on your code.
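As a minimal sketch, this is roughly what such a ScriptTask command string could look like (the rank count and binary path here are hypothetical placeholders, not from your setup):

```shell
# Sketch: the command a ScriptTask could run. FireWorks executes this on the
# MOM node, and aprun hands the actual execution off to the compute nodes.
NP=24                          # hypothetical: number of MPI ranks
BINARY=./my_parallel_code      # hypothetical: your MPI-enabled binary
CMD="aprun -n $NP $BINARY"     # -n sets the total number of ranks
echo "$CMD"
```

You would then pass this command string to the ScriptTask instead of invoking the binary directly.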

ii) If you only have a serial script that cannot be run with aprun/mpirun, you can still do the above with NP=1; you'll just waste the rest of the CPUs on the node.

iii) You can also achieve low-level parallelization of serial scripts by calling aprun on the rlaunch command in the script you attached. For more details on this and other parallelization modes, see the FireWorks documentation.
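As a rough sketch of option (iii), the rlaunch line in your qsub script could be wrapped with aprun so that each compute node pulls and runs its own FireWorks. The instance counts are illustrative placeholders and the exact flags depend on your allocation; the config paths are taken from your script:

```shell
# Sketch: run rlaunch itself under aprun inside the queue script.
# -n 4 = four rlaunch instances in total (placeholder),
# -N 1 = at most one instance per compute node (Cray aprun placement flag).
WORKER=$HOME/fireworks_code/config_files/my_fworker.yaml
LPAD=$HOME/fireworks_code/config_files/my_launchpad.yaml
CMD="aprun -n 4 -N 1 rlaunch -w $WORKER -l $LPAD rapidfire"
echo "$CMD"
```

Each rlaunch instance then runs in rapidfire mode on its own node, repeatedly pulling pending FireWorks from the LaunchPad until none remain.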

Note also that "rapidshot" in your example script is not a valid rlaunch mode; it should be either singleshot or rapidfire.

Hope that helps!