I think my title was poorly phrased.
We have handled resource requests within the grid compute environment by adding -R to your supplied qsub/bsub headers and supplying pre-formatted resource strings along with the jobs. This seems to remove the need to create or name a static FireWorks job within the queue system, since the resource request will find a node with the required resources and the job will execute there. Eventually we might go further and accept memory/disk/arbitrary_resource as separate variables within the Firework object and construct the strings for the user, but for now this seems to work.
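For what it's worth, the "construct the strings for the user" idea could look roughly like the sketch below. It is plain Python and does not touch FireWorks; the function names are made up for illustration, and the rusage[...] layout and the resource names (mem, tmp) are assumptions that would need checking against the site's scheduler configuration.

```python
def build_rusage(resources):
    """Format a dict such as {"mem": 4096} as an LSF-style rusage[...] section.

    The "name=value" pairs joined by ":" are an assumed layout; resource
    names and units are site-specific.
    """
    if not resources:
        return ""
    parts = [f"{name}={value}" for name, value in resources.items()]
    return "rusage[" + ":".join(parts) + "]"


def bsub_resource_args(resources):
    """Return the extra bsub arguments for a job, or an empty list if none are needed."""
    rusage = build_rusage(resources)
    return ["-R", rusage] if rusage else []


# Hypothetical per-job request: memory and scratch space.
print(bsub_resource_args({"mem": 4096, "tmp": 10240}))
```

The user would only ever supply the dict; the string formatting stays in one place, so it can be changed per scheduler without touching the workflows.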
I think we can use the system as-is with that slight tweak, plus the queue launcher in infinite rapidfire mode, to launch production workflows: simply add a workflow to the production collection and wait. As you say, though, a queue launcher that can relaunch a specific workflow id (not just a firework id) would be desirable in many circumstances. We can probably wait for the fork/patch to develop before deciding whether we need to add more code of our own.
Another use case, though, is to take advantage of FireWorks DAG parsing but run entirely on one node. That could be an interactive shell (e.g. “bsub -Is bash”, then “my_fireworks_workflow.py”, and wait) or a batch submission (“bsub my_fireworks_workflow.py”) that runs all your jobs in serial but respects the workflow dependencies.
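To make "in serial but respecting the dependencies" concrete, here is a minimal sketch using only the Python standard library. It is not how FireWorks does it internally; it just shows the behaviour we are after. The step names and commands are placeholders.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
import subprocess
import sys


def run_serial(commands, depends_on):
    """Run each command once, one at a time, in dependency order.

    commands:   {step_name: argv list}
    depends_on: {step_name: set of step names that must finish first}
    Returns the step names in the order they ran.
    """
    graph = {name: depends_on.get(name, set()) for name in commands}
    order = []
    for name in TopologicalSorter(graph).static_order():
        subprocess.run(commands[name], check=True)  # raises on the first failure
        order.append(name)
    return order


# Placeholder three-step workflow: align -> call -> report.
py = sys.executable
commands = {
    "report": [py, "-c", "print('report')"],
    "call": [py, "-c", "print('call')"],
    "align": [py, "-c", "print('align')"],
}
depends_on = {"call": {"align"}, "report": {"call"}}
print(run_serial(commands, depends_on))
```

Everything runs inside the one process that bsub started, so the scheduler only ever sees a single job.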
For those, we specifically never want them to be picked up by the queue launcher in infinite mode. From the documentation you sent, I guess we could assign them a UUID as a category and then force the FireWorker that comes up for serial processing to match that UUID.
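A sketch of the UUID-as-category idea, again in plain Python so it runs without a LaunchPad. The "_category" key is my reading of the documentation; the matching function is a simplified stand-in for whatever query the real worker issues, and the helper names are invented.

```python
import uuid


def tag_for_serial_run(specs):
    """Return (category, tagged copies of specs) that all share one fresh category."""
    category = uuid.uuid4().hex
    tagged = [{**spec, "_category": category} for spec in specs]
    return category, tagged


def worker_accepts(worker_category, spec):
    """Simplified matching rule: take a job only if its category equals the worker's."""
    return spec.get("_category") == worker_category


category, tagged = tag_for_serial_run([{"step": "align"}, {"step": "call"}])
production_spec = {"step": "nightly"}  # hypothetical job with no category

print(all(worker_accepts(category, s) for s in tagged))  # serial worker takes its own jobs
print(worker_accepts(category, production_spec))         # and ignores everything else
```

This only covers one direction. The production worker would also need a rule that excludes categorized jobs, and how FireWorks treats a worker with no category set is something we would have to confirm.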
Another piece of this puzzle I might have explained poorly is that we want a production section and a user section. We don’t ever want a user to be able to lpad reset production, and we probably don’t want user workflows to be picked up by the generic production rapidfire in infinite mode. All user workflows should be “try once, then report back”, with “manual rerun” as an option. We’ll probably obscure the FireWorks interface with helper code to accomplish this, exposing a light command-line front end that gives users choices like “run in serial” and “run on grid.”
So I think grabbing a user’s Unix ID and forcing them into a collection of their own is probably a good idea. For that to work in the long term, though, we would still want “queue launch this workflow id”, “serial launch this workflow id” and “retry this workflow id”, and the equivalent for the queue launcher.
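Roughly what I have in mind for the per-user separation, as a sketch only. The names (fireworks_production, the fireworks_user_ prefix, both functions) are hypothetical, and the real helper would hand the result to whatever opens the LaunchPad.

```python
import getpass
import re

PRODUCTION = "fireworks_production"  # hypothetical name for the production section


def user_namespace(username=None):
    """Derive a per-user name from the Unix login, defaulting to the current user."""
    username = username or getpass.getuser()
    # Replace characters that could be awkward in a database or collection name.
    safe = re.sub(r"[^A-Za-z0-9_]", "_", username)
    return f"fireworks_user_{safe}"


def check_reset_allowed(namespace):
    """Refuse a reset against production; allow it for any other namespace."""
    if namespace == PRODUCTION:
        raise PermissionError("refusing to reset the production namespace")
    return True


print(user_namespace("j.doe"))
print(check_reset_allowed(user_namespace("j.doe")))
```

Since a user namespace always carries the prefix, it can never equal the production name, so the front end cannot be talked into resetting production by a creative login name.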
I’ll keep playing around with it and check back in if needed as we firm up our own ideas or after the queue code updates. Thanks!
CCH