Hello,
Over the weekend I submitted just over 500 jobs to Fireworks (this is the largest pipeline I've tried to date) and executed them using:
qlaunch -r rapidfire --nlaunches infinite --sleep 60 --maxjobs_queue 50
All but 6 of them completed successfully, and I'm trying to figure out what happened to those 6. If I try "qlaunch" or "rlaunch", neither command recognizes that there are a few jobs left to complete. For example:
$ qlaunch -r singleshot
2015-09-14 10:47:45,974 INFO No jobs exist in the LaunchPad for submission to queue!
Here are some (hopefully) relevant details. I'm happy to provide more.
$ lpad get_fws -d count
514
$ lpad get_fws -s COMPLETED -d count
508
$ lpad get_fws -s WAITING -d count
6
$ lpad get_fws -s FIZZLED -d count
0
Which Fireworks are waiting?
$ lpad get_fws -s WAITING | grep fw_id | sort
"fw_id": 32,
"fw_id": 33,
"fw_id": 34,
"fw_id": 35,
"fw_id": 36,
"fw_id": 37,
What was the Firework with fw_id 31, and what happened to it?
$ lpad get_fws -i 31 -d more
[See https://gist.github.com/anonymous/10ea08044f574d190625]
Looking in the launch directory for fw_id 31 (and at the YAML file I used to submit my workflow), as far as I can tell the Firework with fw_id 31 is the only dependency (parent) of the Firework with fw_id 32.
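In FireWorks, a WAITING Firework is normally promoted to READY once all of its parents are COMPLETED, so if fw_id 31 really is the only parent of fw_id 32, then 32 should not still be WAITING. A minimal sketch of that check follows. The helper `stuck_waiting` is hypothetical and not part of FireWorks. It assumes you build the `states` and `links` dictionaries by hand from the `lpad get_fws` output above, and the links for fw_ids 33-37 in the usage example are invented purely for illustration.

```python
from typing import Dict, List, Set


def stuck_waiting(states: Dict[int, str], links: Dict[int, List[int]]) -> List[int]:
    """Return WAITING fw_ids whose parents are all COMPLETED (hypothetical helper).

    states: fw_id -> state string, e.g. "COMPLETED" or "WAITING"
    links:  parent fw_id -> list of child fw_ids
    """
    # Invert the parent -> children links into child -> parents.
    parents: Dict[int, Set[int]] = {fw_id: set() for fw_id in states}
    for parent, children in links.items():
        for child in children:
            parents.setdefault(child, set()).add(parent)

    stuck = []
    for fw_id, state in states.items():
        if state != "WAITING":
            continue
        # All parents COMPLETED (or no parents at all) means this Firework
        # would normally have been promoted to READY, so flag it.
        if all(states.get(p) == "COMPLETED" for p in parents.get(fw_id, set())):
            stuck.append(fw_id)
    return sorted(stuck)


# Hypothetical data: 31 -> 32 as described above; 32 -> 33 is invented.
example_states = {31: "COMPLETED", 32: "WAITING", 33: "WAITING"}
example_links = {31: [32], 32: [33]}
print(stuck_waiting(example_states, example_links))  # [32]
```

Only fw_id 32 is flagged: fw_id 33 is still legitimately waiting on 32, so the stuck state would be confined to the first Firework after 31.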
Was a launch directory ever created for the Firework with fw_id 32? It appears not:
$ grep -rl '"fw_id": 32,' ./*
(the same is true for the other "waiting" Fireworks)
If I try to rerun these Fireworks, still no luck:
$ lpad rerun_fws -i 32
2015-09-14 10:50:56,652 INFO Finished setting 1 FWs to rerun
$ qlaunch -r singleshot
2015-09-14 10:52:37,092 INFO No jobs exist in the LaunchPad for submission to queue!
Is there anything else I should try/check/examine?
Thank you,
Derek