I think I understand your use case, but I don’t think we can support that at this time. Keeping track of every place where a queue submission failed for a job would add a lot of complexity to FWS (which would then need to be maintained indefinitely), all to work around a problem with the user’s setup.
I would be supportive of some code to validate some PBS parameters before trying to submit the script if you have any ideas on how to go about that.
Finally, I examined the code because I wanted to adjust things so that (i) qlaunch would exit if it has trouble submitting a job, rather than continuing to poll and try to submit things, and (ii) a job would not show up in “recover_offline” if the queue submission failed. But, when inspecting the code, it looks like these things should already be taken care of:
(i) if the job submission failed, the code should have raised a RuntimeError saying “Launch unsuccessful” which should have quit out of the qlaunch script. Could you let me know exactly (a) what command you are using for qlaunch and (b) what is the content of your error log? The only problem I can see is if you are running in remote / daemon mode (neither of which I use personally).
(ii) The job state of READY is the correct state for a queue submission error. The job itself is still ready to go, and I purposely rolled back the job state to reflect this. However, the job’s entry in the list of offline runs in the database also needed to be removed so that the “recover_offline” command does not search for these jobs. I just pushed a patch for this in FW1.3.5. Note that for older runs, you will need to use the “lpad forget_offline” command to manually forget the affected FWS. Sorry about that.
On Thursday, August 25, 2016 at 12:34:34 PM UTC-7, [email protected] wrote:
I think our main need in this sort of event is to get some kind of error information to users - especially if the error is due to bad job params. Not having the job resubmit would make sense, but we wouldn’t want to stop qlaunch, as there may be other (unrelated) jobs that could be run. (It’s a bit of a pain for our users to start the qlaunch daemon.) Instead of marking the job fizzled, would it be possible to annotate the Firework task description somehow (with error info)? We’re using the category field to direct tasks to specific machines (titan, rhea, etc), so if we could change that field, qlaunch would no longer “see” the ready task that failed. Just some ideas… we could also just validate all of the user’s PBS parameters beforehand to try to avoid these kinds of errors in the first place.
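As a rough sketch of what validating beforehand might look like (this is not FireWorks code; the function name validate_pbs_params, the parameter keys "walltime", "nodes" and "queue", and the allowed-queue list are all assumptions made up for illustration, and a real check would need to match whatever the site's PBS setup actually accepts):

```python
import re

# Assumed walltime format: H...:MM:SS (hypothetical; sites may allow other forms).
WALLTIME_RE = re.compile(r"^\d+:[0-5]\d:[0-5]\d$")


def validate_pbs_params(params, allowed_queues=None):
    """Return a list of human-readable problems found in a dict of PBS params.

    An empty list means no problems were detected by these (illustrative) checks.
    """
    errors = []

    # Walltime must be present and look like HH:MM:SS.
    walltime = params.get("walltime")
    if walltime is None:
        errors.append("walltime is missing")
    elif not WALLTIME_RE.match(str(walltime)):
        errors.append(f"walltime {walltime!r} is not in HH:MM:SS form")

    # Node count must be a positive integer (bool is rejected explicitly).
    nodes = params.get("nodes")
    if not isinstance(nodes, int) or isinstance(nodes, bool) or nodes < 1:
        errors.append(f"nodes must be a positive integer, got {nodes!r}")

    # Queue name is only checked when a list of allowed queues is supplied.
    queue = params.get("queue")
    if allowed_queues is not None and queue not in allowed_queues:
        errors.append(f"queue {queue!r} is not one of {sorted(allowed_queues)}")

    return errors


if __name__ == "__main__":
    bad = {"walltime": "90 minutes", "nodes": 0, "queue": "debug"}
    for problem in validate_pbs_params(bad, allowed_queues={"batch"}):
        print(problem)
```

Returning a list of messages rather than raising on the first problem would let us show the user everything that is wrong with a job in one go, before anything is handed to the queue.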