Error when Fireworks tries to submit jobs on SLURM queue

I am running simulations with FireWorks. The jobs that I submit depend on each other. After several hours of running jobs successfully, I get the following error:

2022-09-15 03:17:45,314 ERROR ----|vvv|----
2022-09-15 03:17:45,315 ERROR Error writing/submitting queue script!
2022-09-15 03:17:45,332 ERROR Traceback (most recent call last):
File "/.pyenv/versions/3.8.5/envs/wcEcoli3/lib/python3.8/site-packages/fireworks/queue/", line 150, in launch_rocket_to_queue
raise RuntimeError(
RuntimeError: queue script could not be submitted, check queue script/queue adapter/queue server status!

2022-09-15 03:17:45,332 ERROR ----|^^^|----
2022-09-15 03:18:53,659 ERROR ----|vvv|----
2022-09-15 03:18:53,659 ERROR Error with queue launcher rapid fire!
2022-09-15 03:18:53,660 ERROR Traceback (most recent call last):
File ".pyenv/versions/3.8.5/envs/wcEcoli3/lib/python3.8/site-packages/fireworks/queue/", line 270, in rapidfire
raise RuntimeError("Launch unsuccessful!")
RuntimeError: Launch unsuccessful!
2022-09-15 03:18:53,660 ERROR ----|^^^|----
2022-09-29 02:00:29,931 ERROR ----|vvv|----

My system administrator says this likely happens because FireWorks submits jobs to the queue so frequently that, during an occasional network hiccup, it tries to submit while it cannot reach the Slurm server. My question is whether there is a way to retry the submission after a few minutes when this error occurs.

Thank you!


I think you could try tuning the number of jobs that qlaunch submits to the queue so that it does not overload SLURM. The first parameter I have in mind is the one that limits the number of jobs in the queue (-m).
But have a look at the --nlaunches and --sleep options as well.
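For example, a rapidfire invocation combining those three options might look like this (the flag names come from qlaunch rapidfire; the values are only illustrative, so adjust them to your cluster):

```shell
# Keep at most 50 jobs in the SLURM queue, run indefinitely,
# and wait 5 minutes between checks of the queue
qlaunch rapidfire -m 50 --nlaunches infinite --sleep 300
```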

To resubmit those jobs (assuming they are FIZZLED), just use
lpad rerun_fws -s FIZZLED, or look at the help (lpad rerun_fws -h) to fine-tune this as well.

Hope this helps.
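If the submission still fails during network hiccups even with tuned parameters, one workaround is to wrap the qlaunch call in a retry loop outside of FireWorks. This is only a sketch, not a built-in FireWorks feature; the command, retry count, and delay below are placeholders to adapt:

```python
import subprocess
import time

def run_with_retries(cmd, retries=5, delay=300):
    """Run cmd, retrying up to `retries` times with `delay` seconds between attempts.

    Returns True if the command eventually exits with code 0, False otherwise.
    """
    for attempt in range(1, retries + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return True
        print(f"Attempt {attempt} failed (exit code {result.returncode}); "
              f"retrying in {delay} s")
        time.sleep(delay)
    return False

# Example (placeholder command; substitute your real qlaunch invocation):
# run_with_retries(["qlaunch", "rapidfire", "-m", "50"], retries=3, delay=180)
```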


Thank you for your reply! I am already tuning the number of jobs submitted to the queue. I tried with 30, 50, and 70 jobs, and the same thing happens. I also tried tuning --nlaunches and --sleep, and it didn't change anything. I think the cluster is not very stable, so even if it goes down for a few seconds, that is enough to stop everything.

Resubmitting worked well. Thanks for the advice!


Now I have a problem with the rerun_fws command as well. When I try to run lpad rerun_fws -s FIZZLED --task-level --copy-data, I get the following error:

File ".pyenv/versions/zzz/bin/lpad", line 8, in <module>
File ".pyenv/versions/3.8.5/envs/zzz/lib/python3.8/site-packages/fireworks/scripts/", line 1551, in lpad
File ".pyenv/versions/3.8.5/envs/zzz/lib/python3.8/site-packages/fireworks/scripts/", line 641, in rerun_fws
lp.rerun_fw(int(f), recover_launch=l, recover_mode=args.recover_mode)
File ".pyenv/versions/3.8.5/envs/zzz/lib/python3.8/site-packages/fireworks/core/", line 1695, in rerun_fw
recovery = self.get_recovery(fw_id, recover_launch)
File ".pyenv/versions/3.8.5/envs/zzz/lib/python3.8/site-packages/fireworks/core/", line 1743, in get_recovery
recovery.update({"_prev_dir": launch.launch_dir, "_launch_id": launch.launch_id})
AttributeError: 'NoneType' object has no attribute 'update'

Any ideas why this might be happening?
Thank you very much for all your help!