Error when FireWorks tries to submit jobs to the SLURM queue

I am running software simulations using FireWorks. The jobs I submit depend on each other. After several hours of running jobs successfully, I get the following error:

2022-09-15 03:17:45,314 ERROR ----|vvv|----
2022-09-15 03:17:45,315 ERROR Error writing/submitting queue script!
2022-09-15 03:17:45,332 ERROR Traceback (most recent call last):
File "/.pyenv/versions/3.8.5/envs/wcEcoli3/lib/python3.8/site-packages/fireworks/queue/queue_launcher.py", line 150, in launch_rocket_to_queue
raise RuntimeError(
RuntimeError: queue script could not be submitted, check queue script/queue adapter/queue server status!

2022-09-15 03:17:45,332 ERROR ----|^^^|----
2022-09-15 03:18:53,659 ERROR ----|vvv|----
2022-09-15 03:18:53,659 ERROR Error with queue launcher rapid fire!
2022-09-15 03:18:53,660 ERROR Traceback (most recent call last):
File ".pyenv/versions/3.8.5/envs/wcEcoli3/lib/python3.8/site-packages/fireworks/queue/queue_launcher.py", line 270, in rapidfire
raise RuntimeError("Launch unsuccessful!")
RuntimeError: Launch unsuccessful!
2022-09-15 03:18:53,660 ERROR ----|^^^|----

My system administrator says this likely happens because FireWorks submits jobs to the queue so frequently that, when there is an occasional network hiccup, it tries to queue a job while it cannot reach the SLURM server. My question is whether there is a way to automatically retry the submission a few minutes after this error occurs.
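To make the idea concrete, here is a minimal retry-with-delay sketch of what I have in mind. The names (`submit_with_retry`, `flaky_submit`) are illustrative, not FireWorks API; the real submission call would go where `submit()` is invoked:

```python
import time


def submit_with_retry(submit, max_attempts=5, wait_seconds=120):
    """Call submit() and retry after a delay if it raises.

    `submit` stands in for whatever actually pushes the job to SLURM;
    this is a generic retry sketch, not a FireWorks function.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return submit()
        except RuntimeError:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            time.sleep(wait_seconds)


# Demo: a flaky "submitter" that fails twice, then succeeds.
calls = {"n": 0}

def flaky_submit():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("queue script could not be submitted")
    return "job-id-42"

print(submit_with_retry(flaky_submit, wait_seconds=0))  # -> job-id-42
```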

Thank you!

Hi,

I think you could try tuning the number of jobs that qlaunch submits to the queue so that it does not overload SLURM. The first parameter I have in mind is the one that limits the number of jobs in the queue (-m), but have a look at the --nlaunches and --sleep options as well.
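For example, a throttled rapid-fire launch might look like this (the values are illustrative; adjust them to your cluster):

```shell
# Keep at most 50 jobs in the SLURM queue at a time, and wait 300 s
# between launch rounds (-m, --nlaunches, --sleep are qlaunch
# rapidfire options).
qlaunch rapidfire -m 50 --nlaunches infinite --sleep 300
```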

To resubmit those jobs (assuming they are FIZZLED), just use
lpad rerun_fws -s FIZZLED, or look at the help to fine-tune this as well.

Hope this helps.

Hi,

Thank you for your reply! I am already tuning the number of jobs submitted to the queue; I tried 30, 50, and 70 jobs, and the same thing happens. I also tried tuning --nlaunches and --sleep, and it didn't change anything. I think the cluster is not very stable, so even a few seconds of downtime is enough to stop everything.

Resubmitting worked well. Thanks for the advice!

Hello,

I have a problem with the rerun_fws command as well now. When I try to run lpad rerun_fws -s FIZZLED --task-level --copy-data, I get the following error:

File ".pyenv/versions/zzz/bin/lpad", line 8, in <module>
sys.exit(lpad())
File ".pyenv/versions/3.8.5/envs/zzz/lib/python3.8/site-packages/fireworks/scripts/lpad_run.py", line 1551, in lpad
args.func(args)
File ".pyenv/versions/3.8.5/envs/zzz/lib/python3.8/site-packages/fireworks/scripts/lpad_run.py", line 641, in rerun_fws
lp.rerun_fw(int(f), recover_launch=l, recover_mode=args.recover_mode)
File ".pyenv/versions/3.8.5/envs/zzz/lib/python3.8/site-packages/fireworks/core/launchpad.py", line 1695, in rerun_fw
recovery = self.get_recovery(fw_id, recover_launch)
File ".pyenv/versions/3.8.5/envs/zzz/lib/python3.8/site-packages/fireworks/core/launchpad.py", line 1743, in get_recovery
recovery.update({"_prev_dir": launch.launch_dir, "_launch_id": launch.launch_id})
AttributeError: 'NoneType' object has no attribute 'update'

Any ideas why this might be happening?
Thank you very much for all your help!