Job Chaining and Checkpointing

Hello all,

I often run long jobs that exceed, or come close to, the maximum queue time. The best way I've found to handle this in a pure bash (with SLURM) setting is checkpointing:

(1) Submit the job for a short time (e.g., 5 hours).

(2) Near the end of the job, detect that the walltime is almost reached, gracefully stop the job, and tell SLURM to requeue it **with the same job ID**.

(3) Let this checkpoint, requeue, run, checkpoint… cycle continue until an overall time limit is reached (e.g., 100 hours).

In bash this works well for very long jobs that cannot finish in a single submission, and it also improves throughput, since the shorter jobs move through the queue faster.
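
For concreteness, here is a minimal sketch of that walltime-watching loop in Python (not my actual script; `simulation_finished`, `advance_simulation_one_block`, and `write_checkpoint` are placeholders, and the only SLURM-specific pieces are the `SLURM_JOB_ID` environment variable and `scontrol requeue`):

```
import os
import subprocess
import time

WALLTIME_BUDGET_S = 5 * 3600   # walltime requested for this submission (e.g., 5 h)
SAFETY_MARGIN_S = 10 * 60      # stop ~10 min early to leave time for checkpointing

start = time.time()
while not simulation_finished():              # placeholder status check
    advance_simulation_one_block()            # placeholder unit of MD work
    if time.time() - start > WALLTIME_BUDGET_S - SAFETY_MARGIN_S:
        write_checkpoint()                    # placeholder checkpoint writer
        # requeue this job under the same SLURM job ID
        subprocess.run(["scontrol", "requeue", os.environ["SLURM_JOB_ID"]],
                       check=True)
        break
```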

The question is how best to integrate this process with FireWorks. My current solution is to put the bash script that requeues the job inside my my_qadapter.yaml config file. That works reasonably well, but it breaks down because either (a) each new checkpoint overwrites the files from the previous one, losing data about the run, or (b) each checkpoint's data has to be stored in its own folder (e.g., launcher/checkpoint_1). Sending the data to separate folders preserves it, but then you have to spend time assembling it all back into a single location. For example, in VASP MD runs, 10 checkpoints could produce 10 XDATCAR files that need to be stitched together in the main launch directory so that FireWorks points to a file containing the full trajectory, not just the trajectory from the most recent checkpoint.

Any thoughts on how to deal with this type of issue?

-Nick

Hey Nick,

I have two solutions for you.

The one requiring the least change from your current setup is to establish an exterior directory (outside the launch dirs) that holds all the data for a set of runs (i.e., one complete MD simulation), and store these directories somewhere in the Fireworks' specs so you can look them up later if needed. In your bash script, after a checkpoint is made, you could create a directory specific to this set of jobs (if it doesn't already exist), copy the checkpoint data there, make a queue submission, and so on. Then, when your MD simulation finishes completely, have your bash script consolidate the data in this exterior directory into a format you can easily read.
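
As a purely illustrative sketch of that bookkeeping (the directory names, simulation ID, and bash driver script are made up, not part of your setup):

```
import os
from fireworks import Firework, ScriptTask

DATA_ROOT = "/project/md_checkpoint_data"   # exterior directory, outside the launch dirs
sim_id = "system_Z_md_run"
checkpoint_data_dir = os.path.join(DATA_ROOT, sim_id)

# store the exterior directory in the spec so it can be looked up later
fw = Firework(ScriptTask.from_str("bash run_md_with_checkpointing.sh"),
              spec={"checkpoint_data_dir": checkpoint_data_dir},
              name=sim_id)
```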


The more maintainable solution is using dynamic workflows.

One way to implement this is with a larger workflow. If your runs right now are just one Firework (let's call it VASP_FW), your dynamic workflow might look like this:

VASP_FW1 - Runs, realizes job won’t finish in time. Checkpoints, dynamically adds new FW (VASP_FW2)

VASP_FW2 - Runs, realizes job won’t finish in time. Checkpoints, dynamically adds new FW (VASP_FW3)

… (process repeats)

VASP_FW_N - Runs, job finishes. Consolidates all the data from Fireworks VASP_FW 1 through N into the launch_dir for this Firework, so you have all the checkpoint data in one place (the launch_dir of the final FW).

This scheme will probably require you to write custom Firetasks (see here and here for more info) if you are not already doing so. The main con is some added complexity, but the pro is that once it's figured out you have much more flexibility. You can add new Fireworks to the workflow (through the "additions" argument of the FWAction object returned at the end of run_task in whatever Firetask runs your MD), and you can pass information, e.g., the directories of past checkpoints, to subsequent Fireworks (either through the new FW's spec, through the file-passing interface (files_in and files_out), or through the "mod_spec" or "update_spec" arguments to FWAction). Another perk is that you end up with one workflow for an entire MD run, rather than a bunch of separate Fireworks.
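
As a small, hedged illustration of those last two spec-passing mechanisms (separate from the full pseudocode below), an FWAction could push the current checkpoint directory onto a list in the children's specs, or simply overwrite a key:

```
from fireworks import FWAction

action = FWAction(
    # append this launch's checkpoint dir to the "checkpoint_dirs" list in child specs
    mod_spec=[{"_push": {"checkpoint_dirs": "/path/to/this/checkpoint"}}],
    # or overwrite a single key outright:
    # update_spec={"last_checkpoint_dir": "/path/to/this/checkpoint"},
)
```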

The Python pseudocode for your Firetask and Firework(s) could look something like this:

```
from fireworks import FireTaskBase, Firework, FWAction, LaunchPad, Workflow
from fireworks.utilities.fw_utilities import explicit_serialize


@explicit_serialize
class RunMDDynamicTask(FireTaskBase):
    def run_task(self, fw_spec):
        # directories of checkpoints made by previous Fireworks in this chain
        prev_checkpoint_dirs = fw_spec.get("checkpoint_dirs", [])

        # run commands for VASP MD, checking walltime, creating checkpoint, etc.
        ...

        if job_finished:
            consolidate_checkpoints_to_this_dir(prev_checkpoint_dirs)
            return FWAction()
        else:
            # hand the accumulated checkpoint dirs (and any other needed params)
            # to the next FW through its spec
            new_fw = Firework(RunMDDynamicTask(),
                              spec={"checkpoint_dirs": prev_checkpoint_dirs})
            return FWAction(additions=new_fw)


if __name__ == "__main__":
    launchpad = LaunchPad()  # or LaunchPad.auto_load()
    vasp_fw1 = Firework(RunMDDynamicTask())
    wf = Workflow([vasp_fw1], name="MD Run for System Z")
    launchpad.add_wf(wf)
```

You'll notice there is no queue submission in the above workflow description. This is because I'd recommend having a cron job make queue submissions for you automatically (e.g., every 12 hours), completely separate from the operation of the workflow above; mixing workflow execution and queue submission tends to be confusing, at least for me. By having crontab submit your jobs automatically, as soon as one of your FWs finishes and the next one is "READY", a queue submission you made previously will pull and run the next job. This is much faster than waiting around for old jobs to finish before making queue submissions for new jobs, but it will not preserve the job ID, AFAIK (I'm not sure why that would be needed, though?).

If you prefer not to do that, I guess you could just add a command for submitting to the queue inside the else block of the above Firetask, i.e., "if the job is not finished, submit to the queue with job ID X and add another FW to the workflow". I've never done this, though, so it could wind up causing some goofy behavior.

Thanks,

Alex


I should add: if the "overall time" your runs take is fixed, you don't need a dynamic workflow. You can just assemble all of your Fireworks in the workflow beforehand.
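
For example, a minimal sketch of that static alternative (RunMDCheckpointTask is a placeholder for whatever Firetask runs one checkpointed chunk):

```
from fireworks import Firework, Workflow

# RunMDCheckpointTask: placeholder Firetask that runs one checkpointed chunk of MD
n_chunks = 20  # e.g., 100 h total / 5 h per submission
fws = [Firework(RunMDCheckpointTask(), name="md_chunk_{}".format(i))
       for i in range(n_chunks)]

# chain them linearly: chunk i is the parent of chunk i + 1
links = {fws[i]: [fws[i + 1]] for i in range(n_chunks - 1)}
wf = Workflow(fws, links_dict=links, name="MD Run for System Z (static)")
```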


I'm not sure what the FireWorks core developers think about this, but I'd argue we need first-class support for timeout/walltime issues in the form of a new Firework state that is set when the walltime is reached. It would allow us to handle these kinds of issues a lot more gracefully. I'm told SLURM can send a signal when the walltime is about to be reached; FireWorks could listen for that signal and update its state appropriately, perhaps running a specified FWAction when the walltime is reached (such as triggering code to write a checkpoint). Obviously it wouldn't be able to support all queue adapters, but for the ones it does, I think this would be very useful.


@Matt This was actually an issue I worked on at the last hackathon. IIRC the problem is that the FWs worker hands over execution to whatever program is running and doesn't regain control until the program is done. It can definitely be done; I just didn't have time to finish it. Probably worth raising an issue on the repo, though.


@Matt Does FireWorks really need to listen for a signal from SLURM, though? If you run a FW with a walltime handler, it will know when the walltime is being approached. To preserve the separation between the queue and FireWorks, wouldn't the best solution involve an FWAction that triggers a checkpoint when the walltime handler is about to hit the max time, rather than when SLURM itself does? That way you don't have to code a way to "listen" for each queue system's signal; you just handle it natively. The issue has always seemed to me to be how to make a logical flow from one checkpoint's FW to the next: where to store the files, and how to analyze results that are split across multiple folders, one per checkpoint.
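
For reference, a hedged sketch of what I mean, assuming the walltime handler in question is custodian's WalltimeHandler (as used in atomate's VASP workflows); the VASP command and times are illustrative:

```
from custodian import Custodian
from custodian.vasp.handlers import WalltimeHandler
from custodian.vasp.jobs import VaspJob

# stop VASP gracefully ~10 min before a 5-hour walltime, independent of any SLURM signal
handlers = [WalltimeHandler(wall_time=5 * 3600, buffer_time=600)]
jobs = [VaspJob(vasp_cmd=["mpirun", "vasp_std"])]
Custodian(handlers, jobs).run()
```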


Yes, that's fair. I'm fairly agnostic as to the specific mechanism; my main point is that there should be a state specifically for walltimes/timeouts, and then ways to handle it appropriately (in other words, I think we're in agreement).

This would be instead of relying on tools like detect_lostruns and then losing compute time because of the lack of checkpointing, etc. If FireWorks understood walltimes better, we could hopefully eliminate lost runs in the majority of cases, which in itself would be a big benefit.


Hi all,

A couple of notes:

  • FWS can have BackgroundTasks that run separately from the main process, in threads spawned alongside the main rocket program. One could use a background task to monitor walltimes separately from the main process. Someone could code an official BackgroundTask that handles walltimes, although we'd have to figure out how to get it to interact with the main thread (see the sketch after this list).

  • One aspect that is tricky in the above is that you might not know the walltime in advance, particularly if you are not running in reservation mode. For example, say you allow 72 hours of walltime on one compute resource and 96 hours on another, in line with the queue policies or processor speeds of the different systems. Then an arbitrary job that doesn't know where it will run also doesn't know what walltime to expect, so we can't say in advance that some WallTimeBackgroundTask should be set at, say, 70 hours. Also, ideally, most users won't want to think about what the walltime will be when first creating their workflows; they just want to submit their work without worrying about runtime details like processor counts or walltimes. In such situations, a signal like the one from SLURM is more useful because it is agnostic of the specific walltime.

  • That said, I’m also a little wary of things that are system-specific. So basically there is no perfect solution …

  • Having a separate state for TIMEOUT seems interesting, but I certainly can't pursue it myself. Someone else would need to do it… it seems like Alex Dunn already has some experience here, though.
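
To make the first bullet concrete, here is a minimal, hedged sketch of attaching a background walltime monitor to a Firework; check_walltime.sh and run_md.sh are hypothetical scripts, and how the monitor would signal the main thread is exactly the open question above:

```
from fireworks import Firework, ScriptTask
from fireworks.core.firework import BackgroundTask

# hypothetical script that checks elapsed time and, e.g., writes a "stop now" flag file
walltime_monitor = BackgroundTask(ScriptTask.from_str("bash check_walltime.sh"),
                                  sleep_time=300,       # re-run the check every 5 minutes
                                  run_on_finish=False)

fw = Firework(ScriptTask.from_str("bash run_md.sh"),    # hypothetical main MD task
              spec={"_background_tasks": [walltime_monitor]})
```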
