Working with File-Driven Workflows

I’m evaluating FireWorks as a workflow management system for bioinformatics applications, and I’m interested to hear what the best practices are for file-driven workflows. We have a sizable I/O burden, with many tools that read from and write to large genomic data files, so moving files around is undesirable. In my testing, I’m passing absolute paths between Fireworks by returning an FWAction with update_spec; this accounts for the different directory structures produced by rlaunch rapidfire versus singleshot. Ideally we’d like to write all the important files to a common directory and have an easy way to remove temporary files after a run is deemed successful. Any thoughts on the best implementation using FireWorks? How are groups handling temporary/intermediate files? Is there a good way to flag and delete them?

Hi,

I am not sure I can give a complete answer but here are some notes that might be helpful:

  • You can certainly pass directories using update_spec; we do the same in the Materials Project (see the first sketch at the end of this message).

  • Note that if you have a common directory or a desired run directory, you can set that via the “_launch_dir” key in the spec, which controls where your FW gets run (see the second sketch below): http://pythonhosted.org//FireWorks/controlworker.html

  • If you have temporary files, you can delete them as part of your FireTask, i.e., before returning your FWAction. Alternatively, packages like “monty” provide a context manager for running in a temporary directory, e.g. https://www.pythonhosted.org/monty/_modules/monty/tempfile.html . The context manager takes care of cleaning up your scratch files as you exit the code block that needs them (see the third sketch below).

  • If you need to destroy the directory that FW creates to run the job, you can try the REMOVE_USELESS_DIRS setting: http://pythonhosted.org//FireWorks/config_tutorial.html

  • Note that the GitHub code also includes a script called “fwtool” in fireworks/script/fwtool. That script has an option to go through all the COMPLETED workflows and delete their directories. By default this script is not included in the installation, but if you think it would be generally useful, you can submit a request to make it an official feature. The current version simply deletes every completed directory, but it could be generalized to be more selective.
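
Here is a minimal sketch of the path-passing approach from the first bullet, assuming a recent FireWorks install; the task name, spec keys (“data_dir”, “aligned_bam”), and file names are hypothetical placeholders, not part of FireWorks itself:

    import os

    from fireworks import FWAction, FireTaskBase, explicit_serialize

    @explicit_serialize
    class AlignReadsTask(FireTaskBase):
        """Hypothetical alignment step that passes its output path downstream."""

        def run_task(self, fw_spec):
            # write the large output file into a common data directory rather
            # than the launch directory, so children can find it no matter
            # where rlaunch (rapidfire or singleshot) put this FW's launch dir
            out_path = os.path.join(fw_spec["data_dir"], "aligned.bam")
            # ... run the actual alignment tool here, writing to out_path ...

            # update_spec pushes the absolute path into the spec of the
            # child Fireworks
            return FWAction(update_spec={"aligned_bam": out_path})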
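
A companion sketch for the “_launch_dir” point; the directory paths are again placeholders:

    from fireworks import Firework

    # run this FW inside a fixed directory instead of a freshly created
    # launcher directory, and tell it where the shared data lives
    fw = Firework(
        [AlignReadsTask()],
        spec={
            "_launch_dir": "/projects/genomes/run_001",
            "data_dir": "/projects/genomes/data",
        },
    )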
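
And a sketch of the monty context-manager approach for scratch files; the scratch root and file name are placeholders:

    from monty.tempfile import ScratchDir

    with ScratchDir("/scratch"):
        # inside the block, the cwd is a newly created temporary directory
        # under /scratch; intermediate files can be written freely here
        with open("intermediate.tmp", "w") as f:
            f.write("scratch data")
    # on exit, the temporary directory and everything in it are removed
    # and the original working directory is restored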

I hope that helps, but please do follow up if there’s something I missed.

Best,

Anubhav
