Question regarding add_additional_fields_to_taskdocs, workflow uuids

Hi all,

My first question concerns unexpected behaviour in add_additional_fields_to_taskdocs – I'm pretty sure I'm making a basic error, but I'm having trouble tracking it down…

Here’s a minimal example:

from fireworks import Workflow
from pymatgen.io.vasp.sets import MPRelaxSet
from atomate.vasp.fireworks.core import OptimizeFW
from atomate.vasp.powerups import add_additional_fields_to_taskdocs

# structure is a pymatgen Structure defined elsewhere
fws = []
vis = MPRelaxSet(structure)
fws.append(OptimizeFW(structure, vasp_input_set=vis, vasp_cmd=">>vasp_cmd<<",
                      db_file=">>db_file<<", name="test workflow"))
wf = Workflow(fws)
wf = add_additional_fields_to_taskdocs(wf, {"wf_uuid": "test uuid"})

When I check the additional fields dictionary with:

from atomate.utils.utils import get_fws_and_tasks

for idx_fw, idx_t in get_fws_and_tasks(wf, task_name_constraint="VaspToDb"):
    print(wf.fws[idx_fw].tasks[idx_t]["additional_fields"])

It prints {'task_label': 'test workflow', 'wf_uuid': 'test uuid'} as expected.

However, when I then print wf.as_dict(), additional_fields reads just {'task_label': 'test workflow'}, with no wf_uuid, and indeed, when the workflow runs, the wf_uuid is missing from the final taskdoc.
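The as_dict check can be sketched as follows (illustrative only: a hand-built dict stands in for the real wf.as_dict() output, assuming FireWorks' usual "fws" -> "spec" -> "_tasks" serialization layout):

```python
# Illustrative only: a hand-built dict standing in for wf.as_dict(),
# assuming FireWorks serializes a workflow as
# {"fws": [{"spec": {"_tasks": [...]}}, ...]}.
wf_dict = {
    "fws": [
        {"spec": {"_tasks": [
            {"_fw_name": "VaspToDb",
             "additional_fields": {"task_label": "test workflow"}},
        ]}},
    ],
}

# Collect additional_fields for every VaspToDb task in the serialized dict.
found = [
    task.get("additional_fields")
    for fw in wf_dict["fws"]
    for task in fw["spec"]["_tasks"]
    if "VaspToDb" in task.get("_fw_name", "")
]
print(found)  # with the bug above: only task_label present, no wf_uuid
```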

What am I missing here?

···

Secondly, on the topic of workflows/workflow uuids: I can see that a lot of workflows follow the pattern of adding a tag, e.g. tag = "gibbs group: >>{}<<".format(str(uuid4())), to keep track of which workflow a taskdoc resulted from. It might be useful to formalize this somewhere, e.g. by having every workflow carry a uuid and name that are inserted into the taskdoc?

Related to this, when writing my own workflows, I've noticed that many of the existing workflows in presets/core.py implement the same boilerplate code (e.g. take a c config dict, check for a few common config params). Is there a standard 'workflow template' that should be followed, or could this boilerplate be moved into a VaspWorkflow base class? I ask because, when writing a workflow, I want to make sure I'm supporting the same features as the other workflows, to reduce confusion for the end user.

Best,

Matt

Regarding my first question, this has been fixed with a clean install of atomate, and is now behaving as expected. I still haven't been able to track down the original problem (which was using Python 3.6/FireWorks 1.4.1/atomate 0.5.0), but happily it's working now -- apologies!

Matt

Hi Matt

Glad the first issue has been resolved. Alireza seemed to also have some similar problems … I don’t know where they originated.

As for the UUID question:

Note #1: the workflows that already do this require grouping tasks together for the downstream analysis stage; i.e., a major part of the workflow is essentially non-functional without this tag.

Note #2: you can already figure out which workflow a task came from, but it's a bit roundabout. You need to go into the fireworks collection and search for something like:

{"action.stored_data.task_id": 5}

If you look at the FWAction returned by the VaspToDb task you’ll see where that’s being added.
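For illustration, that lookup can be mimicked in plain Python (find_one here is a toy stand-in for the corresponding pymongo collection.find_one query, not atomate or FireWorks code):

```python
def find_one(docs, dotted_key, value):
    """Return the first doc whose nested dotted_key equals value.
    A toy stand-in for pymongo's collection.find_one({dotted_key: value})."""
    for doc in docs:
        node = doc
        for part in dotted_key.split("."):
            if not isinstance(node, dict) or part not in node:
                break
            node = node[part]
        else:
            if node == value:
                return doc
    return None

# Two fake firework documents with the FWAction's stored_data embedded.
docs = [
    {"fw_id": 10, "action": {"stored_data": {"task_id": 4}}},
    {"fw_id": 11, "action": {"stored_data": {"task_id": 5}}},
]
match = find_one(docs, "action.stored_data.task_id", 5)
print(match["fw_id"])  # -> 11
```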

One of the reasons to do it this way instead of the opposite (task links to fw_id, or both links: fw->task and task->fw) is to minimize the amount of “reverse” links being created of a downstream object to an upstream object. This is a topic that I think you are aware of from the EmmetCon so I won’t belabor this point.

Regarding your suggestions:

  1. The main goal of atomate, which is non-negotiable, is that workflow writers focus on writing scientific workflows, not on creating and maintaining lots of metadata or obeying metadata rules and conventions. All extra fluff goes in powerups. So if you want to create a powerup that takes in a workflow and adds a unique id and a name of your choosing to its metadata, feel free to do that. Then, when you create the workflow, you need to call that powerup too.

  2. Regarding the config_dict: the original plan was to allow a yaml file (e.g. ~/.atomaterc) that lets you specify default config_dict params and powerups. Thus, if you always wanted to run with a particular config setting or apply a particular powerup, you could. I never had a chance to really implement it. If you want to take a stab at making this section of the code better, I'm all for it, e.g. reducing redundancy within presets/core.py and also removing redundancies for users of that code. There are at least a few low-hanging things that would unambiguously make the code better. Note that these days I tend not to go crazy with abstractions like a VaspWorkflow base class. However, I'm open to a conversation if you have more specifics on what you'd like to do. But note that any new requirement for workflow creators is something I'll evaluate very critically.
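A minimal sketch of the kind of powerup suggested in point 1 (add_wf_uuid is a hypothetical name, and the plain dict here stands in for a FireWorks Workflow; a real version would wrap atomate's add_additional_fields_to_taskdocs on a Workflow object):

```python
from uuid import uuid4

def add_wf_uuid(wf, name=None):
    """Hypothetical powerup: stamp a uuid (and an optional name) into the
    additional_fields of every *ToDb task. Here wf is a plain dict mimicking
    Workflow.as_dict(); a real version would call
    add_additional_fields_to_taskdocs on a Workflow object instead."""
    fields = {"wf_uuid": str(uuid4())}
    if name:
        fields["wf_name"] = name
    for fw in wf["fws"]:
        for task in fw["spec"]["_tasks"]:
            if "ToDb" in task.get("_fw_name", ""):
                task.setdefault("additional_fields", {}).update(fields)
    return wf

wf = {"fws": [{"spec": {"_tasks": [
    {"_fw_name": "VaspToDb", "additional_fields": {"task_label": "test"}}]}}]}
wf = add_wf_uuid(wf, name="my workflow")
```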

···

On Sunday, June 18, 2017 at 9:04:15 PM UTC-7, [email protected] wrote:


Hi Anubhav,

Many thanks for this reply.

The way I read your reply is that workflow constraints (whether metadata, classes, abstractions, or whatever) are counterproductive to the goal of atomate, which is to make scientific workflow writers as productive as possible. This sounds reasonable! I mostly wanted to know if there were any standard formats my own workflows needed to adhere to, so that's fine if there aren't. I will take a look at tidying up the presets to reduce duplicate code, though, just for clarity. On the ~/.atomaterc note, I think adding too many config files can be counterproductive anyway, in that they create 'hidden variables' and add an extra path to debug when things go wrong…

One of the reasons to do it this way instead of the opposite (task links to fw_id, or both links: fw->task and task->fw) is to minimize the amount of “reverse” links being created of a downstream object to an upstream object. This is a topic that I think you are aware of from the EmmetCon so I won’t belabor this point.

This was certainly discussed! I think there were two approaches talked about:

  1. Treat the tasks collection as a ‘bag of calculations’, and not worry about what workflows they come from unless absolutely necessary (e.g. the aforementioned cases where the uuids are essential)

  2. Embed metadata in the taskdoc that contains information on the workflow that originated it and the other taskdoc(s) it derives from, to help book-keeping (esp. for more complicated workflows)

From the people I've spoken to, it seems approach 1 is preferred: embedding workflow metadata over-complicates things and, as you say, this information can be retrieved from the fireworks collection if necessary. That said, I don't think I'd frame it as creating a 'reverse' task->fw link. Instead, I see it this way: the fireworks collection embeds a dependency graph, and the tasks collection could mirror that dependency graph (e.g. a StaticFW taskdoc could link to the task_id of the corresponding OptimizeFW taskdoc). On the one hand it seems a shame to 'throw away' this relational information by not including it in the taskdoc, but I concede it can over-complicate things, and a 'bag of calculations' approach is more general and easier to reason about in the long term.
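To make the mirroring idea concrete, a sketch (the parent_task_ids field and doc shapes are invented for illustration; atomate taskdocs don't carry these links):

```python
# Hypothetical taskdocs that mirror the workflow's dependency graph.
# The "parent_task_ids" field is invented for illustration only.
opt_doc = {"task_id": 101, "task_label": "structure optimization",
           "parent_task_ids": []}
static_doc = {"task_id": 102, "task_label": "static",
              "parent_task_ids": [opt_doc["task_id"]]}

def ancestors(task_id, docs):
    """Walk the mirrored links back to the root calculation(s)."""
    by_id = {d["task_id"]: d for d in docs}
    chain, todo = [], list(by_id[task_id]["parent_task_ids"])
    while todo:
        tid = todo.pop()
        chain.append(tid)
        todo.extend(by_id[tid]["parent_task_ids"])
    return chain

print(ancestors(102, [opt_doc, static_doc]))  # -> [101]
```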

Just some thoughts — not advocating for specific changes here.

All the best,

Matt

···

On Monday, June 19, 2017 at 1:11:05 PM UTC-7, Anubhav Jain wrote:


Hi Matt,

Thanks for these additional thoughts.

I mostly wanted to know if there were any standard formats my own workflows needed to adhere to, so that's fine if there aren't.

Just try your best; I'll let you know in the PR if there are any issues. I assume you've already seen this:

https://pythonhosted.org/atomate/#creating-and-customizing-workflows

Generally, just try to add/modify as little code as necessary, follow a logical structure, and include tests.

On the ~/.atomaterc note, I think adding too many config files can be counterproductive anyway, in that they create ‘hidden variables’ and add an extra path to debug when things go wrong…

It's true. On the plus side, though, it can save you from specifying the same customizations over and over. I imagine that eventually someone will complain and we'll need to add this feature.

the tasks collection could mirror this dependency graph

Well, each task document is technically independent; i.e., the inputs (e.g. INCAR, POSCAR, etc.) and outputs of the task are there in the document (along with the directory the calculation was run in, which is part of the task doc and lets you do things like retrieve the initial CHGCAR or find obscure output files). It's interesting to know which task came before it procedurally, but the task is still independent (e.g. one could copy the same inputs and run again, without any of the contents or knowledge of the directory that ran before it).

The reason I am resisting is that things just tend to break or go wrong. The more features like this you add, the more you have to maintain and fix when things go wrong. No one wants to spend a couple of days detecting and patching corrupted dependency graphs across the 1 million task documents in the Materials Project.

Best,

Anubhav

···

On Tuesday, June 20, 2017 at 12:22:41 PM UTC-7, [email protected] wrote:
