Questions about JobFlow and additional store

Xavier_Linn · May 26, 2023, 5:03am

Hi,

I have a few questions about using additional stores in the JobFlow API.

Question 1

What are the possible ways to map Job outputs to their respective stores? The pattern I’m aware of from the documentation, and the one I’ve implemented myself, is to have a Job return a dictionary whose keys map values to specific stores. For example,

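def my_task(...):

    ...

    # Each top-level key names a piece of output that can be routed to a store.
    return {
        "doc_store": task_document,
        "trajectories": task_document.calculation_output.dcd_reports,
    }

Job(
    function=my_task,
    # kwarg name = additional store name, value = key of the data to put there
    trajectories="trajectories",
)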

The Job documentation for the kwargs states: “The argument name gives the additional store name and the argument value gives the type of data to store in that additional store.” I read this as saying that I could return a nested schema and specify only the data type of the objects within it that should be sent to the additional store. For example,

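def my_task(...):

    ...

    # Return the full nested document instead of a dictionary.
    return task_document

Job(
    function=my_task,
    # kwarg name = additional store name, value = type of the data to put there
    trajectories=DCDReports,
)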

where TaskDocument is a pydantic data model containing another pydantic data model, CalculationOutput, which in turn contains DCDReports, the pydantic data model that I want to store in the additional store. I tried this, but it didn’t seem to work, so I wanted to check whether I am misunderstanding the documentation or doing something wrong.
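
To make the two readings of the documentation concrete, here is a plain-Python sketch of the behavior I was expecting. It does not import jobflow and is not a claim about how jobflow is implemented: `find_paths` is a hypothetical helper I made up, and the dataclasses are toy stand-ins for my pydantic models. The selector is either a key name (first reading) or a type (second reading).

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import Any, List, Tuple, Union


@dataclass
class DCDReports:  # toy stand-in for my real pydantic model
    frames: List[int]


@dataclass
class CalculationOutput:
    dcd_reports: DCDReports


@dataclass
class TaskDocument:
    calculation_output: CalculationOutput


def find_paths(obj: Any, selector: Union[str, type], path: Tuple = ()) -> List[Tuple]:
    """Return the paths of all nested values matching a key name or a type."""
    # Type selector: match the object itself and stop descending.
    if isinstance(selector, type) and isinstance(obj, selector):
        return [path]
    # Normalise dataclass instances and dicts into (key, value) pairs.
    if is_dataclass(obj) and not isinstance(obj, type):
        items = [(f.name, getattr(obj, f.name)) for f in fields(obj)]
    elif isinstance(obj, dict):
        items = list(obj.items())
    else:
        return []
    found = []
    for key, value in items:
        if isinstance(selector, str) and key == selector:
            found.append(path + (key,))  # key-name selector matched
        else:
            found.extend(find_paths(value, selector, path + (key,)))
    return found


doc = TaskDocument(CalculationOutput(DCDReports(frames=[0, 1, 2])))

# Reading 1: the selector is a dictionary key in the returned output.
by_key = find_paths(
    {"doc_store": doc, "trajectories": doc.calculation_output.dcd_reports},
    "trajectories",
)

# Reading 2: the selector is a type found anywhere in the nested output.
by_type = find_paths(doc, DCDReports)
```

In this sketch both selectors locate the same DCDReports object, which is the equivalence I assumed the Job kwargs would give me.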

Question 2

When using additional stores, how do I make use of the output_schema attribute of the Job? The documentation states that the output_schema of the Job class is of type BaseModel. Assuming that you need to return a dictionary to make use of additional stores (e.g. the first example code above), how is it possible to use the output_schema validation, which expects a BaseModel type, when the Job returns a dict to map data to additional stores?
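
One thing I would hope works is sketched below in plain Python, with a dataclass standing in for a pydantic BaseModel. `OutputSchema` and its fields are hypothetical names, and I have not verified that jobflow behaves this way. The idea is that if the schema itself has a field named after the store key, the serialized output still contains that key, so a schema and key-based routing would not have to conflict.

```python
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class OutputSchema:  # hypothetical stand-in for a pydantic BaseModel
    energy: float
    trajectories: List[int]  # field named after the additional-store key


# The job would return the schema object rather than a raw dict.
output = OutputSchema(energy=-1.5, trajectories=[0, 1, 2])

# After serialization the store key is still a top-level key of the output.
serialized = asdict(output)
has_store_key = "trajectories" in serialized
```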

Question 3

I have a Flow consisting of a list of Jobs and one nested Flow (e.g. Flow.jobs = [Job, Job, Flow, Job, Job]). All of the Jobs have been configured to write data to an additional store, and this works as expected except for the Jobs within the nested Flow. For example, I’ve written unit tests here, with the failing tests commented out; they fail because I expect the Job output put into the document store to be linked via a uuid with the data put into the additional store. The behavior I’m observing is that the Jobs from the nested Flow are not using the additional store. Is this behavior expected, or am I doing something wrong?

Thank you in advance!