Workflows without a mainfile

I have one more idea regarding workflows that I wanted to discuss (considering the pending workflow changes/discussion).

So a lot of the content in our Oasis consists of series/workflows of calculations (E-V curve calculations, energy-strain calculations, diffusion calculations, convergence tests, etc.). Most of the time these calculations do not use any external software to generate the different structures/steps; users do this by hand, as the input preparation and output evaluation are usually straightforward. I have been thinking about trying to detect such cases and group them into a workflow.

The motivation for this is that when a user uploads such calculations right now, it is not clear that their intention was to do, for example, an energy-strain fit. One can see that they uploaded a bunch of calculations which all have the same composition and settings and slightly different structures, but the ultimate goal is not clear until one looks closer.

Let’s take the most common case, a Birch-Murnaghan fit, as an example. In theory, it shouldn’t be too hard to detect such calculations. The folders for the calculations at different volumes will likely be in the same upper-level folder; the folder names for the different volumes usually stay consistent except for the varying numbers, like (vol0.99, vol1, vol1.01, …); the calculations will have the same input settings and the same number and types of atoms; and the atomic positions will also be quite similar (I don’t know if there is a tool to judge the similarity of structures). Looking at the differences, one can usually guess whether this is, for example, a series for a BM fit, some deformations for a specific Cxy elastic tensor component fit, the stress-strain method, a diffusion pathway, etc. So the “parser” would detect such cases, create a new entry with the correct workflow to group the calculations together, and do the appropriate post-processing (here it would be a BM fit to calculate the bulk modulus, equilibrium volume, etc.). If we can later group the search entries by workflow (and make it searchable by the ultimate property obtained from the workflow), it should make it much easier to find stuff.
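To make the idea concrete, here is a rough, self-contained sketch of what such a “parser” could do: group sibling folders that differ only in a numeric token, then fit the collected E(V) points with a third-order Birch-Murnaghan equation of state. All folder names, data values, and units here are illustrative assumptions, not NOMAD API or real parsed output.

```python
import re
from collections import defaultdict

import numpy as np
from scipy.optimize import curve_fit

def group_volume_series(folder_names, min_points=3):
    """Group sibling folders whose names differ only in a numeric token,
    e.g. vol0.99, vol1, vol1.01 -> one candidate E-V series."""
    groups = defaultdict(list)
    for name in folder_names:
        key = re.sub(r"\d+(\.\d+)?", "#", name)  # vol1.01 -> vol#
        groups[key].append(name)
    # A series needs at least a few points to be worth fitting.
    return {k: sorted(v) for k, v in groups.items() if len(v) >= min_points}

def birch_murnaghan(V, E0, V0, B0, Bp):
    """Third-order Birch-Murnaghan energy-volume relation."""
    eta = (V0 / V) ** (2.0 / 3.0)
    return E0 + 9.0 * V0 * B0 / 16.0 * (
        (eta - 1.0) ** 3 * Bp + (eta - 1.0) ** 2 * (6.0 - 4.0 * eta)
    )

# Hypothetical E(V) data that a real implementation would read from the
# parsed entries of the grouped folders (eV and A^3 assumed).
volumes = np.array([18.8, 19.4, 20.0, 20.6, 21.2])
energies = birch_murnaghan(volumes, -10.0, 20.0, 0.6, 4.5)

# Initial guesses from the raw data: lowest energy and its volume.
p0 = (energies.min(), volumes[np.argmin(energies)], 1.0, 4.0)
(E0, V0, B0, Bp), _ = curve_fit(birch_murnaghan, volumes, energies, p0=p0)
print(f"V0 = {V0:.2f} A^3, B0 = {B0 * 160.2177:.1f} GPa")  # 1 eV/A^3 = 160.2177 GPa
```

A real detector would of course also have to verify the input settings and structures before fitting, which is where it gets hard.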

Now the thing is, there is no mainfile here, and as I understand it, the “1 mainfile = 1 entry” scheme is quite hardcoded in NOMAD right now.
An easier solution would be to instruct users to just put some blank custom-named file like “nomad-workflow-parser-mainfile” in the folder containing the calculations, which would trigger the right parser, but I’m not sure I can expect this from my users :slight_smile:
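For illustration, the matching side of such a marker-file scheme is trivial; a few lines of plain Python can find every folder a workflow “parser” would be triggered on (the marker file name is the made-up one from above, not anything NOMAD defines):

```python
import os

# Hypothetical marker file name from the post; not a real NOMAD convention.
MARKER = "nomad-workflow-parser-mainfile"

def find_workflow_roots(upload_root):
    """Return all directories under upload_root that contain the marker
    file, i.e. the folders a workflow 'parser' would be triggered on."""
    return [d for d, _, files in os.walk(upload_root) if MARKER in files]
```

The hard part, as noted, is not the matching but getting users to actually place the file.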

I’ll be grateful for any comments on whether you think this could reasonably work or not.

Hi Pavel!

It is an interesting idea, and we have briefly experimented with something similar in the past. We do have some rudimentary routines for detecting two simple “implicit workflows”: EoS calculations and what we call “parameter variation calculations” (identical structure, different methodology). We use these routines to show plots in the NOMAD Encyclopedia, which deals with materials; you can see an example here (in the calculations tree you can see a few eos/* and par/* entries, which correspond to the equation-of-state and parameter-variation groups automatically detected by NOMAD).

The mechanism we now have in place is not very flexible, as it covers only a few of the DFT codes, and the detected groups are only visible in the Encyclopedia. As you mentioned, there are a few core routines needed for automatically detecting these workflows:

  • Identifying structurally identical calculations (convergence tests, …)
  • Identifying structurally similar calculations (EOS, strain, diffusion, …)
  • Identifying methodologically identical calculations (EOS, strain, diffusion, …)

The first two can be done quite easily, but detecting methodologically identical calculations can be quite hard, since our parsers are nowhere near perfect at capturing the methodology used, and additionally we would need to define the set of meaningful metainfo that defines a method (this also depends on the context: plane waves vs. Gaussian basis sets, etc.).
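As a minimal sketch of the first two checks: comparing compositions and fractional coordinates tolerates a uniform cell scaling, so an EoS-type series passes while a genuinely different structure does not. The dict-based structure representation below is a made-up assumption for illustration, not NOMAD’s metainfo:

```python
import numpy as np

def structurally_similar(a, b, tol=0.05):
    """Crude similarity check between two structures given as dicts with
    'symbols' (chemical symbols, same ordering assumed) and
    'frac_positions' (N x 3 fractional coordinates). Fractional
    coordinates are invariant under uniform volume scaling, so an
    EoS-type series passes, while a different phase should not."""
    if a["symbols"] != b["symbols"]:
        return False
    da = np.asarray(a["frac_positions"]) - np.asarray(b["frac_positions"])
    da -= np.round(da)  # wrap differences into [-0.5, 0.5) (periodicity)
    return float(np.abs(da).max()) <= tol

fcc = {"symbols": ["Al"] * 4,
       "frac_positions": [[0, 0, 0], [0, .5, .5], [.5, 0, .5], [.5, .5, 0]]}
slightly_distorted = {"symbols": ["Al"] * 4,
                      "frac_positions": [[0, 0, .01], [0, .5, .51],
                                         [.5, 0, .5], [.5, .5, 0]]}
print(structurally_similar(fcc, slightly_distorted))  # True for tol=0.05
```

A production version would additionally need to handle permuted atom ordering, supercells, and symmetry, which is exactly why these routines are non-trivial.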

Maybe @mscheidgen can say more, but as far as I know we do not currently have any concrete plans to extend the infrastructure in this direction. It is an interesting concept though.

Thanks for your suggestions. As you can imagine, workflows are a super hard problem involving lots of guesswork. There are multiple ideas that we are, more or less, trying to implement.

  • Explicit parsers for established “workflow engines”, like what we do with phonopy, or what you want to do with lobster. Basically, parsers for codes that depend on external calculations.
  • Implicit “parsers”. What Lauri describes is done by the Encyclopedia on the fly. We could do this during processing and provide the results as NOMAD entries. I could imagine a “parser” that matches a folder structure and is triggered by a user somehow. But maybe the next bullet is better for this use-case?
  • Analysis scripts. If you do “analysis” (running some Python on top of your calculations), this could be run in NOMAD. Think of it as an ad-hoc, uploaded parser with itself as mainfile. Pushing security concerns aside, and provided that our metainfo becomes more flexible, user friendly, supports plotting, etc., this could provide some actual value. Your scripts would be publishable, the results would be FAIR (standard format, findable in our search, etc.), and the data they produce could be presented in our UI. While this sounds like a huge entry barrier for an individual to cross, think of a whole lab building these “parsers” for their recurring workflows. Of course, this somehow needs to be linked to GitHub, Binder, Jupyter, etc. for convenience.
  • User editable entries. We want to provide an archive/metainfo editor. The user could create workflows in NOMAD (e.g. based on her calculations). This could be combined with the other ideas to provide some semi-automated workflows.

Only the first bullet is really implemented. There is a concrete plan for the last (editable entries). We are taking some steps towards implicit “parsers” and analysis scripts. We first need to overhaul our processing (which will happen in the upcoming weeks).

I guess you could realize your idea with a mix of these features.

Thanks for the comments. Just to make this clear, I don’t need a solution right now. At this point I’m still thinking about this, and I’m grateful for any feedback. While I like the analysis-scripts and user-editable-entries points, I’m not so sure they will work in our case. Mostly, they require some work on the user’s part :slight_smile:

Our group is mostly MSc/PhD students (and a few postdocs), and most of the people are not going to stay long. Getting them to actually use research data management (RDM) for experiments was not easy, as the students usually just work with their own data and don’t accumulate enough over their stay to actually need RDM (and it is also arguably sometimes better, from an educational point of view, if a student does the whole workflow themselves rather than using some script/workflow engine). However, there are of course some long-running projects and project handovers between people, and that’s why RDM is important.

The whole NOMAD Oasis setup was partially motivated by me having to spend a significant amount of time sorting through some published old conflicting results: I was contacting users long gone, asking for their data, getting permissions to access old cluster accounts, and trying to reconstruct the whole story. In the end it was just easier to recalculate. However, as I said, most users deal with just their own data, and it’s tricky to explain the benefit of archiving it properly (as it mostly benefits the next person and not them directly). In this regard the Oasis is perfect, as the user requirements are minimal (zip the stuff, dump it there, add some comments/links, and it’s done).

The problem with point 3 (analysis scripts) is also that the current Oasis is probably not flexible enough to allow a convenient workflow. Think of a user doing a bunch of calculations, like several EoS workflows, and trying to run an analysis script in the NOMAD Oasis. They have to upload everything there and then run the scripts. It might work OK for a few EoS fits, and then they find a problem with the last one (maybe one calculation didn’t converge properly). They redo the calculation and would now like to upload the fixed entry, but it will end up in a different upload; there is no (?) way to replace it in the old upload. You also can’t delete published uploads or entries from uploads, or replace them with updated ones. So while using analysis scripts in NOMAD has the advantage of having the metainfo accessible, so you don’t have to know anything about the code-specific output, it would probably be simpler to just have the users install NOMAD locally with pip, create a script which calls the parsers, reads the metainfo, does the workflow analysis, and produces some output which can later be parsed on upload to the Oasis with a custom parser as in point 1 (or creates a NOMAD metainfo entry which could be loaded directly when uploading later, as in point 4). But maybe I misunderstood point 3 and this is already what you are suggesting?

I agree with your descriptions. These are exactly the use cases and problems we also see. We are currently making significant changes to the upload system to allow better integration with personal workflows and local infrastructure (e.g. syncing with the user’s filesystem, incremental editable uploads, etc.). From the perspective of an individual user (e.g. a student), NOMAD currently does not provide much immediate value. Therefore, it feels as if RDM/NOMAD has to be forced upon people.

Our assumption is that if users need to analyse their calculations anyway, we just need to make NOMAD convenient and flexible enough to actually help with that analysis. The value propositions are: you don’t have to manage your files, it’s easier to inspect your data, you don’t have to parse your files yourself, we offer a meaningful Python setup that you can use out of the box with lots of examples, and you can be a good researcher and have everything properly published. But it’s a long way off, and there is lots for us to do.