Putting simulation results into "results" subsection

Hi
I would like to upload data from a simulation software that is currently not recognized by Nomad (the simulation software is called “Lightforge” and is a kinetic Monte Carlo software to simulate organic LEDs). In order to upload the data and make the data F.A.I.R., I believe I need to write a schema first. Since there are yaml and python schemas, I would like to start with yaml schemas because they seem easier to write.
The top-level section of the Nomad archive (I think it’s called “EntryArchive”) consists of several pre-defined subsections such as “definitions”, “results”, “data”, “run” and more.
I would now like to fill out the “results” subsection with my own quantities and numbers as well as with the simulation results.
The simulation results of “Lightforge” are .dat files consisting of 2 columns (separated by tabs): x-axis and y-axis values.
So all in all, I would like to accomplish 2 things:

  1. I would like the values from the .dat files to appear in the “results” subsection (and not the “definitions” or “data” subsections).
  2. I would like to be able to define my own keys and their values in the “results” subsection (such as “method: kinetic Monte Carlo” as shown in line 4 of the following example yaml schema):
results:
  sections:
    material: Graphene
    method: kinetic Monte Carlo
    properties:
      structures:
      electronic:
        quantum_efficiency_x: <x-axis values from first column of the .dat file>
        quantum_efficiency_y: <y-axis values from second column of the .dat file>

It would be very nice if someone could help me with the yaml schema. Thank you very much in advance!
PS: I don't know why some words are in bold, I didn't use the bold function.

Hi Fabian,

Thanks a lot for contacting us. Jose Pizarro here, one of the experts in computational data.

As far as I understand, the results section is protected from custom schemas defined by users. This is because things like Elasticsearch and the search filters are applied to it (maybe someone from Area D can correct me if I am wrong). The working procedure is that sections like data (though in the case of computational data it is typically run) can be populated with whatever schema you want, and then a set of normalizations is applied to map some of that data into results. In order for these normalizations to be applied, some base classes have to be defined in the Python schema. A very simplistic view of this would be:

class KMC(MSection):
    # <whatever quantities define a kinetic MC calculation; similar to
    #  what you define in the YAML schema, but here in Python>

    def normalize(self, archive, logger):
        # <populate the results sections and quantities>

You can, of course, decide to simply populate data with your metadata schema and plots. Later we can give it support by defining these classes; there is no real restriction on starting to use it right away and revisiting it later for refinement :slight_smile:
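To make this a bit more concrete, here is a minimal, hypothetical sketch (the class and quantity names are made up, and it assumes the section lives under data by deriving from EntryData, so that its normalize method is called automatically during processing):

import numpy as np
from nomad.datamodel.data import EntryData
from nomad.datamodel.results import Results, Properties
from nomad.metainfo import Quantity

class LightforgeKMC(EntryData):
    # Illustrative quantities only; adapt them to the actual Lightforge output.
    quantum_efficiency_x = Quantity(type=np.float64, shape=['*'])
    quantum_efficiency_y = Quantity(type=np.float64, shape=['*'])

    def normalize(self, archive, logger):
        super().normalize(archive, logger)
        # Make sure the curated sections exist before writing into them.
        if archive.results is None:
            archive.results = Results()
        if archive.results.properties is None:
            archive.results.properties = Properties()
        # Only quantities that already exist in the curated results schema
        # can be populated here; everything else stays under data.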

In any case, and as an alternative, we can help you support your data from Lightforge, so that the data you want (in this case, quantum efficiencies, but it can be anything) is correctly parsed, the normalizers are applied, results is properly populated, and we can even add searches for your specific data. If the idea appeals to you, feel free to drop a yes and I will be happy to organize it so you can quickly start using it.

Best regards,
Jose

Thank you very much for your fast reply, Jose! For now, I will try to make a python schema work for us.

Sure, feel free to reach out if you have any problems, and also check out our latest tutorials; they will help you a lot when creating this schema: 8, 10, and, if you feel brave enough, tutorial number 9.

All the best,
Jose

Hi Jose

I’m sorry to bother you again. I’m working on a python schema, and would like to clarify:
In section results, subsection properties, I would like a new subsection called “Quantum efficiency” and assign it a value of 50%.
This is impossible, right? (Because this would be a custom schema, and the results section doesn’t accept custom schemas.)

Hi @fabian_li!
In principle, we would aim to add this quantity to our curated results in the near future. More precisely, it would probably go into results.properties.optoelectronic. However, we would need to discuss this much more, as it would be re-used by many potential future use cases, including experimental ones.
If we include it, we will probably need another quantity to specify at which injection level this quantum_efficiency was measured or derived, and also the technique it was derived from, to account for photoluminescence, external_solar_cell_quantum_efficiency, electroluminescence, and so on.

You can however have this in your custom schema. Your quantity would look something like this:

import numpy as np
from nomad.metainfo import Quantity

quantum_efficiency = Quantity(
    type=np.float64,
    description='''
    The integrated quantum efficiency of the system.
    By definition a positive value less than or equal to 1.
    ''',
)

My recommendation is to store it as a fraction rather than a percentage, as we do not have support for percentage units at the moment.
In the future, we could also copy it to the results section once we define how quantum_efficiency should look there. This is planned work for us anyway.

Please let us know if you get stuck further along. Maybe it would help to share your repo with one of us so we can see how to help more.

Thanks for the quick reply. I’d like to clarify 1 more general thing regarding python schemas:

I know I can populate the results section with my own numbers (using normalizer functions), but only within the official schema that’s currently in use, right? In other words: there’s no way for me to change the schema of the results section, not even with normalizer functions; only Nomad staff can change it. Is this correct?

Sorry if this question seems repetitive, but this is important info and I have to report this to my supervisor.

Hi @fabian_li, what you say is correct. The results section is something that we control in the source code. One of the reasons for this is that it is our way to keep interoperability across the many different entries, and we need to keep it curated. Having said so, we welcome suggestions, and yours about including a quantum efficiency value is a reasonable one that we had planned anyway in our optoelectronic materials activities. Unfortunately, most of the staff is on vacation, so it is unlikely that it will happen immediately.
My suggestion is the following:

  • Define a quantum_efficiency Quantity in your custom schema.
  • When the Quantity in Results is ready, you can populate it via a normalizer and a reprocessing.

By the way, you can also search for your custom schema quantities. Let me know if you need help with that.

To help us understand better: are you writing a parser or a schema plugin? Are you planning to use this in a NOMAD Oasis or to upload data to the central NOMAD service?

Thank you!

Hi. I’m writing a schema plugin. The research group I’m in plans to upload their data to our local Nomad Oasis. We use ca. 60 different simulation programs (most of them not currently recognized by Nomad), and eventually we want to upload all our simulation data.

So if I create a custom python schema, which will put my simulation results into the data section, those simulation results become searchable by elasticsearch, search filters etc., just like data in the results section? Then what is the difference between data and results sections? Or is there a difference between “interoperable” and “searchable”?

Hi @fabian_li,

Sorry, as Pepe said, I (as well as others, mainly the people working on simulations in NOMAD) was on vacation. Now I am back and ready to help you with everything you need! :slight_smile:

I think it makes sense to have a quick meeting (probably not more than 30 min to 1 hour) in which you can ask us what you want, and we can prepare to help you better and implement the normalization you need. For that, you can get in touch at [email protected]

I see the development of new features for simulations in NOMAD as:

  1. Someone proposes a new metadata schema (either using YAML or contacting us directly to write a plugin/parser) and writes it to run.

  2. She/he uploads that data into NOMAD (Oasis or central, this is up to the user, but eventually data should go into the central repo). At the same time, we work on the normalization side, defining new quantities and sections inside results. The goal here is mainly to define results.method and results.properties. I specifically mention results.method first, as it is usually easier to define the filters you want based on the method rather than going directly into properties :slight_smile:

  3. If the users want automatic recognition instead of having to add YAML files manually, we work on creating new plugins/parsers, approximately one for each simulation package. Even though it takes time to write these, at the end of the day it is much more comfortable for the users, as they no longer need to add any extra YAML file. But this point can be skipped.

  4. After results and run are populated, we can also work on some specific visualizations. This is the last, longer-term step, and it is pretty much community-dependent, i.e., it depends on the typical plots that any KMC expert likes to see and immediately recognizes (e.g., I-V curves).

So in short, point 2 would require that we eventually sit together. I’d say the earlier the better, so we can help even at point 1 (and even talk about point 3). By the way, once you get past point 1, you don’t need to wait to publish the data, as we (like Pepe said) can always reprocess data to recognize newly defined quantities.

So if I create a custom python schema, which will put my simulation results into the data section, those simulation results become searchable by elasticsearch, search filters etc., just like data in the results section? Then what is the difference between data and results sections? Or is there a difference between “interoperable” and “searchable”?

I think what Pepe meant (please @Pepe_Marquez, correct me if I am wrong) is that you can filter on some custom schema quantities after you do the search using results. That is the difference: querying is done solely on results, while run helps you further filter the retrieved archive.
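To illustrate with a rough sketch (the endpoint and payload below follow my reading of the NOMAD v1 API; please double-check them against the current API documentation):

import requests

base_url = 'https://nomad-lab.eu/prod/v1/api/v1'

# Queries are formulated against the curated results quantities.
response = requests.post(
    f'{base_url}/entries/query',
    json={
        'query': {'results.material.elements': {'all': ['C']}},
        'pagination': {'page_size': 10},
        'required': {'include': ['entry_id']},
    },
)
print(response.json())

Anything sitting under run or data you would then inspect in the retrieved archives themselves.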

All the best,
Jose

Quick question in between: creating custom subsections and custom quantities in run is possible, right? And searching these custom quantities is possible by using the search function for “user defined quantities” on the GUI, right?

Hi Fabian,

creating custom subsections and custom quantities in run is possible, right?

Yes, it is possible. But I would suggest following the current schema and just extending on top of it. You can, of course, make suggestions and give feedback, and together we can see how to implement those globally once you have data in your OASIS.

And searching these custom quantities is possible by using the search function for “user defined quantities” on the GUI, right?

After talking with @Pepe_Marquez and @mscheidgen, it seems that, for the run section, it is not feasible to have the “user defined quantities” searchable in the GUI. To my understanding, this is because this search option dynamically reads from the data section, which would be very complicated for the 12 million entries that contain a run section.

Ok then I would like to clarify:

If run is not searchable, but data is, why should we upload our simulation data to run? At first sight, run has no advantages over data. However, other computational groups seem to use run a lot, and they never use data. This tells me that there must be some advantage of run over data (at least for computational groups): is it that run has higher interoperability than data because of its built-in schema?

Hi. I also have a question about YAML schemas and the results section:

So, these normalizations are Python functions, and since YAML files don’t understand functions, I always thought that YAML schemas cannot populate results. Therefore I started working on Python schemas. However, this YAML schema works:

results:
  method:
    method_name: DFT
    workflow_name: single_point

So can YAML schemas populate results or not?! Thanks in advance!

Hi Fabian,

If run is not searchable, but data is, why should we upload our simulation data to run? At first sight, run has no advantages over data.

Well, the advantage of data (with the User Defined Quantities) exists because the volume of entries in the Archive that contain data is much smaller than for run. If the numbers were equivalent (i.e., tens of millions of entries that contain data), I imagine the same problem would appear there and User Defined Quantities would not be feasible anymore.

This tells me that there must be some advantage of run over data (at least for computational groups): is it that run has higher interoperability than data because of its built-in schema?

The main reason is merely historical. And on the contrary: run, data, and nexus each serve their purpose for one of the specific areas of FAIRmat: computations, ELNs (experimental schemas), and the NeXus initiative (characterization schemas). These sections are therefore very community-specific and hence not interoperable between them.

Another reason why other users use run over data is that, for simulations, it is more convenient to write a simple Python script (what we call a plugin or parser, point 3 in my explanation above) that automatically imports sections and quantities from the NOMAD schema and populates them by reading files and so on. You can find a bunch of them in the dependencies sub-modules on GitHub: GitHub - nomad-coe/electronic-parsers.
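Just to illustrate the idea with a hypothetical sketch for the two-column .dat files you described (the lightforge_schema import and all names are made up, and for simplicity it fills a custom data section like the earlier sketch rather than run):

import numpy as np
from nomad.datamodel import EntryArchive

# Hypothetical custom schema class, e.g. the sketch from earlier in this thread.
from lightforge_schema import LightforgeKMC

def parse_dat(mainfile: str, archive: EntryArchive, logger=None):
    # Lightforge .dat files: two tab-separated columns (x and y values).
    x, y = np.loadtxt(mainfile, delimiter='\t', unpack=True)
    section = LightforgeKMC()
    section.quantum_efficiency_x = x
    section.quantum_efficiency_y = y
    # Attach the populated section; the normalize methods then run during
    # (re)processing and can map supported quantities into results.
    archive.data = section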

However, this YAML schema works:

I am surprised that it is possible to add any results quantities in YAML. I am guessing this is unintended behaviour, but let me ask around first to confirm. Maybe @Pepe_Marquez or @aalbino2 can say more here.

Thanks for being so active. It has been very nice answering your questions :slight_smile:
Jose