[SUGGESTION] Implement limiting OptimadeRester results to specific properties

The OptimadeRester API seems to only allow for returning a pymatgen Structure or StructureNL. I’m wondering how difficult (and how beneficial) it would be to allow the output to be limited to one or more specific properties, for example chemical formulae only (see "How to extract the chemical formulas of all OPTIMADE entries? - #2 by blokhin"), similar to MPRester:

MPRester().query(properties=["task_id", "pretty_formula"])

Or is there a way to limit the output using an Optimade filter that I’m overlooking?

This may be related to the thread [SUGGESTION] "/all" or "/archives" endpoint for downloading *all* entries - #2 by ml-evs, as a way to reduce the computational burden of downloading everything at once. It seems that I’ll be running the code for several days in order to obtain all chemical formulae (by first downloading all Structures). Just hoping I don’t get an out-of-memory error once the computation finishes :sweat_smile:. For now, I might be better off using the original APIs of the large databases (OQMD, for example), which have this feature, to save a few days of computation time.

I realize that OptimadeRester is still under development. It’s awesome to see that this class has been implemented in pymatgen, as well as Optimade itself!

Thanks for the suggestion! As I noted in my other answer, this is already possible within OPTIMADE using response_fields, so it would be possible to add this as a parameter to pymatgen's OptimadeRester.
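A minimal sketch of what such a request could look like, built by hand rather than through OptimadeRester. The query parameters `filter`, `response_fields` and `page_limit` and the `chemical_formula_reduced` property come from the OPTIMADE specification; the helper name `build_structures_url` and the base URL `https://example.org/optimade` are placeholders made up for illustration, and some providers may publish a base URL that already includes the version prefix.

```python
from urllib.parse import urlencode


def build_structures_url(base_url, response_fields, filter_string=None, page_limit=None):
    """Build an OPTIMADE /structures URL that asks only for the given fields.

    Hypothetical helper: assumes base_url does NOT already end in /v1.
    """
    # response_fields is a comma-separated list of property names.
    params = {"response_fields": ",".join(response_fields)}
    if filter_string:
        params["filter"] = filter_string
    if page_limit:
        params["page_limit"] = page_limit
    return f"{base_url.rstrip('/')}/v1/structures?{urlencode(params)}"


# Placeholder base URL; substitute a real provider's base URL.
url = build_structures_url(
    "https://example.org/optimade",
    response_fields=["chemical_formula_reduced"],
    filter_string='elements HAS "Si"',
    page_limit=100,
)
print(url)
```

The response should then contain only the entry `id`, `type` and the requested attribute for each structure, which is far smaller than a full structure.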

Much like the original MPRester (which used to(?) error if your query returned too many results), I don’t think the OptimadeRester is really designed for this kind of query (all structures from all providers). You may be better off using the provider list from pymatgen (or our other package, optimade-python-tools) to write your own scripts that do not require deserializing the responses as pymatgen objects. You can also increase the requested page size for each database (with the page_limit URL parameter) so that you do not have to make as many requests. If you are feeling particularly keen, you could write an async function for getting the structures from each provider, so that they can all be queried concurrently (and, say, cached to disk) to avoid one provider being a bottleneck. I imagine we will build some of this functionality into optimade-python-tools, where it could then be used by other client code.
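As a rough sketch of that approach (not how OptimadeRester or optimade-python-tools do it), the following uses only the standard library. It assumes each provider returns the standard OPTIMADE JSON layout, with entries under `data` and the next page under `links.next`, which may be either a URL string or an object with an `href` key. The function names `fetch_json`, `iter_entries` and `collect_formulas` are hypothetical, and `asyncio.to_thread` requires Python 3.9 or later.

```python
import asyncio
import json
import urllib.request
from typing import Callable, Dict, Iterator, List, Optional


def fetch_json(url: str) -> dict:
    """Default fetcher: GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=60) as response:
        return json.load(response)


def iter_entries(first_url: str, fetch: Callable[[str], dict] = fetch_json) -> Iterator[dict]:
    """Yield entries page by page, following links.next until it is empty."""
    url: Optional[str] = first_url
    while url:
        page = fetch(url)
        yield from page.get("data", [])
        next_link = (page.get("links") or {}).get("next")
        # links.next may be a plain URL string or an object with an "href" key.
        if isinstance(next_link, dict):
            next_link = next_link.get("href")
        url = next_link


async def collect_formulas(
    first_urls: Dict[str, str], fetch: Callable[[str], dict] = fetch_json
) -> Dict[str, List[str]]:
    """Page through every provider concurrently, one worker thread each."""

    def collect_one(url: str) -> List[str]:
        formulas = []
        for entry in iter_entries(url, fetch):
            formula = (entry.get("attributes") or {}).get("chemical_formula_reduced")
            if formula:
                formulas.append(formula)
        return formulas

    names = list(first_urls)
    results = await asyncio.gather(
        *(asyncio.to_thread(collect_one, first_urls[name]) for name in names)
    )
    return dict(zip(names, results))
```

Calling `asyncio.run(collect_formulas({"provider name": first_page_url}))` would return a dictionary of formula lists keyed by provider. A slow provider only delays its own thread, and writing each provider's list to disk as it completes would guard against losing days of work to a crash or an out-of-memory error.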

For the OQMD specifically, I know they are in the process of changing hosting, which may be why it is much slower than usual at the moment.

Finally, as I mentioned in my other comment, I would be careful interpreting the results of multi-provider queries like this. Statistics (e.g., the number of compounds involving a given element) will not be valid across providers, and, depending on the database, many structures may not be chemically stable (i.e., they may have a high formation energy) or even at equilibrium.


Thanks for this. I think that I may go with a single, large database (either OQMD or NOMAD). Right now, all I really need is a long list of formulas for which there might be interesting compounds related to our target property. It would be best for me to have the formation energy as well, since eventually we’d like to try to make some of the more promising compounds experimentally and characterize them. I’m also realizing that with ~3 million entries in NOMAD, there’s a good chance that the other databases overlap with it significantly (for example, the chemical formulas in a database with a few thousand entries might all already appear in NOMAD), so it may not be worth the effort (at least for now, for me) to try to get a list of “every chemical formula that’s ever been put into a materials database”.