How to extract the chemical formulas of all OPTIMADE entries?

sgbaird · September 7, 2021, 4:59am

It was mentioned that there are ~16.5 million entries across all of OPTIMADE. How does one go about extracting this as a list of compositions?

blokhin · September 7, 2021, 3:33pm

First, one has to discover all the providers. Using the official listing, this can be done in Python via optimade.server.routers.utils.get_providers or (possibly also) pymatgen.ext.optimade.OptimadeRester.refresh_aliases, and in JavaScript/TypeScript via optimade.Optimade.getProviders.
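
As a rough sketch of this discovery step without either library, the provider listing can also be parsed by hand. The listing URL and the helper names `parse_providers` and `fetch_providers` are assumptions made for illustration. The payload shape (a `data` list of `links` entries with an `attributes.base_url`) follows the OPTIMADE links format. Providers with a null `base_url` are skipped, and the returned URLs typically point to index meta-databases whose own links endpoint lists the actual child databases.

```python
import json
import urllib.request

# Assumed location of the official provider listing; verify before relying on it.
PROVIDERS_URL = "https://providers.optimade.org/v1/links"


def parse_providers(payload):
    """Map provider id -> base_url, skipping providers without a base_url."""
    providers = {}
    for entry in payload.get("data", []):
        base_url = (entry.get("attributes") or {}).get("base_url")
        # base_url may be a plain string or an object with an "href" key.
        if isinstance(base_url, dict):
            base_url = base_url.get("href")
        if base_url:
            providers[entry["id"]] = base_url
    return providers


def fetch_providers(url=PROVIDERS_URL, timeout=30):
    """Download the listing and parse it (performs a network request)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_providers(json.load(response))
```

Keeping the parsing separate from the download makes it possible to check the logic against a saved copy of the listing.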

Second, one has to fetch all the structural entries from all the providers, taking into account pagination and request rate limiting, removing duplicates, etc. The formula will be given by chemical_formula_reduced, chemical_formula_hill, or a provider-specific field; the composition will be given by the elements field or can be deduced from the formula.
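
The pagination part of this step could look roughly like the sketch below. It assumes an unversioned `base_url` to which `/v1/structures` can be appended, and the names `fetch_json` and `iter_formulas` are hypothetical. There is no rate limiting or retry logic here; a real harvest would need both.

```python
import json
import urllib.parse
import urllib.request


def fetch_json(url, timeout=30):
    """Fetch one page and decode it as JSON (plain stdlib, no retries)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def iter_formulas(base_url, fetch=fetch_json, page_limit=None):
    """Yield (id, chemical_formula_reduced) pairs, following links.next."""
    query = urllib.parse.urlencode({"response_fields": "chemical_formula_reduced"})
    url = f"{base_url.rstrip('/')}/v1/structures?{query}"
    pages = 0
    while url:
        page = fetch(url)
        for entry in page.get("data", []):
            attributes = entry.get("attributes") or {}
            yield entry["id"], attributes.get("chemical_formula_reduced")
        pages += 1
        if page_limit is not None and pages >= page_limit:
            break  # optional cap, handy for a trial run against one provider
        next_link = (page.get("links") or {}).get("next")
        # links.next may be null, a URL string, or an object with an "href" key.
        url = next_link.get("href") if isinstance(next_link, dict) else next_link
```

Passing the fetch function as an argument lets the loop be tried against canned pages before pointing it at a live API.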

Third, one has to take care of scalability. Fetching and storing tens of millions of entries efficiently might require some additional engineering, especially on commodity hardware.

Looks like a very nice research project!

sgbaird · November 3, 2021, 1:08am

Would you recommend against this approach?
(The following is based on the OptimadeRester tutorial notebook via e.g. Google Colab)
Install dependencies

pip install pymatgen pybtex retrying

Instantiate OptimadeRester and get results

from pymatgen.ext.optimade import OptimadeRester
opt = OptimadeRester(timeout=3600)
opt.refresh_aliases()
results = opt.get_structures()

Filter results

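import pandas as pd

# results maps provider alias -> {entry identifier -> pymatgen Structure}
records = []
for provider, structures in results.items():
    for identifier, structure in structures.items():
        records.append({
            "provider": provider,
            "identifier": identifier,
            # reduced formula as computed by pymatgen from the structure
            "formula": structure.composition.reduced_formula,
        })
# one row per structure; duplicate formulas across providers are kept
df = pd.DataFrame(records)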

See also [SUGGESTION] Implement limiting OptimadeRester results to specific properties

ml-evs · November 3, 2021, 10:26am

Hi @sgbaird, I think this would be the right idea if you were performing quite a restrictive filter, say, for compositions involving particular elements. However, doing this kind of open query on all providers is unlikely to be performant (or particularly useful, without special considerations). Consider the fact that the NOMAD repository contains structures from almost all of the other databases, and they are not necessarily equilibrium structures, and that other databases (say COD) contain experimentally derived structures, which may have missing data. I do agree that this will be a common thing that people try to do, and I see that you have already noted my /all suggestion in the other thread.

Perhaps it would help if you tell us what you are trying to achieve with this query?

From a purely technical point of view, you can restrict the response fields from any OPTIMADE API with the response_fields URL parameter, e.g. example.org/v1/structures?response_fields=chemical_formula_reduced. I believe the pymatgen extension hardcodes these response fields to those required for constructing pymatgen structure objects (I will try to reply to your other post on this in the other thread), but it should be easy enough to write a custom script to scrape this.
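
For a custom script, the query URL can be assembled along these lines. This is a minimal sketch: `structures_url` is a hypothetical helper, and it assumes the provider's base URL is unversioned so that `/v1/structures` can be appended. `response_fields`, `filter` and `page_limit` are the standard OPTIMADE query parameters.

```python
from urllib.parse import urlencode


def structures_url(base_url, response_fields=("chemical_formula_reduced",),
                   filter_=None, page_limit=None):
    """Build a /v1/structures URL that requests only the given fields."""
    params = {"response_fields": ",".join(response_fields)}
    if filter_:
        params["filter"] = filter_  # e.g. 'elements HAS "Si"'
    if page_limit:
        params["page_limit"] = page_limit
    # urlencode percent-encodes quotes, commas and spaces in the values.
    return f"{base_url.rstrip('/')}/v1/structures?{urlencode(params)}"


print(structures_url("https://example.org"))
# https://example.org/v1/structures?response_fields=chemical_formula_reduced
```

A restrictive query such as `structures_url("https://example.org", filter_='elements HAS "Si"')` goes through the same helper.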

JPBergsma · November 3, 2021, 3:00pm

Yes, Pymatgen hardcodes the response_fields in the get_snls_with_filter function. It sets “response_fields=lattice_vectors, cartesian_site_positions, species, species_at_sites”

You can find the source code for Pymatgen on GitHub: materialsproject/pymatgen.

You may also miss some compounds if you look just at the chemical formula, as they can have the same chemical formula but a different structure, e.g. diamond and graphite.

sgbaird · December 9, 2021, 1:42am

I’m surprised that I didn’t respond to this already. Sorry about that! I appreciate your comments. The idea is composition-based (i.e. chemical formula-based) materials discovery, where the validation dataset (hopefully containing hundreds of thousands of potential formulations, even theoretical ones) gets ranked/sorted based on the criteria for the materials discovery campaign (e.g. with mat_discover). In other words: what is every composition that anyone has ever thought of or put into a database, rather than trying to generate and rank compositions “from scratch”? Generating from scratch would require consideration of chemistry rules and could produce even more outlandish suggestions than some of the theoretical materials in the databases, so starting with the “sum of every composition that scientists have ever spent time on” seemed like an interesting way to go. If you know of any from-scratch generative models for composition, I’d be interested to hear about them, but that’s a bit off-topic for this post.

sgbaird · December 9, 2021, 1:45am

It’s good to know that the response fields are hard-coded and where they can be found. Thanks! And definitely, we’ll miss a lot of the allotropes, but that is more a consequence of using composition-based materials discovery, e.g. with mat_discover, which uses (my refactor of) CrabNet for property predictions and is also based on the Element Mover’s Distance. There are a lot of components of mat_discover that are limited to chemical formulas rather than crystal structures. The next step up would be to do the materials discovery with crystal structure, and that’s certainly of interest.

sgbaird · December 9, 2021, 2:00am

As an update, Cameron Hargreaves, author of ElMD, ran (at least something close to) the script I mentioned above. It took ~7 hr 20 min to complete and produced 165,605 (99,970 unique) compositions. Many of the APIs failed to start or failed partway.

JPBergsma · December 9, 2021, 12:11pm

One of the reasons some of the APIs queried via pymatgen fail is described in this issue.
In short, some of the databases do not supply the response fields parameter in the next field.
This causes more fields to be returned than Pymatgen expects and it therefore fails to load these structures correctly.
If you provide queries without specifying the response fields it should work. Pymatgen currently automatically adds this field, so you would have to write a script to download the data yourself.

ml-evs · December 9, 2021, 1:43pm

7 hours doesn’t sound completely unreasonable to me for attempting to paginate through all data from all database providers in serial. It would be great if we had a proper async Python client for OPTIMADE that could speed this up (see discussions at Materials-Consortia/optimade-python-tools#932). However, whilst this query is possible via OPTIMADE, these APIs are not designed for good performance on these “MapReduce” style queries (e.g., you can’t exploit any underlying indexes to get unique values of a certain field across the entire database).
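
Until such a client exists, a thread pool is one possible stand-in for querying providers concurrently instead of in serial. This is only a sketch and not the async client discussed in that issue: `harvest_all` is a hypothetical helper, and `harvest_one` stands for whatever function pages through a single provider and returns its results.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def harvest_all(providers, harvest_one, max_workers=8):
    """Call harvest_one(base_url) for each provider in parallel threads.

    Returns (results, errors), both keyed by provider name.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(harvest_one, url): name
                   for name, url in providers.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                # One failing provider should not abort the whole harvest.
                errors[name] = exc
    return results, errors
```

With one thread per provider the total runtime is bounded by the slowest provider rather than the sum of all of them, while each individual server still sees only serial requests.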

In terms of the number of results you received, I would expect many more. In addition to @JPBergsma’s comment about the response fields/pagination bug that was particularly detrimental to the pymatgen client, I know a few implementations have also fixed pagination in recent days/weeks:

NOMAD, for example, should give you 12M compositions (though obviously much fewer unique ones). Their pagination links should work fine now, but were temporarily broken a few weeks ago.
One edge case would be AFLOW (which again should give you several million results), who have chosen to respond with no results for empty filters. This is a common requirement if the developers are bandwidth limited (like e.g. the standard Materials Project MAPI which will not give you results for filters that are too broad). You could add something like ?filter=nelements>0 in this case…
OQMD made some fixes around the way they represent formulae (which again, was breaking pymatgen in particular).
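
The empty-filter edge case could be handled with a single retry, roughly as sketched here. `first_page` is a hypothetical helper, `fetch` is any function that takes a URL and returns the decoded JSON page, and the base URL is assumed to be unversioned.

```python
from urllib.parse import urlencode

FALLBACK_FILTER = "nelements>0"  # the catch-all filter suggested above


def first_page(base_url, fetch, response_fields="chemical_formula_reduced"):
    """Fetch the first page; retry once with a catch-all filter if it is empty."""

    def url_for(filter_):
        params = {"response_fields": response_fields}
        if filter_:
            params["filter"] = filter_
        return f"{base_url.rstrip('/')}/v1/structures?{urlencode(params)}"

    page = fetch(url_for(None))
    if not page.get("data"):
        # Some providers answer an empty filter with no results at all.
        page = fetch(url_for(FALLBACK_FILTER))
    return page
```

An empty first page can also mean the database really is empty, so the retry costs one extra request in that case.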

You might have more luck writing something from scratch that does not need to be deserialized into a pymatgen Structure - the OPTIMADE chemical_formula_reduced should be robust enough as a field that you can do the set comparison over just the strings - though this should be verified.
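
The string-based set comparison might look like the sketch below. It assumes the harvested records are available as `(provider, entry_id, formula)` tuples, which is a layout chosen for illustration, and it compares the formula strings exactly, so it relies on every provider emitting chemical_formula_reduced in the same normalized form.

```python
from collections import defaultdict


def unique_formulas(entries):
    """Group (provider, entry_id, formula) records by exact formula string."""
    by_formula = defaultdict(list)
    for provider, entry_id, formula in entries:
        if formula:  # skip entries where the formula is missing
            by_formula[formula.strip()].append((provider, entry_id))
    return dict(by_formula)


entries = [
    ("provider_a", "1", "C"),
    ("provider_b", "9", "C"),
    ("provider_a", "2", "ClNa"),
    ("provider_c", "3", None),
]
groups = unique_formulas(entries)
print(len(groups))  # 2 distinct formula strings
```

Keeping the provider and entry identifiers next to each formula means that entries sharing a formula, such as different structures of the same composition, can still be traced back later.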

Finally, I notice in your initial script that you wrote timeout=3600. You may not have used this in the final script, but just a heads-up that this is the timeout per request, so in the case that 7 databases are offline for whatever reason, your minimum runtime will be 7 hours.
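
The arithmetic behind that estimate, as a small sketch: it assumes the providers are queried in serial and that each unresponsive one hangs for the full timeout. The function name and the `attempts_per_request` parameter are hypothetical, the latter standing in for any retry behaviour in the client.

```python
def worst_case_wait_hours(unresponsive_providers, timeout_seconds,
                          attempts_per_request=1):
    """Hours spent waiting if every unresponsive provider hangs until the timeout."""
    total_seconds = unresponsive_providers * timeout_seconds * attempts_per_request
    return total_seconds / 3600


print(worst_case_wait_hours(7, 3600))  # 7 providers x 1 h timeout -> 7.0
print(worst_case_wait_hours(7, 30))    # same providers with a 30 s timeout
```

Under the same assumptions, a 30 s timeout reduces the wait for those seven providers to a few minutes.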