How to extract the chemical formulas of all OPTIMADE entries?

It was mentioned that there are ~16.5 million entries across all of OPTIMADE. How does one go about extracting this as a list of compositions?

First, one has to discover all the providers. Using the official listing, this can be done in Python via optimade.server.routers.utils.get_providers or pymatgen.ext.optimade.OptimadeRester.refresh_aliases (possibly both), and in JavaScript / TypeScript via optimade.Optimade.getProviders.
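For anyone who prefers not to depend on those libraries, here is a minimal standard-library sketch of the discovery step. The listing URL, the JSON layout (a "data" list whose items carry "id" and "attributes.base_url") and the helper names fetch_json and parse_providers are my assumptions, so please check them against the real listing before relying on this.

```python
import json
import urllib.request

# Assumed URL of the official providers listing; verify before use.
PROVIDERS_URL = "https://providers.optimade.org/v1/links"


def fetch_json(url, timeout=30):
    """Download a URL and decode the response body as JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def parse_providers(payload):
    """Map provider id -> base URL, skipping providers that have none."""
    providers = {}
    for entry in payload.get("data", []):
        base_url = (entry.get("attributes") or {}).get("base_url")
        # A link may be a plain string or an object with an "href" key.
        if isinstance(base_url, dict):
            base_url = base_url.get("href")
        if base_url:
            providers[entry["id"]] = base_url.rstrip("/")
    return providers
```

Calling parse_providers(fetch_json(PROVIDERS_URL)) should then give a dictionary of provider ids and base URLs. As far as I understand, such a base URL may point to a provider's index rather than to a database with structures, in which case its own links endpoint has to be queried as well.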

Second, one has to fetch all the structure entries from all the providers, taking into account pagination and request rate limiting, removing duplicates, etc. The formula is given by the chemical_formula_reduced field, the chemical_formula_hill field, or a provider-specific field; the composition is given by the elements field or can be deduced from the formula.
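A rough sketch of the fetching loop, under some assumptions: each page is JSON with a "data" list and a "links.next" value (a string, an object with "href", or null), the fetch argument is any callable that returns such a page for a URL, and duplicates are defined simply as repeated formula strings. The names iter_formulas and unique_formulas are made up for illustration, and the fixed delay is only a crude stand-in for proper rate limiting.

```python
import time


def iter_formulas(start_url, fetch, delay=1.0):
    """Yield (entry id, formula, elements) while following pagination links."""
    url = start_url
    while url:
        page = fetch(url)
        for entry in page.get("data", []):
            attrs = entry.get("attributes") or {}
            # Prefer the reduced formula and fall back to the Hill formula.
            formula = (attrs.get("chemical_formula_reduced")
                       or attrs.get("chemical_formula_hill"))
            yield entry.get("id"), formula, attrs.get("elements")
        next_link = (page.get("links") or {}).get("next")
        if isinstance(next_link, dict):
            next_link = next_link.get("href")
        url = next_link
        if url:
            time.sleep(delay)  # fixed pause between requests


def unique_formulas(records):
    """Yield each formula once, in first-seen order, skipping missing ones."""
    seen = set()
    for _entry_id, formula, _elements in records:
        if formula and formula not in seen:
            seen.add(formula)
            yield formula
```

The start URL would be something like the provider's base URL followed by /v1/structures?response_fields=chemical_formula_reduced,chemical_formula_hill,elements&page_limit=100, so that only the needed fields are transferred. Providers may cap the page size or ignore it, and a real client should also handle timeouts and error responses.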

Third, one has to take care of scalability. Fetching and storing tens of millions of entries efficiently might require some additional engineering, especially on commodity hardware.
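One possible way to keep memory use flat on a laptop is to stream the entries into an on-disk SQLite file in batches instead of holding them in a Python list. This is only a sketch of that idea: the table layout, the (provider, entry_id) key used to ignore repeated entries, and the names open_store and store_batched are my own choices, not anything prescribed by OPTIMADE.

```python
import sqlite3
from itertools import islice


def open_store(path):
    """Open (or create) a SQLite file with one row per provider entry."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries ("
        " provider TEXT NOT NULL,"
        " entry_id TEXT NOT NULL,"
        " formula TEXT,"
        " PRIMARY KEY (provider, entry_id))"
    )
    return conn


def store_batched(conn, rows, batch_size=10_000):
    """Insert (provider, entry_id, formula) rows, one transaction per batch."""
    rows = iter(rows)
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            break
        with conn:  # commits the batch, or rolls it back on error
            conn.executemany(
                "INSERT OR IGNORE INTO entries VALUES (?, ?, ?)", batch
            )
```

Because rows can be any iterator, a generator that pages through a provider can be passed in directly, and the distinct formulas can be read back later with SELECT DISTINCT formula FROM entries.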

Looks like a very nice research project!
