It was mentioned that there are ~16.5 million entries across all of OPTIMADE. How does one go about extracting them as a list of compositions?
First, one has to discover all the providers. Using the official listing, this can be done in Python via optimade.server.routers.utils.get_providers or (and?)
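As a rough illustration of the discovery step without any third-party dependency, the sketch below reads the providers listing directly. The URL and the exact JSON layout (a "data" list whose items carry an "id" and an "attributes.base_url" that may be null, a string, or an object with "href") are assumptions to check against the current listing; parse_providers and fetch_providers are hypothetical helper names, not part of any OPTIMADE library.

```python
import json
from urllib.request import urlopen

# Assumed location of the official providers listing; verify before use.
PROVIDERS_URL = "https://providers.optimade.org/v1/links"


def parse_providers(payload):
    """Return {provider_id: base_url} for providers that declare a base URL."""
    providers = {}
    for entry in payload.get("data", []):
        attrs = entry.get("attributes") or {}
        base_url = attrs.get("base_url")
        # A link may be given as a plain string or as an object with "href".
        if isinstance(base_url, dict):
            base_url = base_url.get("href")
        if base_url:  # providers without a public API have a null base_url
            providers[entry["id"]] = base_url.rstrip("/")
    return providers


def fetch_providers(url=PROVIDERS_URL, timeout=30):
    """Download the listing and parse it (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return parse_providers(json.load(response))
```

Note that a provider's base URL may point at an index meta-database rather than at the structures themselves, in which case its own links endpoint has to be followed one level further to reach the actual databases.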
Second, one has to fetch all the structure entries from every provider, taking into account pagination, request rate limiting, removal of duplicates, etc. The formula is given either by the chemical_formula_hill field or by a provider-specific field; the composition is given by the elements field or can be deduced from the formula.
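The fetching step could look roughly like the sketch below, for a single database. It assumes the usual OPTIMADE response shape (a "data" list plus a "links.next" that is null, a URL string, or an object with "href") and that the server honours the response_fields and page_limit query parameters. The function names, the fixed delay used as a crude rate limit, and the choice to treat a (formula, sorted elements) pair as the deduplication key are all illustrative assumptions.

```python
import json
import time
from urllib.parse import urlencode
from urllib.request import urlopen


def get_json(url, timeout=60):
    """Fetch one page of JSON (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def next_link(payload):
    """Return the URL of the next page, or None on the last page."""
    nxt = (payload.get("links") or {}).get("next")
    if isinstance(nxt, dict):
        nxt = nxt.get("href")
    return nxt or None


def iter_structures(base_url, fetch=get_json, delay=1.0, page_limit=100):
    """Yield structure entries page by page, sleeping between requests."""
    query = urlencode({
        "response_fields": "chemical_formula_hill,elements",
        "page_limit": page_limit,
    })
    url = f"{base_url}/v1/structures?{query}"
    while url:
        payload = fetch(url)
        yield from payload.get("data", [])
        url = next_link(payload)
        if url:
            time.sleep(delay)  # crude rate limiting between pages


def collect_compositions(entries):
    """Return the set of unique (formula, sorted elements) pairs."""
    seen = set()
    for entry in entries:
        attrs = entry.get("attributes") or {}
        formula = attrs.get("chemical_formula_hill")
        elements = tuple(sorted(attrs.get("elements") or ()))
        if formula or elements:  # skip entries that carry neither field
            seen.add((formula, elements))
    return seen
```

A call such as collect_compositions(iter_structures(base_url)) would then give the unique compositions of one database. Passing the page fetcher in as an argument keeps the pagination logic testable offline and leaves room for retries or backoff.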
Third, one has to take care of scalability. Fetching and storing tens of millions of entries efficiently might require some additional engineering, especially on commodity hardware.
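One possible way to keep memory use flat on a modest machine is to stream the entries into an on-disk SQLite database in batches and let a primary key reject repeated entries. This is only a sketch of that idea: the table layout and the helper names (open_store, store_entries, distinct_formulas) are made up for illustration, and a real harvest might prefer a different store altogether.

```python
import sqlite3
from itertools import islice


def open_store(path=":memory:"):
    """Open (or create) the store; pass a file path to persist on disk."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries ("
        " provider TEXT NOT NULL, entry_id TEXT NOT NULL,"
        " formula TEXT, elements TEXT,"
        " PRIMARY KEY (provider, entry_id))"
    )
    return conn


def _rows(provider, entries):
    """Lazily turn OPTIMADE-style entries into table rows."""
    for entry in entries:
        attrs = entry.get("attributes") or {}
        elements = ",".join(sorted(attrs.get("elements") or []))
        yield provider, entry["id"], attrs.get("chemical_formula_hill"), elements


def store_entries(conn, provider, entries, batch_size=10_000):
    """Insert entries in batches; rows with a known (provider, id) are ignored."""
    rows = _rows(provider, entries)
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            break
        with conn:  # one transaction per batch
            conn.executemany(
                "INSERT OR IGNORE INTO entries VALUES (?, ?, ?, ?)", batch
            )


def distinct_formulas(conn):
    """Return the sorted list of unique, non-null formulas."""
    cursor = conn.execute(
        "SELECT DISTINCT formula FROM entries"
        " WHERE formula IS NOT NULL ORDER BY formula"
    )
    return [row[0] for row in cursor]
```

Because store_entries consumes any iterable, it can be fed a generator that pages through a provider, so only one batch is held in memory at a time, and an interrupted run can be resumed without creating duplicate rows.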
Looks like a very nice research project!