Extract chemical formulas, stability measure, identifier from all NOMAD entries excluding certain periodic elements

sgbaird · December 9, 2021, 7:11am

TL;DR

I want to use a large list of compounds for materials discovery (scoring/ranking the compounds), and I have a working implementation (see below). However, I’m not sure I’m looking at the right measure of stability, I haven’t figured out how to exclude compounds containing certain elements, and far fewer entries (~200k) are returned via the Python API Client than I expected based on the GUI results (~8M).

EDIT: using conda-forge 0.10.4 to install produces only ~200k whereas using pip 0.10.4 produces the expected ~8M.

My Goal

Great resource! I was looking into using OPTIMADE, but ran into issues with quite a few of the APIs failing (which will hopefully resolve at some point in the future). After some discussion, I decided that NOMAD would be a really good resource to use directly instead. The main appeal for me is to get a list of candidates for materials discovery. Pulling chemical formulas from databases such as NOMAD seemed more fruitful than generating random compositions. After getting this list, I can then rank/sort the candidates based on a trade-off between their predicted performance and novelty using the DiSCoVeR algorithm (disclaimer: I am one of the authors). This is similar to what Bayesian optimization does in suggesting points that handle the trade-off between exploration and exploitation, except that DiSCoVeR puts specific emphasis on chemical novelty. In other words, I want to perform materials discovery based on a list of candidate compounds. The fact that NOMAD stands for “Novel Materials Discovery” is extra persuasive for me.

What I Tried

I saw that NOMAD takes OPTIMADE search queries, and that for a GUI-based search query (at least if the “entries” tab is selected), you can press the <> button to see the corresponding code. Without any search terms, I get the following:

from nomad import client, config
config.client.url = 'http://nomad-lab.eu/prod/rae/api'
results = client.query_archive(query={
    'domain': 'dft'})
print(results)

I looked through the dropdown list of possible quantity=value combinations.

I found NOMAD Meta Info and started browsing through the items there (really cool once I figured out what to do with the info).

As I kept looking through the Python API Client docs, I noticed the first example and started modifying this. In particular, I changed max=100 to max=None and changed required=... such that it only accesses energy_total, chemical_composition_reduced, and calc_id.

I wasn’t able to figure out how to exclude certain elements from the query, though I do know the OPTIMADE filter that would produce the desired result:

excluded_elements = [
    "He", "Ne", "Ar", "Kr", "Xe", "Rn", "U", "Th", "Tc", "Po", "Pu", "Pa",
    ]
f"NOT (elements HAS ANY {excluded_elements})".replace("'", '"')
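A minimal sketch of building that filter string with a helper. As I read the OPTIMADE filter grammar, the operands of HAS ANY are written as comma-separated quoted strings without square brackets, so this helper emits that form; the function name `optimade_exclude_filter` is made up for illustration, and whether NOMAD accepts the result is untested here.

```python
def optimade_exclude_filter(elements):
    """Build an OPTIMADE filter that rejects structures containing any listed element."""
    # Drop duplicates while keeping the original order.
    unique = list(dict.fromkeys(elements))
    # Assumption: list operands are comma-separated quoted strings, no brackets.
    quoted = ", ".join(f'"{symbol}"' for symbol in unique)
    return f"NOT (elements HAS ANY {quoted})"


print(optimade_exclude_filter(["He", "Ne", "Ar", "He"]))
```

Building the string from the list keeps the excluded elements in one place, so the same list can be reused for other query styles.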

My Working Implementation

"""
A simple example that uses the NOMAD client library to access the archive.

Modified from source: https://nomad-lab.eu/prod/rae/docs/archive.html#first-example
to download all chemical formulas `chemical_composition_reduced` along with their calculation ids `calc_id` and total energies `energy_total`.
"""
import pandas as pd
from nomad.client import ArchiveQuery
from nomad.metainfo import units

# %% query NOMAD database
query = ArchiveQuery(
    # url="http://nomad-lab.eu/prod/rae/api",
    query={"dft.code_name": "VASP"},
    required={
        "section_run": {
            "section_single_configuration_calculation": {"energy_total": "*",},
            "section_system": {"chemical_composition_reduced": "*"},
        },
        "section_metadata": {"calc_id": "*"},
    },
    per_page=100,
    max=None,
)

print(query)

# %% extract values
hartree_total_energies = [
    result.section_run[0]
    .section_single_configuration_calculation[-1]
    .energy_total.to(units.hartree)
    for result in query
]

hartree_total_energy_values = [
    hartree_total_energy.m for hartree_total_energy in hartree_total_energies
]

formulas = [
    result.section_run[0].section_system[0].chemical_composition_reduced
    for result in query
]

calc_ids = [result.section_metadata.calc_id for result in query]

# %% combine and save
df = pd.DataFrame(
    {
        "calc_id": calc_ids,
        "formula": formulas,
        "hartree_total_energy": hartree_total_energy_values,
    }
)

df.to_csv("all-formula.csv", index=False)
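The script above iterates over the query three times and assumes every entry has all three values. A hedged alternative is to collect everything in a single pass and skip incomplete entries. This is only a sketch: `extract_row` and `extract_rows` are hypothetical helper names, they rely only on the attribute layout already used above, and the energy is returned as-is (unit conversion is left to the caller).

```python
def extract_row(result):
    """Return (calc_id, formula, energy) for one result, or None if anything is missing."""
    try:
        run = result.section_run[0]
        energy = run.section_single_configuration_calculation[-1].energy_total
        formula = run.section_system[0].chemical_composition_reduced
        calc_id = result.section_metadata.calc_id
    except (AttributeError, IndexError, TypeError):
        # A section or quantity is absent for this entry.
        return None
    if energy is None or formula is None:
        return None
    return calc_id, formula, energy


def extract_rows(results):
    """Walk the results once and keep only the complete rows."""
    rows = (extract_row(result) for result in results)
    return [row for row in rows if row is not None]
```

With the query above, `extract_rows(query)` would give a list of tuples that can be passed to `pd.DataFrame(rows, columns=["calc_id", "formula", "energy_total"])`.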

Questions and Comments

Is energy_total (or perhaps total energy per atom) what I should be looking at as a measure of stability? Ideally, this would be something that I could use for filtering, similar to e.g. e_above_hull < 50 meV for Materials Project.
Why are there so many fewer entries using the Python API Client (263540) compared to the GUI (8,208,077) for query={"dft.code_name": "VASP"}, even if I remove required=... entirely? Is this because I’m using the Python API Client and not one of the other methods? Or maybe the default URL? (EDIT: this seems to be because I was using conda instead of pip to install, despite each listing the same version: 0.10.4)
How do I exclude a list of certain elements?
Any other suggestions/modifications?

Sterling

mscheidgen · December 9, 2021, 8:58am

You already uncovered a lot here. I’ll try to answer your questions

Total energy

In principle there is an energy_total_T0_per_atom quantity, but we are currently only supporting this for FHI-aims calculations.

The number of quantities that we can put into our search engine is limited. Total energy is currently not part of it. Which quantities are indexed in our search engine is debatable of course, but changing this is a long process. Feel free to make an argument in favour of total energy.

Why are there so few results?

This (mostly) depends on the max parameter. If you set it to 100 (as in our example), you get like 2xxK (Number queried entries: 253449). If you set it to None (as in your last example), it should give you like 7.xM (Number queried entries: 7735021).

I think the output is very misleading and we should fix this. The ArchiveQuery is going through all uploads that have data which matches your query. It then accumulates the number of entries in those uploads that match your query. But it stops once this number is bigger than your max.

Another smaller factor is your required spec. It will add this to the query as well and only consider entries that have the requested information (total energy, system, etc.).
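To illustrate why the reported number is neither max nor the full total, here is a toy model of the counting described above. It is not the actual ArchiveQuery code; the function name and the per-upload counts are made up.

```python
def count_queried_entries(matches_per_upload, max_entries=None):
    """Accumulate per-upload match counts, stopping once the total exceeds max_entries."""
    total = 0
    for n_matches in matches_per_upload:
        total += n_matches
        if max_entries is not None and total > max_entries:
            # The whole upload has already been counted, so the total overshoots max.
            break
    return total


# Hypothetical uploads with 60, 70 and 500 matching entries.
print(count_queried_entries([60, 70, 500], max_entries=100))   # 130
print(count_queried_entries([60, 70, 500], max_entries=None))  # 630
```

Because whole uploads are counted at once, a single large upload can push the reported number far beyond max.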

Exclude elements

Your idea of using dft.optimade is good, but due to a bug it is currently not possible to use it with ArchiveQuery. I don’t know why yet; we have to investigate.

Anyhow, you can exclude elements with a query like this:

query = ArchiveQuery(
    # url='http://nomad-lab.eu/prod/rae/beta/api',
    query={
        '$and': {
            'dft.code_name': 'VASP',
            '$not': {
                'atoms': ["Ti", "O"]
            }
        }
    }
)

More tips

The ArchiveQuery is very specialised. It tries to parallelise multiple API calls for the specific purpose of downloading archive information. But it has its bugs and limitations.

If you run into limitations with ArchiveQuery, you can access NOMAD APIs more directly. For example with:

import requests
import json

response = requests.post(
    'http://nomad-lab.eu/prod/rae/api/v1/entries/archive/query', json={
        'query': {
            'and': [
                {
                    'dft.code_name': 'VASP',
                },
                {
                    'not': {
                        'atoms': {
                            'any': ["H", "C", "Li", "Na", "K", "Rb", "Cs"]
                        }
                    }
                }
            ]
        },
        'pagination': {
            'page_size': 10,
            'page_after_value': '----9KNOtIZc9bDFEWxgjeSRsJrC'
        },
        'required': {
            'section_run': {
                'section_single_configuration_calculation[-1]': {
                    'energy_total': '*'
                },
                'section_system[-1]': {
                    'chemical_composition_bulk_reduced': '*'
                }
            }
        }
    })

print(json.dumps(response.json(), indent=2))

This uses our new v1 API and might work more reliably. But here you have to paginate yourself. This example will give you page_size=10 results. The results will contain a next_page_after_value, which you need to put into page_after_value in your next request. With a loop like this you can download lots of data. You can increase page_size, but you have to be careful not to run into timeouts if the requests become too large.
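The pagination loop described above could be sketched roughly as follows. This is a minimal sketch under assumptions: the response is taken to hold the entries under `data` and the cursor under `pagination.next_page_after_value` (the exact location of next_page_after_value is an assumption), and `post_json` and `iter_entries` are hypothetical helper names.

```python
import copy


def post_json(url, body):
    """POST a JSON body and return the decoded JSON response."""
    import requests  # imported lazily so the paging logic can be exercised offline

    response = requests.post(url, json=body)
    response.raise_for_status()
    return response.json()


def iter_entries(url, body, fetch=post_json, max_pages=None):
    """Yield entries page by page, following next_page_after_value until it is absent."""
    body = copy.deepcopy(body)  # do not modify the caller's request body
    n_pages = 0
    while max_pages is None or n_pages < max_pages:
        payload = fetch(url, body)
        yield from payload.get("data", [])
        n_pages += 1
        # Assumption: the cursor for the next page sits under "pagination".
        next_value = payload.get("pagination", {}).get("next_page_after_value")
        if next_value is None:
            break
        body.setdefault("pagination", {})["page_after_value"] = next_value
```

Passing `max_pages` while experimenting keeps a test run from walking through millions of entries, and the `fetch` argument makes it possible to try the loop against canned responses first.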

Another thing: you can use section_single_configuration_calculation[-1] to only get the last instance of a section. Many entries contain a lot of systems and calculations (with VASP, those usually form a geometry optimisation).

If you are only interested in the formulas and don’t care about energies, etc., you can also skip the archive and just access the basic metadata (formulas are part of this, energies are not).

import requests
import json

response = requests.post(
    'http://nomad-lab.eu/prod/rae/api/v1/entries/query', json={
        'query': {
            'and': [
                {
                    'dft.code_name': 'VASP',
                },
                {
                    'not': {
                        'atoms': {
                            'any': ["H", "C", "Li", "Na", "K", "Rb", "Cs"]
                        }
                    }
                }
            ]
        },
        'required': {
            'include': [
                'formula',
                'encyclopedia.material.formula',
                'encyclopedia.material.formula_reduced'
            ]
        },
        'pagination': {
            'page_size': 10,
            'page_after_value': '----9KNOtIZc9bDFEWxgjeSRsJrC'
        }
    })

print(json.dumps(response.json(), indent=2))

This does not go through our archive files and only works on top of our search index. It is much faster, but it only gives you the tip of the available information.
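Since both examples share the same query shape, the request body can be assembled by a small helper. The following is a sketch only: `build_entries_body` is a hypothetical name, and leaving out page_after_value on the first request is an assumption about the API rather than something confirmed above.

```python
def build_entries_body(code_name, excluded_elements, page_size=10, page_after_value=None):
    """Assemble a request body for the v1 entries/query endpoint."""
    pagination = {"page_size": page_size}
    if page_after_value is not None:
        # Assumption: the first request can omit page_after_value.
        pagination["page_after_value"] = page_after_value
    return {
        "query": {
            "and": [
                {"dft.code_name": code_name},
                # Reject entries that contain any of the excluded elements.
                {"not": {"atoms": {"any": sorted(set(excluded_elements))}}},
            ]
        },
        "required": {
            "include": [
                "formula",
                "encyclopedia.material.formula",
                "encyclopedia.material.formula_reduced",
            ]
        },
        "pagination": pagination,
    }


body = build_entries_body("VASP", ["He", "Ne", "Ar", "Kr", "Xe", "Rn"])
print(body["query"])
```

The returned dictionary can be passed as the json argument of requests.post, as in the examples above.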

acarnevali · January 15, 2022, 11:56am

A couple follow-up questions about the Total energy topic:

Why is the energy_total metric not a valuable stability proxy?
Is energy_total_T0_per_atom still supported for a limited number of entries?
What metric would you suggest using in order to gain a reasonable stability proxy supported for a large number of materials?

laurih · January 15, 2022, 7:16pm

Hi @acarnevali,

energy_total is the “raw” energy as reported by the calculation. It does not really carry any stability information as the values are not normalized to any stability reference. It is also an extensive quantity and its values depend on the used methodology (e.g. the XC functional).

energy_total_T0_per_atom is reported by some codes, but it is also not suitable as a stability measure.

Producing a valuable energy stability metric typically involves additional processing that is not done in most calculations by default. It would be really great if we could add such a stability measure for all calculations as a post-processing step, but this becomes incredibly complicated as we are dealing with so many different software packages and methodologies. There may be some workflows or DFT software that directly report e.g. formation energies for bulk systems. If you are aware of any source for such information, please let us know and we can consider adding the metainfo and parsing for it.
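To make the normalisation point concrete, here is a minimal sketch of a formation energy per atom: the total energy minus the summed elemental reference energies, divided by the number of atoms. The function name and all numbers are purely illustrative, and the result is only meaningful if the reference energies were computed with the same methodology (code, XC functional, settings) as the compound.

```python
def formation_energy_per_atom(total_energy, composition, reference_energies):
    """Return (E_total - sum_i n_i * E_ref_i) / N for one calculation.

    composition maps element symbols to atom counts in the simulation cell;
    reference_energies maps element symbols to reference energies per atom.
    """
    n_atoms = sum(composition.values())
    reference_total = sum(
        count * reference_energies[element] for element, count in composition.items()
    )
    return (total_energy - reference_total) / n_atoms


# Made-up numbers in arbitrary energy units, for illustration only.
print(formation_energy_per_atom(-30.0, {"Ti": 1, "O": 2}, {"Ti": -8.0, "O": -5.0}))
```

Unlike the raw energy_total, this quantity is intensive and referenced to the elements, which is what makes it comparable between compounds.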

sgbaird · January 21, 2022, 11:24pm

@mscheidgen @laurih @acarnevali For the stability measures, one option might be to train an ML model on a large database of formation energies, predict the formation energies for NOMAD materials, and use a phase diagram tool to calculate the decomposition energy (or similarly, energy above hull) based on the predicted formation energies, similar to what’s described in DOI: 10.1002/adma.202005112. Then use e_above_hull as a filtering criterion and/or to classify stable vs. non-stable in a “likelihood of stability” sense. At this point, the list of candidates may be sufficiently reduced to “high likelihood of synthesizability” compounds suitable for e.g. arc-melting.
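As a toy illustration of the energy-above-hull step, here is a minimal sketch for a binary A-B system only, where each compound is a point (fraction of B, formation energy per atom) and the elements sit at zero. The function names and data are made up; for real multi-component systems a dedicated phase diagram tool (e.g. pymatgen's PhaseDiagram) would be the better choice.

```python
def _cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def lower_hull(points):
    """Lower convex hull of (x, formation_energy) points, with elements at (0, 0) and (1, 0)."""
    best = {0.0: 0.0, 1.0: 0.0}
    for x, energy in points:
        # Keep only the lowest energy at each composition.
        best[x] = min(energy, best.get(x, energy))
    hull = []
    for point in sorted(best.items()):
        # Drop earlier points that would make the hull non-convex.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], point) <= 0:
            hull.pop()
        hull.append(point)
    return hull


def energy_above_hull(x, energy, hull):
    """Vertical distance from (x, energy) to the hull, by linear interpolation."""
    for (x0, e0), (x1, e1) in zip(hull, hull[1:]):
        if x0 <= x <= x1:
            return energy - (e0 + (e1 - e0) * (x - x0) / (x1 - x0))
    raise ValueError("x must lie between 0 and 1")


candidates = [(0.5, -1.0), (0.25, -0.2), (0.75, -0.6)]  # illustrative values
hull = lower_hull(candidates)
for x, energy in candidates:
    print(x, round(energy_above_hull(x, energy, hull), 3))
```

A compound on the hull gets zero, and anything above it gets a positive value that can be thresholded (e.g. a cutoff in the spirit of e_above_hull < 50 meV).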

Some literature and tools related to stability and synthesizability:

Wen, C.; Zhang, Y.; Wang, C.; Xue, D.; Bai, Y.; Antonov, S.; Dai, L.; Lookman, T.; Su, Y. Machine Learning Assisted Design of High Entropy Alloys with Desired Property. Acta Materialia 2019, 170, 109–117.
Zhang, Z.; Mansouri Tehrani, A.; Oliynyk, A. O.; Day, B.; Brgoch, J. Finding the Next Superhard Material through Ensemble Learning. Adv. Mater. 2021, 33 (5), 2005112. https://doi.org/10.1002/adma.202005112. (referenced above)
Falkowski, A. R.; Kauwe, S. K.; Sparks, T. D. Optimizing Fractional Compositions to Achieve Extraordinary Properties. Integrating Materials and Manufacturing Innovation 2021. https://doi.org/10.1007/s40192-021-00242-3.
Szczypiński, F. T.; Bennett, S.; Jelfs, K. E. Can We Predict Materials That Can Be Synthesised? Chem Sci 12 (3), 830–840.
Therrien, F.; Jones, E. B.; Stevanović, V. Metastable Materials Discovery in the Age of Large-Scale Computation. Applied Physics Reviews 2021, 8 (3), 031310.
Wang, H.-C.; Botti, S.; Marques, M. A. L. Predicting Stable Crystalline Compounds Using Chemical Similarity. npj Comput Mater 2021, 7 (1), 1–9.
Agrawal, A.; Meredig, B.; Wolverton, C.; Choudhary, A. A Formation Energy Predictor for Crystalline Materials Using Ensemble Data Mining. In 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW); 2016; pp 1276–1279.
Bartel, C. J.; Trewartha, A.; Wang, Q.; Dunn, A.; Jain, A.; Ceder, G. A Critical Examination of Compound Stability Predictions from Machine-Learned Formation Energies. npj Comput Mater 2020, 6 (1), 97.

Jang, J.; Gu, G. H.; Noh, J.; Kim, J.; Jung, Y. Structure-Based Synthesizability Prediction of Crystals Using Partially Supervised Learning. J. Am. Chem. Soc. 2020, 142 (44), 18836–18843. https://doi.org/10.1021/jacs.0c07384.

(EDIT) Aykol, M.; Montoya, J. H.; Hummelshøj, J. Rational Solid-State Synthesis Routes for Inorganic Materials. J. Am. Chem. Soc. 2021, 143 (24), 9244–9259. https://doi.org/10.1021/jacs.1c04888.

https://piro.matr.io/ (slick web-app version, currently requires registration and approval)

(EDIT) Peterson, G. G. C.; Brgoch, J. Materials Discovery through Machine Learning Formation Energy. J. Phys. Energy 2021, 3 (2), 022002.

https://www.matlearn.org/

(EDIT) xref: Literature references associated with chemical formulas (or compounds) (mostly in the order in which it was suggested)

And some related tools:

laurih · January 24, 2022, 2:01pm

Thanks a lot for the input @sgbaird!

You are absolutely right, ML might be a very good (and possibly the only) way for us to provide a reasonable stability proxy. @mscheidgen: This is definitely something we should discuss internally and see how feasible it would be to implement.

sgbaird · March 7, 2022, 9:33pm

Snapshot of chemical formulas available on Figshare via my nomad-examples v0.2.0. CSV of 764431 unique formulas out of 11680557 entries. The unique reduced formulas contain 695612 unique chemical formulas, which were parsed using pymatgen.core.Composition. See unique-reduced-formula.csv

EDIT: the links and text above have been updated to reflect the additional curation of reduced chemical formulas and adjusting or filtering out strange “formulas” (only 15 in total).
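For readers without pymatgen at hand, the reduction and de-duplication step can be approximated with a simplified stand-in. This is not what pymatgen.core.Composition does internally: the sketch below only handles plain formulas with integer counts (no parentheses or fractional amounts), orders elements alphabetically rather than by pymatgen's convention, and `reduced_formula` is a made-up name.

```python
import re
from functools import reduce
from math import gcd

_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def reduced_formula(formula):
    """Divide integer element counts by their GCD and return a canonical string."""
    counts = {}
    position = 0
    for match in _TOKEN.finditer(formula):
        if match.start() != position:
            raise ValueError(f"cannot parse formula: {formula!r}")
        position = match.end()
        symbol, digits = match.groups()
        count = int(digits) if digits else 1
        if count == 0:
            raise ValueError(f"zero count in formula: {formula!r}")
        counts[symbol] = counts.get(symbol, 0) + count
    if position != len(formula) or not counts:
        raise ValueError(f"cannot parse formula: {formula!r}")
    divisor = reduce(gcd, counts.values())
    parts = []
    for symbol, count in sorted(counts.items()):
        reduced = count // divisor
        parts.append(symbol if reduced == 1 else f"{symbol}{reduced}")
    return "".join(parts)


formulas = ["Ti2O4", "TiO2", "NaCl"]  # illustrative input
unique_reduced = {reduced_formula(f) for f in formulas}
print(sorted(unique_reduced))
```

Formulas that raise ValueError are the kind of strange “formulas” that would need adjusting or filtering by hand.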

sgbaird · March 18, 2022, 2:55am

@laurih @mscheidgen ALIGNN (not affiliated) has shown some very nice results on formation energy predictions, and MEGNet is another contender (though not currently shown on Matbench). There is also some relevant discussion in the GitHub issue “Potential for stability dataset in matbench v1.0” (Issue #104, materialsproject/matbench), such as data that includes both unrelaxed and relaxed structures (arXiv:2106.11132, “Rapid Discovery of Stable Materials by Coordinate-free Coarse Graining”).

sgbaird · March 20, 2022, 3:51am

@acarnevali I updated the links and text above to reflect the new, more curated version (reduced chemical formulas): 695612 unique chemical formulas based on VASP DFT entries, after processing the formulas with pymatgen.

Crystal structure somewhere in the future…