TL;DR
I want to use a large list of compounds for materials discovery (scoring/ranking the compounds). I have a working implementation (see below), but I’m not sure whether I’m looking at the right measure of stability, I haven’t figured out how to exclude certain compounds, and far fewer entries (~200k) are returned via the Python API Client than I expected based on the GUI results (~8M).
EDIT: using conda-forge (0.10.4) to install produces only ~200k entries, whereas using pip (0.10.4) produces the expected ~8M.
My Goal
Great resource! I was looking into using OPTIMADE, but ran into issues with quite a few of the APIs failing (which will hopefully be resolved at some point). After some discussion, I decided that NOMAD would be a really good resource to use directly instead. The main appeal for me is getting a list of candidates for materials discovery: pulling chemical formulas from databases such as NOMAD seems more fruitful than generating random compositions. With this list, I can then rank/sort the candidates based on a trade-off between their predicted performance and novelty using the DiSCoVeR algorithm (disclaimer: I am one of the authors). This is similar to how Bayesian optimization suggests points that balance exploration and exploitation, except that DiSCoVeR puts specific emphasis on chemical novelty. The fact that NOMAD stands for “Novel Materials Discovery” is extra persuasive for me.
What I Tried
I saw that NOMAD takes OPTIMADE search queries, and that for a GUI-based search query (at least if the “entries” tab is selected), you can press the <> button to see the corresponding code. Without any search terms, I get the following:
from nomad import client, config
config.client.url = 'http://nomad-lab.eu/prod/rae/api'
results = client.query_archive(query={'domain': 'dft'})
print(results)
I looked through the dropdown list of possible quantity=value combinations.
I found NOMAD Meta Info and started browsing through the items there (really cool once I figured out what to do with the info).
As I kept looking through the Python API Client docs, I noticed the first example and started modifying it. In particular, I changed max=100 to max=None and changed required=... such that it only accesses energy_total, chemical_composition_reduced, and calc_id.
I wasn’t able to figure out how to exclude certain elements from the query, though I do know the OPTIMADE filter that would produce the desired result:
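excluded_elements = [
    "He", "Ne", "Ar", "Kr", "Xe", "Rn", "U", "Th", "Tc", "Po", "Pu", "Pa",
]
# OPTIMADE expects a comma-separated list of quoted symbols, without the
# square brackets that Python's list repr would add.
quoted_elements = ", ".join(f'"{element}"' for element in excluded_elements)
optimade_filter = f"NOT (elements HAS ANY {quoted_elements})"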
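Until I find the server-side equivalent, one fallback is to filter the formulas after downloading them. The following is only a minimal sketch: it assumes the formulas are plain strings such as "Fe2O3" with standard one- or two-letter element symbols, and the helper names (`elements_in`, `keep_formula`) are mine, not part of the NOMAD client.

```python
import re

# Elements to exclude (same list as in the OPTIMADE filter above).
EXCLUDED_ELEMENTS = {
    "He", "Ne", "Ar", "Kr", "Xe", "Rn", "U", "Th", "Tc", "Po", "Pu", "Pa",
}

# An element symbol is one uppercase letter, optionally followed by one
# lowercase letter (assumption: no three-letter placeholder symbols).
_ELEMENT_PATTERN = re.compile(r"[A-Z][a-z]?")


def elements_in(formula):
    """Return the set of element symbols that appear in a formula string."""
    return set(_ELEMENT_PATTERN.findall(formula))


def keep_formula(formula, excluded=EXCLUDED_ELEMENTS):
    """Return True if the formula contains none of the excluded elements."""
    return elements_in(formula).isdisjoint(excluded)


formulas = ["Fe2O3", "UO2", "NaCl", "ThO2", "Cu"]
kept = [formula for formula in formulas if keep_formula(formula)]
print(kept)  # ['Fe2O3', 'NaCl', 'Cu']
```

Tokenizing into symbols rather than doing a substring check matters here: "Cu" contains the letter "u" but not the element U, so it is kept.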
My Working Implementation
"""
A simple example that uses the NOMAD client library to access the archive.
Modified from source: https://nomad-lab.eu/prod/rae/docs/archive.html#first-example
to download all chemical formulas (`chemical_composition_reduced`) along with
their calculation ids (`calc_id`) and total energies (`energy_total`).
"""
import pandas as pd
from nomad.client import ArchiveQuery
from nomad.metainfo import units
# %% query NOMAD database
query = ArchiveQuery(
    # url="http://nomad-lab.eu/prod/rae/api",
    query={"dft.code_name": "VASP"},
    required={
        "section_run": {
            "section_single_configuration_calculation": {"energy_total": "*"},
            "section_system": {"chemical_composition_reduced": "*"},
        },
        "section_metadata": {"calc_id": "*"},
    },
    per_page=100,
    max=None,
)
print(query)
# %% extract values
hartree_total_energies = [
    result.section_run[0]
    .section_single_configuration_calculation[-1]
    .energy_total.to(units.hartree)
    for result in query
]
hartree_total_energy_values = [
    hartree_total_energy.m for hartree_total_energy in hartree_total_energies
]
formulas = [
    result.section_run[0].section_system[0].chemical_composition_reduced
    for result in query
]
calc_ids = [result.section_metadata.calc_id for result in query]
# %% combine and save
df = pd.DataFrame(
    {
        "calc_id": calc_ids,
        "formula": formulas,
        "hartree_total_energy": hartree_total_energy_values,
    }
)
df.to_csv("all-formula.csv", index=False)
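One thing I am unsure about in the script above is that it loops over the results three times and would fail (or misalign the columns) if an entry is missing a section. A possible alternative is to collect everything in a single pass and skip incomplete entries. This is only a sketch: `extract_row` and `extract_rows` are hypothetical helpers, the energy is returned as stored (unit conversion is left out), and the demo uses `SimpleNamespace` stand-ins rather than real NOMAD results.

```python
from types import SimpleNamespace


def extract_row(result):
    """Return (calc_id, formula, energy) for one result, or None if incomplete."""
    try:
        run = result.section_run[0]
        # Same choices as the script above: last calculation, first system.
        energy = run.section_single_configuration_calculation[-1].energy_total
        formula = run.section_system[0].chemical_composition_reduced
        calc_id = result.section_metadata.calc_id
    except (AttributeError, IndexError, TypeError):
        return None
    if energy is None or formula is None:
        return None
    return calc_id, formula, energy


def extract_rows(results):
    """Collect rows in one pass so that the three columns always stay aligned."""
    rows = []
    for result in results:
        row = extract_row(result)
        if row is not None:
            rows.append(row)
    return rows


# Stand-in objects that mimic the attribute layout used above (not real data).
complete = SimpleNamespace(
    section_run=[
        SimpleNamespace(
            section_single_configuration_calculation=[
                SimpleNamespace(energy_total=-1.5)
            ],
            section_system=[
                SimpleNamespace(chemical_composition_reduced="NaCl")
            ],
        )
    ],
    section_metadata=SimpleNamespace(calc_id="abc"),
)
incomplete = SimpleNamespace(
    section_run=[
        SimpleNamespace(
            section_single_configuration_calculation=[],  # no energy stored
            section_system=[SimpleNamespace(chemical_composition_reduced="Fe")],
        )
    ],
    section_metadata=SimpleNamespace(calc_id="def"),
)

rows = extract_rows([complete, incomplete])
print(rows)  # [('abc', 'NaCl', -1.5)]
```

With real results, `rows` could be passed to `pd.DataFrame(rows, columns=["calc_id", "formula", "energy_total"])` in place of the three separate lists.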
Questions and Comments
- Is energy_total (or perhaps total energy per atom) what I should be looking at as a measure of stability? Ideally, this would be something that I could use for filtering, similar to e.g. e_above_hull < 50 meV for Materials Project.
- Why are there so many fewer entries using the Python API Client (263540) compared to the GUI (8,208,077) for query={"dft.code_name": "VASP"}, even if I remove required=... entirely? Is this because I’m using the Python API Client and not one of the other methods? Or maybe the default URL? (EDIT: this seems to be because I was using conda instead of pip to install, despite each listing the same version: 0.10.4.)
- How do I exclude a list of certain elements?
- Any other suggestions/modifications?
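Since the conda and pip installs both report 0.10.4, a quick way to compare the two environments might be to print where the package is loaded from and which API URL the client defaults to. This is a sketch I have not verified against either install: the distribution name "nomad-lab" is an assumption, `describe_nomad_install` is a hypothetical helper, and `config.client.url` is the same attribute set in the GUI-generated snippet above.

```python
from importlib import metadata


def describe_nomad_install(dist_name="nomad-lab"):
    """Gather basic facts about the installed NOMAD client, if there is one."""
    info = {"version": None, "module_path": None, "client_url": None}
    try:
        info["version"] = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        pass  # distribution not installed under this (assumed) name
    try:
        import nomad
        from nomad import config

        info["module_path"] = getattr(nomad, "__file__", None)
        info["client_url"] = config.client.url
    except Exception:
        # Broad on purpose: the import may be missing or fail for other
        # reasons; this diagnostic should still report what it can.
        pass
    return info


install_info = describe_nomad_install()
for key, value in install_info.items():
    print(f"{key}: {value}")
```

Running this once in the conda environment and once in the pip environment and comparing the output should show whether the two installs differ in their default URL.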
Sterling