I need to get all identifiers in the mp-XXXXX format, along with the space group and chemical formula for each one. I need this data for training models.
Based on the documentation, I understand how to get information for a specific ID or class. But how can I get all possible identifiers and then get the symmetry and formula for each of them?
Hi @alinzh, here’s a general script to parse this info from MP and dump it to a JSON file.
In the output, space_group_HM is a string with the Hermann–Mauguin space group symbol, and space_group_number is the integer international space group number. Both formula and composition give the structure’s composition: formula is a string and composition is a dict.
from mp_api.client import MPRester
from monty.serialization import dumpfn

# Replace "api_key" with your own Materials Project API key
with MPRester("api_key") as mpr:
    # Request only the fields needed, to keep the download small
    docs = mpr.materials.summary.search(fields=["structure", "material_id"])

# get_space_group_info() returns a (symbol, number) tuple
data = [
    {
        "MPID": str(doc.material_id),
        "space_group_HM": doc.structure.get_space_group_info()[0],
        "space_group_number": doc.structure.get_space_group_info()[1],
        "formula": doc.structure.formula,
        "composition": doc.structure.composition.as_dict(),
    }
    for doc in docs
]

# The .gz extension makes dumpfn write gzip-compressed JSON
dumpfn(data, "training_data.json.gz")
You can also pull documents from MPRester().materials.tasks instead of summary. In that case, change material_id to task_id.
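To use the dumped file later, for example when preparing training data, a minimal sketch along these lines should work. It assumes the file is ordinary gzip-compressed JSON holding a list of dicts with the keys described above. The helper names load_records and count_by_space_group are made up for this example, and the two sample records are hand-written stand-ins for real query output, written to a temporary file so the snippet runs without an API key.

```python
import gzip
import json
import os
import tempfile
from collections import Counter


def load_records(path):
    """Read a gzip-compressed JSON file containing a list of record dicts."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def count_by_space_group(records):
    """Count how many records fall into each international space group number."""
    return Counter(record["space_group_number"] for record in records)


# Hand-written sample records (illustrative only, not real query output).
sample = [
    {
        "MPID": "mp-149",
        "space_group_HM": "Fd-3m",
        "space_group_number": 227,
        "formula": "Si2",
        "composition": {"Si": 2.0},
    },
    {
        "MPID": "mp-22862",
        "space_group_HM": "Fm-3m",
        "space_group_number": 225,
        "formula": "Na1 Cl1",
        "composition": {"Na": 1.0, "Cl": 1.0},
    },
]

# Write the sample to a temporary .json.gz file, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "training_data.json.gz")
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        json.dump(sample, handle)
    records = load_records(path)

counts = count_by_space_group(records)
print(counts.most_common())
```

With the real file, you would call load_records("training_data.json.gz") directly and skip the sample-writing step. monty's loadfn should read the same file as well if you prefer to stay within monty.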