Get all IDs, space groups and chemical formulas

I need to get all identifiers in the mp-XXXXX format, along with the space group and chemical formula for each one. I need this data to train models.
From the documentation, I understand how to get information for a specific ID or class. But how can I get all possible identifiers and then look up the symmetry and formula for each of them?

Hi @alinzh, here’s a general script to parse this info from MP and dump it to a JSON file.
In the output, space_group_HM is a string holding the Hermann–Mauguin space group symbol, and space_group_number is an integer holding the international space group number. Both formula and composition describe the structure's composition: the former is a string and the latter is a dict.

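from mp_api.client import MPRester
from monty.serialization import dumpfn

# Replace "api_key" with your own Materials Project API key.
with MPRester("api_key") as mpr:
    # Request only the fields that are needed, to keep the download smaller.
    docs = mpr.materials.summary.search(fields=["structure", "material_id"])

data = []
for doc in docs:
    # Run the symmetry analysis once per structure; it returns (symbol, number).
    symbol, number = doc.structure.get_space_group_info()
    data.append(
        {
            "MPID": str(doc.material_id),
            "space_group_HM": symbol,
            "space_group_number": number,
            "formula": doc.structure.formula,
            "composition": doc.structure.composition.as_dict(),
        }
    )

# The .gz extension tells monty to write gzip-compressed JSON.
dumpfn(data, "training_data.json.gz")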

You can also pull documents from MPRester().materials.tasks instead of summary. In that case, change material_id to task_id.
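
Since the goal is training data, here is a minimal sketch of reading the dumped file back and grouping IDs by space group number, using only the standard library. It assumes the file is gzip-compressed JSON containing a list of records with the keys described above. The helper names load_records and group_by_space_group are hypothetical, and the sample records are made-up placeholders, not actual Materials Project entries.

```python
import gzip
import json
import os
import tempfile
from collections import defaultdict

# Made-up placeholder records in the same shape as the script's output.
sample = [
    {"MPID": "mp-example-1", "space_group_HM": "Fd-3m", "space_group_number": 227,
     "formula": "Si2", "composition": {"Si": 2.0}},
    {"MPID": "mp-example-2", "space_group_HM": "Fm-3m", "space_group_number": 225,
     "formula": "Na1 Cl1", "composition": {"Na": 1.0, "Cl": 1.0}},
    {"MPID": "mp-example-3", "space_group_HM": "Fd-3m", "space_group_number": 227,
     "formula": "C2", "composition": {"C": 2.0}},
]


def load_records(path):
    """Read a gzip-compressed JSON file and return its contents (a list of dicts)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def group_by_space_group(records):
    """Map each space group number to the list of IDs that have it."""
    groups = defaultdict(list)
    for record in records:
        groups[record["space_group_number"]].append(record["MPID"])
    return dict(groups)


# Write the placeholder records to a temporary file so the sketch runs without
# an API key; with real data, point `path` at training_data.json.gz instead.
path = os.path.join(tempfile.mkdtemp(), "training_data.json.gz")
with gzip.open(path, "wt", encoding="utf-8") as handle:
    json.dump(sample, handle)

records = load_records(path)
groups = group_by_space_group(records)
print(groups)
```

If you have monty installed, loadfn from monty.serialization should read the same file in one call; the version above just avoids the extra dependency.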
