A suggestion to improve the Materials Project Python API

Four-Q · April 28, 2024, 4:07am

The Materials Project is an excellent project, and it has made it easy for me to get useful data.
However, I think I have found one small point that could be improved.
The situation is this:
I want to get the crystal_system for every mp-id in my DataFrame.
So I used the Python API like this:

crystal_system = mpr.materials.summary.search(material_ids=df['mp-id'].tolist(),
                            fields='symmetry')[0].symmetry.crystal_system
df['crystal_system'] = crystal_system

But I got an error: the length of crystal_system did not equal the length of the DataFrame df. My guess is that when the API doesn't find a crystal_system for an mp-id, it simply skips that ID and returns nothing for it.

So I changed my code like this:

from mp_api.client import MPRester

# add crystal_system for every mp-id in df, one request per ID
with MPRester(api_key) as mpr:
    for i, mpid in enumerate(df['mp-id']):
        try:
            crystal_system = mpr.materials.summary.search(material_ids=mpid, fields='symmetry')[0].symmetry.crystal_system
            df.loc[i, 'crystal_system'] = crystal_system
        except Exception:
            # no result for this ID (or the request failed): leave the cell empty
            df.loc[i, 'crystal_system'] = None

But this has its own problem: the second approach is much slower. The first query took only about 30 seconds, whereas the second takes about 13 minutes, which is terrible.

So I suggest that when no corresponding crystal_system is found, the API should return something (maybe None).
Or, if you have a better solution for this, please let me know, because I really can't think of a better way.
Thank you very much!!!

tschaume · May 10, 2024, 9:20pm

Thank you for reaching out! We’re glad to hear that the Materials Project is a valuable resource for you.

Retrieving materials one-by-one is definitely discouraged due to its inefficiency and its potential for getting rate-limited and/or blocked.

I’m not sure whether it’s a typo or not, but in your first code snippet, you’re only using the first result of the search (i.e. [0]), and are then trying to assign its crystal system to a DataFrame column. That’s destined to fail regardless of whether all material IDs in your query exist on MP or not.

The solution to ensure that entries in the crystal_systems column match your list of materials in the mp-id column is to also retrieve the material_id field and use it to map IDs to their corresponding crystal systems. See the following code snippet.

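from mp_api.client import MPRester

material_ids = ["mp-4", "mp-0"]  # in practice: df["mp-id"].to_list()

# One batched query; material_id is requested so each result can be matched to its ID
with MPRester(APIKEY) as mpr:
    docs = mpr.materials.summary.search(
        material_ids=material_ids,
        fields=["material_id", "symmetry.crystal_system"]
    )

# Map each returned ID to its crystal system
map_dict = {
    doc.material_id: str(doc.symmetry.crystal_system)
    for doc in docs
}
# dict.get returns None for IDs that are missing from the results
df["crystal_systems"] = [map_dict.get(mpid) for mpid in material_ids]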
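For anyone who wants to try the mapping step without an API key, here is a minimal offline sketch of the same idea. It assumes the search results behave like objects with material_id and symmetry.crystal_system attributes. The docs list is a hand-made stand-in built with SimpleNamespace, and its crystal system values are placeholders, not real MP data. It also uses pandas Series.map as one possible alternative to the list comprehension, which leaves unmatched IDs as NaN and makes them easy to list.

```python
from types import SimpleNamespace

import pandas as pd

# Hypothetical stand-ins for the documents a batched search would return.
# "mp-0" is deliberately absent, and the crystal systems are placeholder values.
docs = [
    SimpleNamespace(material_id="mp-4", symmetry=SimpleNamespace(crystal_system="Trigonal")),
    SimpleNamespace(material_id="mp-13", symmetry=SimpleNamespace(crystal_system="Cubic")),
]

# The DataFrame may list IDs in any order, with repeats and with unknown IDs.
df = pd.DataFrame({"mp-id": ["mp-13", "mp-0", "mp-4", "mp-13"]})

# Key the lookup by ID; str() keeps both keys and values plain strings.
map_dict = {str(doc.material_id): str(doc.symmetry.crystal_system) for doc in docs}

# Series.map looks each ID up in the dict; IDs without a match become NaN.
df["crystal_system"] = df["mp-id"].map(map_dict)

# List the IDs for which nothing was returned.
missing_ids = df.loc[df["crystal_system"].isna(), "mp-id"].tolist()
print(df)
print("IDs without a result:", missing_ids)
```

Because the lookup is keyed by ID rather than by position, the column always has the same length as the DataFrame, regardless of how many documents come back or in what order.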

HTH