MP API Seems to Return Materials which are not in the Database

Thomas_Warford · October 2, 2024, 8:46pm

Hi all,

I was analysing the distribution of elements in the Materials Project today when I noticed something odd.

When I query the database for the material_ids and elements of all materials, and then select materials containing hydrogen I get 10,449 materials.

When I instead query the database directly for materials containing hydrogen (elements=["H"]), I get 10,394 materials, which matches the website.

When I look up the 55 IDs that are returned only by the first approach, I don’t find anything.

Here’s the code to reproduce this. Sorry if it’s messy; it was originally a Jupyter notebook:

from mp_api.client import MPRester

API_KEY = ""  # your Materials Project API key

with MPRester(API_KEY) as mpr:
    docs = mpr.materials.summary.search(fields=["material_id", "elements"])

mp_elements = {}
for doc in docs:
    mp_elements[doc.material_id.string] = [element.number for element in doc.elements]

cutoff_element = 18 # Argon (two rows of periodic table)

materials_by_element = {}

for element in range(1, cutoff_element+1):
    
    materials_by_element[element] = []
    
    for (mp_id, elements) in mp_elements.items():
        if element in elements:
            materials_by_element[element].append(mp_id)
    
counts = {element: len(mp_ids) for element, mp_ids in materials_by_element.items()}
print("counts per element", counts)

with MPRester(API_KEY) as mpr:
    docs = mpr.materials.summary.search(
        elements=["H"], fields=["material_id", "formula_pretty"]
    )
    mpid_formula_dict = {
        doc.material_id: doc.formula_pretty for doc in docs
    }
print("number of materials from approach 2:", len(docs))

# Use a set for fast membership tests; avoid shadowing the built-in `id`.
h_ids = {mpid.string for mpid in mpid_formula_dict}
weird_ids = [mpid for mpid in materials_by_element[1] if mpid not in h_ids]

print("number of weird IDs:", len(weird_ids))

print(weird_ids)

with MPRester(API_KEY) as mpr:
    docs = mpr.materials.summary.search(
        material_ids=weird_ids
    )

print("downloaded weird ID docs:", docs)
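
The comparison in the script above amounts to an order-preserving set difference between two collections of IDs. A minimal sketch of just that step, on made-up placeholder IDs rather than real query results (the helper name ids_only_in_first is hypothetical and not part of mp_api):

```python
def ids_only_in_first(first_ids, second_ids):
    """Return IDs present in first_ids but not in second_ids, keeping order."""
    second = set(second_ids)  # set lookup avoids rescanning a list for every ID
    return [mp_id for mp_id in first_ids if mp_id not in second]


# Placeholder IDs for illustration only, not real Materials Project entries.
all_h_ids = ["id-1", "id-2", "id-3", "id-4"]
filtered_h_ids = ["id-2", "id-4"]

extra_ids = ids_only_in_first(all_h_ids, filtered_h_ids)
print(extra_ids)
```

With roughly 10,000 IDs on each side, converting the second collection to a set keeps the comparison fast, since each membership check no longer scans a whole list.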

Here’s the code’s output, with the hydrogen-containing materials I can’t find elsewhere.

Retrieving SummaryDoc documents:
155361/? [06:52<00:00, 430.81it/s]

counts per element {1: 10449, 2: 8, 3: 21761, 4: 1189, 5: 6370, 6: 9083, 7: 11442, 8: 82406, 9: 12136, 10: 1, 11: 12873, 12: 19084, 13: 7805, 14: 12758, 15: 16913, 16: 15397, 17: 6425, 18: 2}

Retrieving SummaryDoc documents: 100%
10394/10394 [00:02<00:00, 3904.72it/s]

number of materials from approach 2: 10394
number of weird IDs: 55
['mp-697915', 'mp-1187975', 'mp-632667', 'mp-634930', 'mp-634751', 'mp-864603', 'mp-625103', 'mp-626421', 'mp-632348', 'mp-1070852', 'mp-2646948', 'mp-1025273', 'mp-1103732', 'mp-626413', 'mp-643108', 'mp-1207586', 'mp-740759', 'mp-1206323', 'mp-1018646', 'mp-1018647', 'mp-1187892', 'mp-1207571', 'mp-1207559', 'mp-979964', 'mp-1198634', 'mp-1195507', 'mp-1105386', 'mp-1216487', 'mp-643246', 'mp-1195012', 'mp-1195544', 'mp-1203501', 'mp-643071', 'mp-1200022', 'mp-705525', 'mp-555985', 'mp-1202633', 'mp-1202946', 'mp-1202882', 'mp-1198247', 'mp-1238179', 'mp-1200794', 'mp-1191250', 'mp-697925', 'mp-699393', 'mp-1212344', 'mp-722346', 'mp-1203140', 'mp-1200555', 'mp-1202119', 'mp-1193866', 'mp-1190437', 'mp-1198865', 'mp-1200481', 'mp-1200272']

Retrieving SummaryDoc documents:
0/0 [00:00<?, ?it/s]

downloaded weird ID docs: []

The same happens for other elements. I can supply more IDs if you’d like.

Please let me know if I’m missing something here!

Thanks

tschaume · October 2, 2024, 9:13pm

Yes, that’s a quirk in the client we’re aware of (also see here). The 55 materials are deprecated, but they are still returned when mpr.materials.summary.search() is run without a query. You should be able to add "deprecated" to the fields argument and use it to post-filter. HTH.
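
A minimal sketch of the suggested post-filtering step, assuming each returned document exposes a boolean deprecated attribute once "deprecated" is included in fields. The helper name drop_deprecated and the stand-in documents are hypothetical; whether the flag is actually populated on the unfiltered query is worth checking against real output.

```python
from types import SimpleNamespace


def drop_deprecated(docs):
    """Keep only documents whose `deprecated` attribute is falsy.

    A document without the attribute is treated as not deprecated
    (an assumption made for this sketch).
    """
    return [doc for doc in docs if not getattr(doc, "deprecated", False)]


# Stand-in documents; real ones would come from
# mpr.materials.summary.search(fields=["material_id", "elements", "deprecated"]).
sample_docs = [
    SimpleNamespace(material_id="id-1", deprecated=False),
    SimpleNamespace(material_id="id-2", deprecated=True),
]

kept = drop_deprecated(sample_docs)
print([doc.material_id for doc in kept])
```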

Thomas_Warford · October 2, 2024, 10:04pm

That seems like the best way to do things. Thanks for the help.

Thomas_Warford · October 2, 2024, 10:34pm

Strangely, doc.deprecated is always False for the first query (see below), so it can’t be used to post-filter. I think I’ll make individual queries for each element instead.

with MPRester(API_KEY) as mpr:
    docs = mpr.materials.summary.search(fields=["material_id", "elements", "deprecated"])

mp_elements = {}
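for doc in docs:
    # Diagnostic: report any flag that is not exactly False (nothing is printed in practice).
    if doc.deprecated is not False:
        print(doc.deprecated)
    # Skip deprecated materials entirely instead of storing an empty element list.
    if doc.deprecated:
        continue
    mp_elements[doc.material_id.string] = [element.number for element in doc.elements]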
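
The per-element approach mentioned above could be sketched roughly as follows. It relies on the observation from the first post that a search with an elements filter returned only the 10,394 hydrogen materials shown on the website. The helper name material_ids_by_element is hypothetical, and using str() on material_id is assumed to give the "mp-..." form.

```python
# Symbols for Z = 1..18, matching cutoff_element = 18 in the original script.
SYMBOLS = ["H", "He", "Li", "Be", "B", "C", "N", "O", "F", "Ne",
           "Na", "Mg", "Al", "Si", "P", "S", "Cl", "Ar"]


def material_ids_by_element(search, symbols=SYMBOLS):
    """Run one filtered query per element and collect the returned IDs.

    `search` is any callable accepting `elements=` and `fields=` keyword
    arguments, for example mpr.materials.summary.search.
    """
    ids_by_element = {}
    for symbol in symbols:
        docs = search(elements=[symbol], fields=["material_id"])
        # str() is assumed to yield the "mp-..." string form of the ID.
        ids_by_element[symbol] = [str(doc.material_id) for doc in docs]
    return ids_by_element


# Usage with the real client (needs network access and an API key):
# from mp_api.client import MPRester
# with MPRester(API_KEY) as mpr:
#     ids_by_element = material_ids_by_element(mpr.materials.summary.search)
#     counts = {symbol: len(ids) for symbol, ids in ids_by_element.items()}
```

Passing the search function in as an argument keeps the loop independent of the client, so it can be tried on stand-in data before spending time on 18 real queries.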