I noticed that when I try to extract all the data in the MP database via:
FIELDS=["material_id","formula_pretty","elements","nelements","nsites","database_IDs","energy_above_hull","band_gap"]
mpr.materials.summary.search(fields=FIELDS, chunk_size=1000, num_chunks=num_chunks)
the results differ between num_chunks=None (which according to the docs should extract all chunks) and num_chunks set to some large number such as 100000000. More specifically, with num_chunks=None I get 210,578 entries, none of which have any ICSD IDs, whereas with num_chunks=100000000 I get 154,879 entries, of which 51,619 do have ICSD IDs. Any guidance on why the difference occurs?
Hey @aaronzhu, there’s a slight issue with our summary collection data on AWS. When you pass no arguments to mpr.materials.summary.search, that query gets routed to AWS, whereas passing some arguments will route to MongoDB, where that data issue has already been corrected.
The ~60,000 “missing” materials are the GNoME materials, which are licensed CC-BY-NC. It sounds like you may not have accepted the terms of service for accessing the GNoME materials.
We'll work on fixing this soon. If you need an immediate workaround, this will work:
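from mp_api.client import MPRester
with MPRester() as mpr:
    # Passing any filter (here an effectively unbounded volume range) avoids the
    # no-argument path that is routed to AWS, as described above.
    docs = mpr.materials.summary.search(volume=(0, 1e20), fields=["material_id", "database_IDs"])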
Thanks for the information! It would be great to know when the fix has been made.
Using the workaround, I get 200,487 entries where 51,671 have ICSD tags. Seems like there still may be ~10K entries missing from the 210,578 I got from before (when using num_chunks=None), but should be sufficient for now. Appreciate the help!
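To see exactly which entries account for a gap like this, one option is to compare the material_id sets of the two result lists. This is only a sketch: the helper name diff_material_ids is hypothetical, it assumes each document is a plain dict with a "material_id" key (for example, results fetched with use_document_model=False), and the IDs below other than mp-149 are made-up placeholders rather than real query output.

```python
def diff_material_ids(docs_a, docs_b):
    """Return (only_in_a, only_in_b) as sets of material_id strings.

    Assumes each doc is a dict with a "material_id" key (hypothetical helper).
    """
    ids_a = {str(d["material_id"]) for d in docs_a}
    ids_b = {str(d["material_id"]) for d in docs_b}
    return ids_a - ids_b, ids_b - ids_a


# Toy stand-ins for two downloads; placeholder IDs, not real results.
full_download = [
    {"material_id": "mp-149"},
    {"material_id": "placeholder-1"},
    {"material_id": "placeholder-2"},
]
workaround_download = [
    {"material_id": "mp-149"},
    {"material_id": "placeholder-1"},
]

only_full, only_workaround = diff_material_ids(full_download, workaround_download)
print(f"{len(only_full)} only in full download, {len(only_workaround)} only in workaround")
```

The IDs in only_full could then be inspected individually to check what kind of entries they are.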
The other 10k should be GNoME materials or deprecated materials. The GNoME materials won’t have database_IDs, since they’re not experimental; the deprecated materials are likely physically unrealistic, so you may want to exclude those anyway.
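If it helps, here is a rough sketch of how the downloaded documents could be sorted into those groups locally. It assumes plain dict documents, and it assumes a boolean "deprecated" field was included in the requested fields (that field name is an assumption here, not something confirmed in this thread). The helper name split_docs and the toy entries are hypothetical.

```python
def split_docs(docs):
    """Bucket docs into deprecated / no database_IDs / has database_IDs.

    Assumes dict docs; the "deprecated" key is an assumed field name.
    """
    buckets = {"deprecated": [], "no_database_ids": [], "has_database_ids": []}
    for doc in docs:
        if doc.get("deprecated", False):
            buckets["deprecated"].append(doc)
        elif not doc.get("database_IDs"):
            # Empty dict, None, or missing key: no experimental database match.
            buckets["no_database_ids"].append(doc)
        else:
            buckets["has_database_ids"].append(doc)
    return buckets


# Toy documents with placeholder IDs (not real query output).
toy_docs = [
    {"material_id": "mp-149", "deprecated": False, "database_IDs": {"icsd": ["icsd-76268"]}},
    {"material_id": "placeholder-1", "deprecated": False, "database_IDs": {}},
    {"material_id": "placeholder-2", "deprecated": True, "database_IDs": {}},
]

buckets = split_docs(toy_docs)
for name, group in buckets.items():
    print(name, len(group))
```

Note that an empty database_IDs entry only means there is no experimental database match; it does not by itself identify a material as GNoME.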
Hi @aaronzhu, sorry for the mixup on the full download, and thanks @Aaron_Kaplan for the quick workaround.
The fix of the data on AWS is complete. You should now be able to retrieve the full dataset with an empty .search() query, and the documents will have the database_IDs field as expected:
>>> with MPRester(use_document_model=False, monty_decode=False) as mpr:
...     local_ds = mpr.materials.summary.search()
>>> next(filter(lambda x: x["material_id"] == "mp-149", local_ds))["database_IDs"]
{'icsd': ['icsd-76268', 'icsd-181356', 'icsd-659044', 'icsd-60388', 'icsd-652258', 'icsd-652257', 'icsd-67788', 'icsd-51688', 'icsd-652265', 'icsd-29287', 'icsd-43610', 'icsd-52266', 'icsd-52457', 'icsd-43403', 'icsd-53783', 'icsd-181907', 'icsd-181355', 'icsd-652255', 'icsd-60389', 'icsd-60385', 'icsd-182730', 'icsd-29288', 'icsd-150530', 'icsd-94261', 'icsd-60386', 'icsd-60387', 'icsd-426975', 'icsd-41979', 'icsd-53782', 'icsd-191759']}
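For anyone wanting to reproduce the ICSD counts discussed earlier in the thread, a minimal sketch is below. It assumes the documents are plain dicts (as with use_document_model=False) whose "database_IDs" value has the shape shown above, possibly empty or missing. The helper names has_icsd and count_icsd are hypothetical, and the toy list uses placeholder entries apart from one ICSD ID taken from the mp-149 output.

```python
def has_icsd(doc):
    """True if the doc lists at least one ICSD ID under database_IDs["icsd"]."""
    ids = doc.get("database_IDs") or {}  # tolerate a missing key or None
    return bool(ids.get("icsd"))


def count_icsd(docs):
    """Count docs that carry at least one ICSD ID (hypothetical helper)."""
    return sum(1 for doc in docs if has_icsd(doc))


# Toy documents; only mp-149 and its ICSD ID come from the output above.
sample_docs = [
    {"material_id": "mp-149", "database_IDs": {"icsd": ["icsd-76268"]}},
    {"material_id": "placeholder-1", "database_IDs": {}},
    {"material_id": "placeholder-2", "database_IDs": None},
]

n_icsd = count_icsd(sample_docs)
print(f"{n_icsd} of {len(sample_docs)} entries have ICSD IDs")
```

On a real download, count_icsd(local_ds) would give the number of entries with ICSD tags to compare against the figures quoted above.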