Material id present in list of all valid tasks, but not in list of entries

Material id mp-1249732 returns nothing on the MP web site and is not in the list of all entries I downloaded with mpr.materials.thermo.search(), but several of the task ids associated with it (including the material id itself)

mp-1249732
mp-1263333
mp-1872191

are in the list of valid entries I downloaded with pd.read_parquet(f"s3://materialsproject-build/collections/{DB_VERSION}/task-validation/manifest.parquet"), filtering for .valid == True, with DB_VERSION = '2024-11-14'.
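
For reference, this is roughly how I built that list (a minimal sketch; it assumes pandas can read the public bucket directly, e.g. with s3fs installed):

import pandas as pd

DB_VERSION = "2024-11-14"

# read the task-validation manifest straight from the public bucket
manifest = pd.read_parquet(
    f"s3://materialsproject-build/collections/{DB_VERSION}/task-validation/manifest.parquet"
)

# keep only the tasks flagged as valid
valid_tasks = manifest[manifest.valid]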

Why would a valid task_id have a material id that’s not in the entry database?

On a related note, what would be the reasons this material id might have no entry to begin with (independent of the task validity)? I know all the old Yb results were removed due to the PAW issue, but that’s definitely not the issue here (Al-Cu-Sb-Si-O).

The ID mp-1249732 is a non-deprecated task ID, and its parent material ID mp-1044336 is in MP:

from mp_api.client import MPRester

with MPRester() as mpr:
    mat_doc = mpr.materials.search(task_ids=['mp-1249732'])[0]

print(mat_doc.material_id)
# MPID(mp-1044336)

print(sorted(task_id for task_id in mat_doc.task_ids if task_id not in mat_doc.deprecated_tasks))
# [MPID(mp-1044336), MPID(mp-1249732), MPID(mp-1252077), MPID(mp-1252112), MPID(mp-1257436), MPID(mp-1259927), MPID(mp-1263333), MPID(mp-1353680), MPID(mp-1872191), MPID(mvc-9133)]

For building thermo docs, we often have multiple tasks with the same run type (e.g., PBE GGA static). Within a given run type, only one task is used, corresponding to the one with the lowest energy per atom.
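
Schematically, the per-run-type selection looks something like this (a sketch with made-up task records and energies, not the actual thermo builder code):

from collections import defaultdict

# hypothetical task records; the IDs and energies are illustrative only
tasks = [
    {"task_id": "mp-aaa", "run_type": "GGA Static", "energy_per_atom": -6.12},
    {"task_id": "mp-bbb", "run_type": "GGA Static", "energy_per_atom": -6.08},
    {"task_id": "mp-ccc", "run_type": "GGA Structure Optimization", "energy_per_atom": -6.05},
]

# group tasks by run type
by_run_type = defaultdict(list)
for task in tasks:
    by_run_type[task["run_type"]].append(task)

# within each run type, keep only the lowest-energy-per-atom task
selected = {
    run_type: min(group, key=lambda t: t["energy_per_atom"])
    for run_type, group in by_run_type.items()
}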

Thanks for clarifying. I guess the fundamental problem is that in mptrj, that task id (mp-1249732) was labelled with the wrong material id (i.e. the task id itself was reused as the material id). Yet another issue to watch out for in mptrj.

However, since I want to be able to fix these issues myself, I now realize that I don’t know how to find out the correct material id for each task. I don’t see the parent material id you associated with it (mp-1044336) anywhere in this task’s records in the downloaded all-tasks directory. How do I determine the correct parent material id?

The set of material IDs is a subset of the task IDs, so given a generic MPID, you can do this from the API:

from mp_api.client import MPRester

# MPID here is a placeholder for the ID string you want to check, e.g. "mp-1249732"
with MPRester() as mpr:
    mat_doc = mpr.materials.search(task_ids=[MPID])[0]
is_material = mat_doc.material_id == MPID

If is_material is True, then MPID corresponds both to a task ID and the material ID. If it’s False, then MPID only corresponds to a task ID.
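
Wrapped into a small helper (a sketch; it assumes the search by task_ids returns exactly one parent document, as above):

from mp_api.client import MPRester

def parent_material_id(mpid: str) -> str:
    """Return the parent material ID for a given task or material ID (sketch)."""
    with MPRester() as mpr:
        mat_doc = mpr.materials.search(task_ids=[mpid])[0]
    return str(mat_doc.material_id)

# e.g. parent_material_id("mp-1249732") -> "mp-1044336"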

Building off of that, you can get a mapping of all non-deprecated task IDs to their parent material ID:

mapping = {task_id: mat_doc.material_id for task_id in mat_doc.task_ids if task_id not in mat_doc.deprecated_tasks}

Thanks. To avoid many further queries on your server, can I bulk download this for all materials (ideally restricted to just the task_ids and deprecated_tasks lists, I guess)? If so, what do I pass to mpr.materials.search(...)?

mpr.materials.search() with no search criteria will retrieve all of the documents in an efficient way, thanks for checking!

Thanks. I’ll try to do it a minimal number of times and cache the results.

I suggest just running once to pull all documents and then saving them in blocks if need be. For example, if you only need the task ID to material ID mapping:

from monty.serialization import dumpfn
from mp_api.client import MPRester

with MPRester() as mpr:
    mat_docs = mpr.materials.search(fields=["material_id", "task_ids", "deprecated_tasks"])

# map each material ID to its non-deprecated task IDs
material_to_tasks = {
    mat_doc.material_id.string: [
        task_id.string
        for task_id in mat_doc.task_ids
        if task_id not in mat_doc.deprecated_tasks
    ]
    for mat_doc in mat_docs
}

dumpfn(material_to_tasks, "material_id_to_task_ids.json.gz")

which will save it as gzipped JSON.
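
To read it back later, or to get the inverse task-ID-to-material-ID lookup (which sounds like what you need for the mptrj fix), something like this should work:

from monty.serialization import loadfn

# load the gzipped JSON back into a plain dict
material_to_tasks = loadfn("material_id_to_task_ids.json.gz")

# invert it into a task_id -> material_id lookup
task_to_material = {
    task_id: material_id
    for material_id, task_ids in material_to_tasks.items()
    for task_id in task_ids
}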

Thanks - is there any prefab way to dump the entire mat_doc, e.g. convert it to a nested dict so I can just use json?

Yes, using dumpfn as in the code snippet in my previous message:

from monty.serialization import dumpfn

dumpfn(mat_docs, "<file name goes here>.json")

@noam.bernstein In addition to @Aaron_Kaplan’s solutions, you can also directly download the underlying .jsonl.gz files from our AWS OpenData repo using the AWS CLI. See AWS OpenData | Materials Project Documentation
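
If you go that route, the shards are line-delimited JSON, so something like this should read one back into a dataframe (the file name is just a placeholder for whatever you downloaded):

import pandas as pd

# placeholder path to one downloaded shard; pandas infers the gzip compression
df = pd.read_json("downloaded_shard.jsonl.gz", lines=True)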
