Hi @Junki
I can’t answer your question re: site-averaged vs. site-specific spectra; others with more knowledge on that will have to chime in.
> documents appear to be task IDs
Correct: for the XAS collection, material_id is actually a task_id. This is an older collection that needs to be migrated/updated.
> join XAS results directly to summary documents without this per-record lookup
Unfortunately not at this time: the task_id to material_id lookup cannot easily be done at scale right now because of the many-to-one relationship between task_ids and material_ids (one material aggregates many calculation tasks).
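To illustrate why the bulk lookup is awkward: several task_ids roll up into one material_id, so inverting the mapping fans out. A toy sketch with made-up IDs (not real MP identifiers):

```python
# Hypothetical IDs, purely illustrative of the many-to-one relationship
task_to_material = {
    "mp-task-100": "mp-1",
    "mp-task-101": "mp-1",  # same material, different calculation task
    "mp-task-200": "mp-2",
}

# Invert to material -> [tasks]; each material fans out to many tasks
material_to_tasks: dict[str, list[str]] = {}
for tid, mid in task_to_material.items():
    material_to_tasks.setdefault(mid, []).append(tid)
```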
I can, however, help you retrieve the structures for all of the XAS task_ids more efficiently, rather than submitting one request at a time, using some alpha features/data products we’ve been working on.
The following script requires an additional library (`pip install deltalake`) to use the ‘lake’ part of MP’s data lakehouse. You’ll also need a recent emmet-core with pyarrow, and Python 3.12 for `batched` from itertools (you could also copy the rough implementation of `itertools.batched` from its docs if you need to support <3.12).
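For reference, a minimal stand-in for `itertools.batched` (adapted from the recipe in the itertools docs) if you’re stuck below 3.12:

```python
from itertools import islice


def batched(iterable, n):
    # Yield successive tuples of up to n items, like itertools.batched (3.12+).
    if n < 1:
        raise ValueError("n must be at least one")
    it = iter(iterable)
    while chunk := tuple(islice(it, n)):
        yield chunk
```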
Consider this “experimental” for now, hopefully we can better leverage these types of things internally in the mp_api client in the future for user convenience.
```python
from itertools import batched

import pyarrow as pa
from deltalake import DeltaTable, QueryBuilder
from emmet.core.arrow import arrowize
from emmet.core.mpid import AlphaID
from emmet.core.types.pymatgen_types.structure_adapter import StructureType
from mp_api.client import MPRester
from pydantic import TypeAdapter

tasks_uri = "s3a://materialsproject-parsed/core/tasks"


def retrieve_xas_structures(batch: list[str]) -> pa.Table:
    # creating these here (Table + QueryBuilder) assuming
    # this will be parallelized; if run in a loop in one process,
    # just make these global to avoid rebuilding them on every call
    tasks_tbl = DeltaTable(
        tasks_uri,
        storage_options={"AWS_SKIP_SIGNATURE": "true", "AWS_REGION": "us-east-1"},
    )
    tasks_qb = QueryBuilder().register("tasks", tasks_tbl)

    query_str = ",".join([f"'{tid}'" for tid in batch])
    return pa.table(
        tasks_qb.execute(
            f"""
            SELECT task_id
                 , structure
            FROM tasks
            WHERE task_id IN ({query_str});
            """
        ).read_all(),
        schema=pa.schema(
            [
                pa.field("task_id", pa.string()),
                pa.field("structure", arrowize(StructureType)),
            ]
        ),
    )


with MPRester() as mpr:
    xas_task_ids = set(
        # multiple spectra per task;
        # material_id is actually a task_id for this old collection
        [doc["material_id"] for doc in mpr.materials.xas.search(fields=["material_id"])]
    )

# convert the legacy ID format to the internal AlphaID format
# len(as_alpha) == ~60k
as_alpha = [str(AlphaID(tid, padlen=8)).split("-")[-1] for tid in xas_task_ids]

arrow_tables = []
for batch in batched(as_alpha, 3000):  # adjust batch size as appropriate
    # batches can instead be submitted to distributed compute and then gathered:
    # concurrent.futures, Dask, Ray, etc.
    arrow_tables.append(retrieve_xas_structures(batch))

all_structures_table = pa.concat_tables(arrow_tables)

# likely worth writing this to a local file for later use:
# import pyarrow.parquet as pq
# pq.write_table(
#     all_structures_table,
#     "xas_structures.parquet.zstd",
#     compression="ZSTD",
# )

# can deserialize everything to pymatgen structures, or filter/join/etc. first
pmg_structures = TypeAdapter(list[StructureType]).validate_python(
    all_structures_table["structure"].to_pylist(maps_as_pydicts="strict")
)
```
(see note here about arrow deserialization)
If you were to distribute the batches in that core loop with a library like Dask, you could get a pretty significant speedup. You could also just run the loop as is :shrug:
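As a rough sketch of the stdlib route with `concurrent.futures` (here `fetch` is a stand-in for a function shaped like `retrieve_xas_structures` above, not a real MP call):

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(batch: list[str]) -> list[str]:
    # stand-in for retrieve_xas_structures; a real version would query the lakehouse
    return list(batch)


ids = [f"tid-{i}" for i in range(10)]
batches = [ids[i : i + 3] for i in range(0, len(ids), 3)]

# pool.map preserves input order, so results line up with batches
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch, batches))
```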
That aside, I’ll think more about how we can make the bulk task_id to material_id lookup more ergonomic.