Downloading Site-Specific XAS Spectra (FEFF) with Structures and Full Properties

Hi everyone,

My goal is to bulk-download all site-specific XAS spectra (FEFF computations) from the Materials Project, along with their corresponding structures and the full set of available material properties.

I’ve run into two issues that are making this harder than expected, and I’d love some guidance.

The standard XAS search endpoint returns site-averaged spectra, not the individual site-specific ones:
mpr.materials.xas.search(all_fields=True)
Is there a parameter or a different endpoint that exposes site-specific (per-absorbing-site) spectra?

The material_id values returned in XAS documents appear to be task IDs rather than canonical material IDs. To retrieve the corresponding summary document I currently have to call:
mpr.get_material_id_from_task_id(material_id)
…once per material, which is extremely slow at scale.

Is there a vectorised/batch alternative, or a way to join XAS results directly to summary documents without this per-record lookup?

Any suggestions for a clean, efficient workflow to retrieve XAS data + structure + properties?

Thanks in advance.

Hi @Junki

I can’t answer your question re: site-averaged vs. site-specific spectra; others with more knowledge on that will have to chime in.

documents appear to be task IDs

Correct, for the XAS collection material_id is actually a task_id. This is an older collection that needs to be migrated/updated.

join XAS results directly to summary documents without this per-record lookup

Unfortunately not at this time; task_id-to-material_id lookup cannot easily be done at scale right now due to the many-to-one relationship of task_ids to material_ids.

I can however help you a bit to do better retrieval of all the structures for the XAS task_ids rather than submitting one request at a time using some alpha features/data products we’ve been working on.

The following script requires an additional library (pip install deltalake) to utilize the ‘lake’ part of MP’s data lakehouse. You’ll also need a recent emmet-core with pyarrow, and Python 3.12 for batched from itertools (you could also copy the rough implementation from the itertools.batched documentation if you need to support < 3.12).
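For reference, if you do need to stay below Python 3.12, a rough drop-in for itertools.batched (adapted from the recipe in its documentation; batched_compat is just my name for it) could look like this:

```python
from collections.abc import Iterable, Iterator
from itertools import islice


def batched_compat(iterable: Iterable, n: int) -> Iterator[tuple]:
    """Rough stand-in for itertools.batched (Python 3.12+)."""
    if n < 1:
        raise ValueError("n must be at least one")
    it = iter(iterable)
    # pull n items at a time until the iterator is exhausted
    while batch := tuple(islice(it, n)):
        yield batch


# e.g. list(batched_compat("ABCDEFG", 3))
# -> [("A", "B", "C"), ("D", "E", "F"), ("G",)]
```

You can then use batched_compat anywhere batched appears below.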

Consider this “experimental” for now, hopefully we can better leverage these types of things internally in the mp_api client in the future for user convenience.

from itertools import batched

import pyarrow as pa
from deltalake import DeltaTable, QueryBuilder
from emmet.core.arrow import arrowize
from emmet.core.mpid import AlphaID
from emmet.core.types.pymatgen_types.structure_adapter import StructureType
from mp_api.client import MPRester
from pydantic import TypeAdapter

tasks_uri = "s3a://materialsproject-parsed/core/tasks"


def retrieve_xas_structures(batch: list[str]) -> pa.Table:
    # creating these here (Table + QueryBuilder) assuming
    # this will be parallelized; if run in a loop in one process,
    # just make these global to avoid rebuilding them on every call
    tasks_tbl = DeltaTable(
        tasks_uri,
        storage_options={"AWS_SKIP_SIGNATURE": "true", "AWS_REGION": "us-east-1"},
    )
    tasks_qb = QueryBuilder().register("tasks", tasks_tbl)

    query_str = ",".join([f"'{tid}'" for tid in batch])

    return pa.table(
        tasks_qb.execute(
            f"""
        SELECT task_id
             , structure
        FROM   tasks
        WHERE  task_id in ({query_str});
    """
        ).read_all(),
        schema=pa.schema(
            [
                pa.field("task_id", pa.string()),
                pa.field("structure", arrowize(StructureType)),
            ]
        ),
    )


with MPRester() as mpr:
    xas_task_ids = set(
        # multiple spectra per task
        # material_id is task_id for this old collection
        [doc["material_id"] for doc in mpr.materials.xas.search(fields=["material_id"])]
    )

# convert legacy id format to internal AlphaID format
# len(as_alpha) == ~60k
as_alpha = [str(AlphaID(tid, padlen=8)).split("-")[-1] for tid in xas_task_ids]


arrow_tables = []
for batch in batched(as_alpha, 3000):  # adjust batch size as appropriate
    # can be submitted to distributed compute and then gathered:
    # concurrent.futures, Dask, Ray, etc.
    arrow_tables.append(retrieve_xas_structures(batch))

all_structures_table = pa.concat_tables(arrow_tables)
# should likely write this to a local file for later use
# import pyarrow.parquet as pq
# pq.write_table(
#    all_structures_table,
#    "xas_structures.parquet.zstd",
#    compression="ZSTD"
# )

# can deserialize all to pmg, or filter, join, etc. first
pmg_structures = TypeAdapter(list[StructureType]).validate_python(
    all_structures_table["structure"].to_pylist(maps_as_pydicts="strict")
)

(see note here about arrow deserialization)

If you were to distribute the batches in that core loop with a library like Dask, you could get a pretty significant speed-up. You could also just run the loop as-is :shrug:
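As a minimal sketch of what that distribution could look like with just the standard library (the retrieve_batch worker and toy IDs below are stand-ins so the pattern is self-contained; in practice you would use retrieve_xas_structures and your real task IDs):

```python
from concurrent.futures import ThreadPoolExecutor


def retrieve_batch(batch: list[str]) -> list[str]:
    # stand-in for retrieve_xas_structures from the script above
    return [f"structure-for-{tid}" for tid in batch]


as_alpha = [f"id{i}" for i in range(9)]  # toy IDs for illustration
batches = [as_alpha[i : i + 3] for i in range(0, len(as_alpha), 3)]

# Executor.map preserves input order, so results line up with batches
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(retrieve_batch, batches))
```

With the real worker you would then pa.concat_tables the results exactly as in the single-process loop.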


That aside, I’ll think more about how we can make the bulk task_id to material_id lookup more ergonomic.

Thank you so much for your fast response. I haven’t run your script yet, but it is essentially getting the structures for the XAS task_ids in a nice batched (SQL query) way, right? Could I also run something like this to just get the task_id-to-material_id mapping? If I stored that mapping locally, I would not need the mpr.get_material_id_from_task_id(material_id) calls.

Also, regarding the site-specific spectra question, I can explain a little more; maybe then you’ll have an idea. Basically, there should be a spectrum for each atom site in the material. When using the API, I think these individual spectra get averaged over all atoms of a given element, so we get a single spectrum per element rather than per atomic site. Maybe there is a way to retrieve every actual FEFF computation spectrum (i.e., the atom-site-specific ones).

Hi @Junki, the site-specific spectra aren’t available through the API right now. That’s something we’ve separately been contacted about by the FEFF developers, and we will try to add it in the future from our existing data.

For a python way to get the task ID to material ID lookup:

from mp_api.client import MPRester

with MPRester() as mpr:
    mat_docs = mpr.materials.search(fields=["material_id","structure","task_ids"])

material_id_to_structure = {doc.material_id.string: doc.structure for doc in mat_docs}
task_id_to_material_id = {
    task_id: doc.material_id.string for doc in mat_docs for task_id in doc.task_ids
}
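With that mapping in hand, resolving XAS task IDs to canonical material IDs becomes a plain dict lookup instead of one API call per record. A toy sketch (the IDs below are made up purely for illustration; in practice the dict comes from the snippet above and the task IDs from your XAS query):

```python
# hypothetical mapping and task IDs, purely for illustration
task_id_to_material_id = {"mp-1001": "mp-1", "mp-1002": "mp-1", "mp-2001": "mp-2"}
xas_task_ids = ["mp-1001", "mp-2001", "mp-3001"]  # last one has no mapping

# batch-resolve via dict lookup; .get() returns None for unmapped task IDs
resolved = {tid: task_id_to_material_id.get(tid) for tid in xas_task_ids}
unmapped = [tid for tid, mid in resolved.items() if mid is None]
```

Storing task_id_to_material_id locally (e.g. as JSON) would let you skip the per-record lookup entirely on later runs.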

Hi @Aaron_Kaplan, is there any way of getting the raw data of the site-specific spectra?
It would be very valuable for me and my doctoral research.

Yeah, it should be in our raw data; I’ll coordinate with the rest of the team and keep you posted!
