Hi @Junki
I can’t answer your question re: site-averaged vs. site-specific spectra; others with more knowledge on that will have to chime in.
> documents appear to be task IDs
Correct: for the XAS collection, material_id is actually a task_id. This is an older collection that needs to be migrated/updated.
> join XAS results directly to summary documents without this per-record lookup
Unfortunately not at this time: the task_id to material_id lookup cannot easily be done at scale right now because of the many-to-one relationship between task_ids and material_ids (one material aggregates many calculation tasks).
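To illustrate why the bulk lookup is awkward: several task_ids roll up into one material_id, so inverting the mapping fans out. A toy sketch with made-up IDs (not real MP identifiers):

```python
# Hypothetical IDs, purely illustrative of the many-to-one relationship
task_to_material = {
    "mp-task-100": "mp-1",
    "mp-task-101": "mp-1",  # same material, different calculation task
    "mp-task-200": "mp-2",
}

# Invert to material -> [tasks]; each material fans out to many tasks
material_to_tasks: dict[str, list[str]] = {}
for tid, mid in task_to_material.items():
    material_to_tasks.setdefault(mid, []).append(tid)
```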
I can, however, help you retrieve the structures for all of the XAS task_ids more efficiently, rather than submitting one request at a time, using some alpha features/data products we’ve been working on.
The following script requires an additional library (`pip install deltalake`) to use the ‘lake’ part of MP’s data lakehouse. You’ll also need a recent emmet-core with pyarrow, and Python 3.12 for `batched` from itertools (you could also copy the rough implementation of `itertools.batched` from its docs if you need to support <3.12).
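For reference, a minimal stand-in for `itertools.batched` (adapted from the recipe in the itertools docs) if you’re stuck below 3.12:

```python
from itertools import islice


def batched(iterable, n):
    # Yield successive tuples of up to n items, like itertools.batched (3.12+).
    if n < 1:
        raise ValueError("n must be at least one")
    it = iter(iterable)
    while chunk := tuple(islice(it, n)):
        yield chunk
```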
Consider this “experimental” for now, hopefully we can better leverage these types of things internally in the mp_api client in the future for user convenience.
```python
from itertools import batched

import pyarrow as pa
from deltalake import DeltaTable, QueryBuilder
from emmet.core.arrow import arrowize
from emmet.core.mpid import AlphaID
from emmet.core.types.pymatgen_types.structure_adapter import StructureType
from mp_api.client import MPRester
from pydantic import TypeAdapter

tasks_uri = "s3a://materialsproject-parsed/core/tasks"


def retrieve_xas_structures(batch: list[str]) -> pa.Table:
    # creating these here (Table + QueryBuilder) assuming
    # this will be parallelized; if run in a loop in one process,
    # just make these global to avoid rebuilding them on every call
    tasks_tbl = DeltaTable(
        tasks_uri,
        storage_options={"AWS_SKIP_SIGNATURE": "true", "AWS_REGION": "us-east-1"},
    )
    tasks_qb = QueryBuilder().register("tasks", tasks_tbl)

    query_str = ",".join([f"'{tid}'" for tid in batch])
    return pa.table(
        tasks_qb.execute(
            f"""
            SELECT task_id
                 , structure
            FROM tasks
            WHERE task_id IN ({query_str});
            """
        ).read_all(),
        schema=pa.schema(
            [
                pa.field("task_id", pa.string()),
                pa.field("structure", arrowize(StructureType)),
            ]
        ),
    )


with MPRester() as mpr:
    xas_task_ids = set(
        # multiple spectra per task;
        # material_id is actually a task_id for this old collection
        [doc["material_id"] for doc in mpr.materials.xas.search(fields=["material_id"])]
    )

# convert the legacy ID format to the internal AlphaID format
# len(as_alpha) == ~60k
as_alpha = [str(AlphaID(tid, padlen=8)).split("-")[-1] for tid in xas_task_ids]

arrow_tables = []
for batch in batched(as_alpha, 3000):  # adjust batch size as appropriate
    # batches can instead be submitted to distributed compute and then gathered:
    # concurrent.futures, Dask, Ray, etc.
    arrow_tables.append(retrieve_xas_structures(batch))

all_structures_table = pa.concat_tables(arrow_tables)

# likely worth writing this to a local file for later use:
# import pyarrow.parquet as pq
# pq.write_table(
#     all_structures_table,
#     "xas_structures.parquet.zstd",
#     compression="ZSTD",
# )

# can deserialize everything to pymatgen structures, or filter/join/etc. first
pmg_structures = TypeAdapter(list[StructureType]).validate_python(
    all_structures_table["structure"].to_pylist(maps_as_pydicts="strict")
)
```
(see note here about arrow deserialization)
If you were to distribute the batches in that core loop with a library like Dask, you could get a pretty significant speedup. You could also just run the loop as is :shrug:
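As a rough sketch of the stdlib route with `concurrent.futures` (here `fetch` is a stand-in for a function shaped like `retrieve_xas_structures` above, not a real MP call):

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(batch: list[str]) -> list[str]:
    # stand-in for retrieve_xas_structures; a real version would query the lakehouse
    return list(batch)


ids = [f"tid-{i}" for i in range(10)]
batches = [ids[i : i + 3] for i in range(0, len(ids), 3)]

# pool.map preserves input order, so results line up with batches
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch, batches))
```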
That aside, I’ll think more about how we can make the bulk task_id to material_id lookup more ergonomic.