Query result is different when using legacy API vs. new API

johan28 · February 11, 2022, 11:16pm

Dear developer,

I tried using both the legacy API (pymatgen) and the new API (mp_api) to extract elasticity data, and found that the data retrieved for the same chemical system differs between the two.

Here is what I did:

For the legacy API:

[Screenshot: legacy API query]

For the new API:

[Screenshot: new API query]

As shown, the number of entries returned is different: the legacy API gives 13 entries, while the new API gives 2 + 4 + 14 = 20. I also checked for beta-MnO2 (mp-510408): it appears in the new API results but not in the legacy ones.

Are some of the newly calculated results not included in the legacy database?

mkhorton · February 11, 2022, 11:19pm

Hi @johan28,

Thanks for testing our new API!

This is correct; the legacy API is frozen on database version v2020.09.08 and will remain so indefinitely. The new API is currently running on database version v2021.11.10, which incorporates various fixes and new data. See the get_db_version() method for how to retrieve the database version via code, and the Materials Project Database Release Log for more information on historical versions.
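To see exactly which entries differ between the two database versions, one option is to compare the sets of IDs returned by each query. The following is only a minimal sketch: `diff_ids` is a hypothetical helper, not part of pymatgen or mp_api, and the IDs shown are placeholders rather than real query results.

```python
def diff_ids(legacy_ids, new_ids):
    """Return (only_in_legacy, only_in_new) as sorted lists of IDs."""
    legacy, new = set(legacy_ids), set(new_ids)
    return sorted(legacy - new), sorted(new - legacy)


# Placeholder IDs for illustration only; substitute the IDs from your own queries.
only_legacy, only_new = diff_ids(["mp-A", "mp-B"], ["mp-B", "mp-C"])
print("only in legacy:", only_legacy)
print("only in new:", only_new)
```

IDs that show up on only one side are the ones worth checking against the release log.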

Hope this helps,

Matt

johan28 · February 12, 2022, 12:03am

Thank you so much.
I have a follow-up question. I also tried to extract all the entries in the ["Mn", "O"] chemical system, and in this case "mp-510408" is found only with the legacy API, not the new one. This is the opposite of the elasticity results, where "mp-510408" is found with the new API but not the legacy one.

Legacy API:

[Screenshot: legacy API query]

New API:

[Screenshot: new API query]

What could be the reason for these results?

mkhorton · February 12, 2022, 1:33am

There is a subtle but important semantic difference between task_id (identifier for a specific calculation) and material_id (identifier for a material, which aggregates multiple calculations).

Typically the material_id is the smallest/oldest task_id associated with a given material.
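As a rough illustration of that convention, the sketch below picks the task_id with the smallest numeric suffix. `numeric_part` and `guess_material_id` are hypothetical helpers, not part of mp_api, and because the rule only holds "typically", the result is a guess rather than an authoritative answer. The two IDs are the ones discussed in this thread.

```python
def numeric_part(mp_id: str) -> int:
    """Extract the integer suffix from an ID such as 'mp-510408'."""
    return int(mp_id.split("-")[1])


def guess_material_id(task_ids):
    """Heuristic: the material_id is usually the task_id with the smallest number."""
    return min(task_ids, key=numeric_part)


# Compare numerically: as plain strings, "mp-1271735" would sort before "mp-510408".
task_ids = ["mp-1271735", "mp-510408"]
print(guess_material_id(task_ids))
```

For anything that matters, look the material_id up through the API instead of relying on this heuristic.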

The new API is more careful about distinguishing between material_id and task_id, leading to slight differences from the legacy API.

from mp_api import MPRester

with MPRester() as mpr:
    thermo_doc = mpr.thermo.get_data_by_id("mp-510408")
    print(thermo_doc.material_id)  # gives mp-510408
    print(thermo_doc.entries["GGA"].data["task_id"])  # gives mp-1271735, the individual calculation

When you retrieve a list of entries using get_entries_in_chemsys, this collects all the entries from the thermo endpoint into a single list. Each entry in that list carries the task_id of the calculation that supplied its energy, not the material_id.

To get the material_id from the task_id, you can use:

mpr.get_materials_id_from_task_id("mp-1271735")  # returns "mp-510408"

To find all task_ids associated with a given material_id, you can try:

mpr.materials.get_data_by_id("mp-510408").task_ids  # returns a list of all ids
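Putting these lookups together, a membership check such as "is mp-510408 in my results?" can be done by first converting the task_ids from the entries into material_ids. This is a minimal sketch: `material_ids_for` is a hypothetical helper, and the lookup is passed in as a function so that the example runs offline with a small dictionary standing in for the API. The dictionary holds only the one pair mentioned in this thread.

```python
from typing import Callable, Iterable, Set


def material_ids_for(task_ids: Iterable[str], lookup: Callable[[str], str]) -> Set[str]:
    """Map each task_id to its material_id using the supplied lookup function."""
    return {lookup(task_id) for task_id in task_ids}


# Offline stand-in for the API lookup (assumption: with a live connection,
# mpr.get_materials_id_from_task_id would be passed as `lookup` instead).
known = {"mp-1271735": "mp-510408"}

found = material_ids_for(["mp-1271735"], known.__getitem__)
print("mp-510408" in found)
```

With a live connection this makes one lookup per entry, so it may be slow for large chemical systems.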

In future, it seems that we should store both the material_id and the task_id in the entry so that a lookup is not required. We will add this to our to-do list!

Apologies that this is somewhat confusing. Let me know if you have any further questions.

Best,

Matt