Common IDs for distinct materials

blokhin · September 14, 2020, 11:11pm

Good day colleagues,

Sometimes people refer to particular distinct materials in the Materials Project by their mp-ids, see e.g. (I saw other examples as well).

OTOH, apart from @mkhorton’s comment, I cannot find many technical details behind the mp-ids, in particular how exactly they are formulated and how they should be used for identification. Could you help me understand that?

Please let me note that, with the proliferation of different materials databases, common IDs for the same materials become very useful; see the picture below (credits 熊谷将也).

mkhorton · September 14, 2020, 11:34pm

For this discussion, the mp-id can be thought of as an arbitrary identifier. As each distinct calculation is performed, it is assigned an mp-id sequentially.

What Materials Project does is use the crystal structure itself to group together calculations that refer to the same crystal. We do this using the StructureMatcher class in pymatgen, which can determine whether or not two crystal structures are equivalent subject to some tolerance.

In this way, the oldest/smallest mp-id for a given crystal structure becomes the canonical identifier we use in our database, with other calculations for that crystal structure grouped together with it.
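As a rough illustration of the grouping described above, here is a minimal sketch in plain Python. It is not the Materials Project's actual pipeline: the `same_crystal` predicate is a hypothetical stand-in for a structure comparison such as StructureMatcher, the "structures" are just labels, and the mp-ids are made up for the example.

```python
import re


def mp_number(mp_id):
    """Return the integer part of an ID of the form 'mp-<number>'."""
    match = re.fullmatch(r"mp-(\d+)", mp_id)
    if match is None:
        raise ValueError(f"unexpected ID format: {mp_id!r}")
    return int(match.group(1))


def group_calculations(calcs, same_crystal):
    """Group (mp_id, structure) pairs whose structures compare as equivalent.

    same_crystal is a hypothetical predicate standing in for a real
    structure comparison; each new structure is compared against the
    first member of every existing group.
    """
    groups = []
    for mp_id, structure in calcs:
        for group in groups:
            if same_crystal(group[0][1], structure):
                group.append((mp_id, structure))
                break
        else:
            # No existing group matched, so this structure starts a new one.
            groups.append([(mp_id, structure)])
    return groups


def canonical_ids(groups):
    """Map the smallest mp-id of each group to all ids in that group."""
    result = {}
    for group in groups:
        ids = sorted((mp_id for mp_id, _ in group), key=mp_number)
        result[ids[0]] = ids
    return result


# Toy data: labels "A" and "B" stand in for crystal structures,
# and the IDs are invented for this example.
calcs = [("mp-1002", "A"), ("mp-17", "A"), ("mp-530", "B")]
groups = group_calculations(calcs, same_crystal=lambda a, b: a == b)
canonical = canonical_ids(groups)
print(canonical)
```

Note that the ids are compared by their numeric part: a plain string comparison would rank "mp-1002" before "mp-17".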

There have been some attempts to create some kind of identifier that is deterministically assigned (e.g. based on space group, Wyckoff position, or similar), but typically these identifiers tend to be very long and there’s usually some edge case not well handled (e.g. a general Wyckoff position might require x, y and z co-ordinates to be defined).

The mp-id (being arbitrary) is not a perfect system, but it works decently well as a community standard since the MP database is open access and historical calculations remain available (vs, for example, the ICSD ID where multiple ICSD IDs might refer to the same crystal structure, and the ICSD is not itself open).

(Also, hi @blokhin! welcome to the forum)

blokhin · September 16, 2020, 12:11pm

Dear Matthew, many thanks for your answer. Could you recommend a way to retrieve all the active mp-ids and the associated metadata? I imagine simply brute-forcing all the integers from 1 to 1M, but this might not be the polite way of doing things.

Also, let’s say we’d like to refer uniquely to a structure in the Materials Project, and there is more than one mp-id for it. Which one should we choose?

shyamd · September 16, 2020, 4:10pm

The task_ids field is what you’re looking for. If you grab that for each material, you’ll have all active mp-ids.

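# Legacy pymatgen MPRester; a Materials Project API key must be configured.
from pymatgen.ext.matproj import MPRester

with MPRester() as mpr:
    # Empty criteria match every material; request only the task_ids field.
    docs = mpr.query({}, ["task_ids"])

# Flatten the per-material lists into one set of mp-ids.
all_mp_ids = {t_id for doc in docs for t_id in doc["task_ids"]}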

mkhorton · September 17, 2020, 2:30am

Just to add to @shyamd’s answer: if you do a query using pymatgen and ask for the fields task_id, material_id and task_ids, then task_id and material_id are equivalent and synonymous with the “material ID” presented on the website, while task_ids is a list of all the individual calculation identifiers grouped with that specific material.
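To make the relationship between these fields concrete, here is a small sketch that resolves any individual calculation ID to the material ID it is grouped under. The `docs` list is a hand-written stand-in for query results, the IDs in it are made up, and `build_task_to_material` is a hypothetical helper rather than part of pymatgen.

```python
def build_task_to_material(docs):
    """Map every ID in task_ids to the material_id of its document."""
    lookup = {}
    for doc in docs:
        for task_id in doc["task_ids"]:
            lookup[task_id] = doc["material_id"]
    return lookup


# Stand-in for query results with the fields material_id and task_ids;
# the IDs are invented for this example.
docs = [
    {"material_id": "mp-17", "task_ids": ["mp-17", "mp-1002"]},
    {"material_id": "mp-530", "task_ids": ["mp-530"]},
]

task_to_material = build_task_to_material(docs)

# Any grouped calculation ID resolves to the same material ID.
print(task_to_material["mp-1002"])
print(task_to_material["mp-17"])
```

Under this reading, when several mp-ids exist for one structure, the material_id is the one to cite, and the other entries in task_ids can still be resolved back to it.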

blokhin · September 17, 2020, 4:58pm

Many thanks, Gentlemen