How can I collect the same dataset as before the database system change?

motonuko · January 16, 2020, 5:20am

Hi

I want to collect the material data for the IDs in this list (https://github.com/txie-93/cgcnn/blob/d612a69530a72ba686fca56813657b89f2440cc5/data/material-data/mp-ids-46744.csv).

But I can’t collect a dataset of the same size (46744 entries).

Here is my Python code:
from pymatgen.ext.matproj import MPRester

m = MPRester(API_KEY)
# labels from the link (mp-ids-46744.csv); each ID must be a quoted string
material_labels = ["mp-754118", "mp-978908", "mp-633688", …]
results = m.query(criteria={"task_ids": {"$in": material_labels}}, properties=my_property_list)

I tried both task_id and material_id in m.query(), but I couldn’t collect a dataset of the same size. Is this caused by the database change in 2018, or by a problem in my code?

https://matsci.org/t/change-in-materials-project-ids/1268

Thanks,

motonuko · January 16, 2020, 5:31am

Sorry, I found the same problem in this topic, but it hasn’t been solved yet.

kdmiller · January 16, 2020, 3:50pm

It’s caused by some changes to the mp-ids of materials. I found it messy to try to recover the dataset from the mp-ids, so it was easier for me to just requery the dataset using the properties I knew were used to collect it in the first place (e.g. use MPRester to find all sulfides with x metals, etc.).
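As a rough illustration of requerying by properties instead of by ID, the sketch below builds a MongoDB-style criteria dictionary of the kind m.query() accepts. The helper name sulfide_criteria, the example metals, and the element-count cutoff are hypothetical; the field names "elements" and "nelements" are assumed to be the ones exposed by the legacy query interface, and the real selection rules behind any given dataset would have to come from its authors.

```python
def sulfide_criteria(metals, max_nelements):
    """Build query criteria for compounds containing S and all given metals.

    Hypothetical helper: the field names "elements" and "nelements" are
    assumed to match the legacy Materials Project query interface.
    """
    return {
        # every listed element must be present in the compound
        "elements": {"$all": ["S", *metals]},
        # cap the total number of distinct elements
        "nelements": {"$lte": max_nelements},
    }


criteria = sulfide_criteria(["Fe", "Cu"], max_nelements=3)
print(criteria)
# The dictionary would then be passed on, e.g.
# m.query(criteria=criteria, properties=my_property_list)
```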

motonuko · January 17, 2020, 1:13am

Thanks.
I’ll try it!

shyamd · January 21, 2020, 3:58pm

Hi @motonuko

All of the materials should still be there. I did a quick check and the task_ids are still there. What likely happened is that a number of these structures were actually duplicates of each other. This goes back to how we define materials.

A Material is a collection of calculations, where each calculation has an ID of the form: mp-35434. Any of these IDs refers to that material. Prior to 2018, we were matching structures at the beginning of the calculation to determine which calculations to group together. This resulted in a number of duplicates.

The reason you see fewer entries now is that those duplicates only show up once. From an ML standpoint, this is what you want, since repeated entries can be bad for ML training and can bias the MAE from CV. Just because you hold out a subset doesn’t mean the algorithm hasn’t already seen a duplicate in another batch, which leads to an underestimate of the true MAE.
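To make the leakage point concrete, here is a minimal sketch of a split that keeps all IDs of one material on the same side of the train/test boundary. It is only an illustration, not how the Materials Project or any particular ML code handles this: the function grouped_split, the ID-to-material mapping, and the IDs themselves are hypothetical.

```python
import random
from collections import defaultdict


def grouped_split(sample_to_group, test_fraction=0.2, seed=0):
    """Split samples so that samples sharing a group never straddle the split."""
    groups = defaultdict(list)
    for sample, group in sample_to_group.items():
        groups[group].append(sample)

    # Shuffle whole groups (materials), not individual samples (IDs).
    group_keys = sorted(groups)
    random.Random(seed).shuffle(group_keys)

    n_test = max(1, round(len(group_keys) * test_fraction))
    test = [s for g in group_keys[:n_test] for s in groups[g]]
    train = [s for g in group_keys[n_test:] for s in groups[g]]
    return train, test


# Hypothetical mapping: calculation ID -> material it now belongs to.
id_to_material = {
    "mp-1": "A", "mp-7": "A",   # two IDs, one material (a former duplicate)
    "mp-2": "B",
    "mp-3": "C", "mp-8": "C",
    "mp-4": "D",
    "mp-5": "E",
}
train_ids, test_ids = grouped_split(id_to_material, test_fraction=0.4)
print("train:", train_ids)
print("test:", test_ids)
```

With a plain random split over the IDs, "mp-1" could land in training and "mp-7" in the hold-out set, so the model would be scored on a structure it has effectively already seen.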

motonuko · January 24, 2020, 6:52am

Hi @shyamd! Thanks for your explanation!

How did you check for duplicate materials?

I ran this code:

results = m.query(criteria={"task_ids": {"$in": material_labels}}, properties=my_property_list)

and checked the uniqueness of “material_id” in the results, but every “material_id” looks unique. This confuses me.

shyamd · January 24, 2020, 4:51pm

I think this is part of the confusion. The mp- ID actually refers to a calculation, not a material. A material is composed of any number of calculations, so it actually has multiple mp- IDs, all listed in the task_ids field. For all intents and purposes, which ID is reported as task_id or material_id is arbitrary, as long as it’s one of the IDs in task_ids.

Another way to put it is that a material doesn’t have just one ID, but a whole list of IDs, and all of these are valid. We changed the way in which we put these IDs together into materials so that we are better at finding duplicates, which resulted in fewer materials for the same list of IDs.
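One way to see this in query results is to group the requested IDs by the material whose task_ids list contains them. The sketch below assumes "task_ids" was included in the requested properties, so each returned document carries both "material_id" and "task_ids"; the helper group_requested_ids and the sample documents are hypothetical stand-ins for real query output.

```python
from collections import defaultdict


def group_requested_ids(requested_ids, results):
    """Map each returned material_id to the requested IDs found in its task_ids."""
    requested = set(requested_ids)
    merged = defaultdict(list)
    for doc in results:
        for task_id in doc["task_ids"]:
            if task_id in requested:
                merged[doc["material_id"]].append(task_id)
    return dict(merged)


# Hypothetical documents shaped like the query output (not real database entries).
results = [
    {"material_id": "mp-1", "task_ids": ["mp-1", "mp-7", "mp-9"]},
    {"material_id": "mp-2", "task_ids": ["mp-2"]},
]
requested_ids = ["mp-1", "mp-7", "mp-2"]

groups = group_requested_ids(requested_ids, results)
# Materials that absorbed more than one of the requested IDs.
duplicates = {mid: ids for mid, ids in groups.items() if len(ids) > 1}

print(len(requested_ids), "requested IDs ->", len(groups), "materials")
print(duplicates)
```

Every material_id in the output is unique, yet the number of materials is smaller than the number of requested IDs, because several requested IDs sit in the task_ids of the same material.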

We use the spacegroup from spglib and StructureMatcher in pymatgen to determine uniqueness. StructureMatcher implements an algorithm that puts different structures in the same setting, looking for things like supercells, and then compares the site positions, angles, and lattice vectors within a certain tolerance to determine if they are the same. Previously we did this matching at the beginning of a calculation. Now we do it at the end.