Materials Project database: getting deprecated structures

Hey,
I am training a machine learning model on the Materials Project database, and it seems that the performance of my model has decreased. In order to check whether I messed up my model, I am trying to download the old data, including the deprecated data points. Is this still possible? I tried to query the structures like this:

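from pymatgen.ext.matproj import MPRester

# self.apikey holds the Materials Project API key (this runs inside a class method)
with MPRester(self.apikey) as m:
    # Ask only for entries flagged as deprecated
    query = m.query(
        criteria={"deprecated": True},
        properties=["deprecated"],
    )
    for q in query:
        if q["deprecated"]:
            print("deprecated!")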

but it seems like there are no deprecated structures in the data, because nothing is printed.

Hi @Stefaan,

It should be possible; however, I see the same issue. It may be that the deprecated flag is not whitelisted and is therefore ignored when you perform the query.
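One way to tell whether a filter like this had any effect is to request the `deprecated` property without filtering on it and split the results client-side. The following is only a minimal sketch, assuming each returned entry is a dict that may or may not carry a `deprecated` key and that the property is returned at all; `split_by_deprecated` is a hypothetical helper name, not part of pymatgen, and the example entries are made up.

```python
from typing import Iterable, List, Tuple


def split_by_deprecated(docs: Iterable[dict]) -> Tuple[List[dict], List[dict]]:
    """Split query results into (deprecated, current) lists.

    Entries without a "deprecated" key are treated as current (an assumption).
    """
    deprecated, current = [], []
    for doc in docs:
        if doc.get("deprecated", False):
            deprecated.append(doc)
        else:
            current.append(doc)
    return deprecated, current


# Stand-in data shaped like a list of query results (made-up IDs).
example = [
    {"material_id": "id-1", "deprecated": True},
    {"material_id": "id-2", "deprecated": False},
    {"material_id": "id-3"},
]
old, new = split_by_deprecated(example)
print(len(old), len(new))
```

If the first list is always empty, either nothing is deprecated or the field is not being returned, which is worth distinguishing before drawing conclusions about the data.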

If you’re using a newer version of pymatgen, you can also check the log in your ~/.pmgrc.yaml file to see which database version you were using most often; perhaps the database version changed after you trained your model.
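As a rough sketch of how that log could be inspected: the key names below (`MAPI_DB_VERSION`, `LOG`) are an assumption about how pymatgen records database versions in the settings file, so check them against your own ~/.pmgrc.yaml. Both function names are hypothetical, and loading the file requires the third-party PyYAML package.

```python
from typing import Optional


def most_used_db_version(settings: dict) -> Optional[str]:
    """Return the database version with the highest access count, if logged."""
    # Assumed layout: {"MAPI_DB_VERSION": {"LOG": {version: count, ...}}}
    log = (settings.get("MAPI_DB_VERSION") or {}).get("LOG") or {}
    if not log:
        return None
    return max(log, key=log.get)


def load_pmgrc(path: str = "~/.pmgrc.yaml") -> dict:
    """Load the pymatgen settings file; requires PyYAML."""
    import os
    import yaml  # third-party: pip install pyyaml

    with open(os.path.expanduser(path)) as f:
        return yaml.safe_load(f) or {}


# Made-up example of what the parsed file might contain.
example = {"MAPI_DB_VERSION": {"LAST_ACCESSED": "v-b", "LOG": {"v-a": 12, "v-b": 3}}}
print(most_used_db_version(example))
```

On a real machine, `most_used_db_version(load_pmgrc())` would apply the same lookup to the actual settings file.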

If this is causing a specific problem for your work, let me know the exact database version and fields you need, and I can send you a database dump from an older version.

Note that in our upcoming API you will be able to query for data from a given database version to avoid this issue, but this is not yet publicly available. We’re working hard on it, however :)

Hope this helps,

Matt

I’ve confirmed that the deprecated key was not whitelisted. This has been fixed and will be live in our next release (~weeks).

Hi @mkhorton ,

thanks a lot for your reply and for your nice work!
I was trying to reproduce our own benchmark from a paper released in 2018. Do you think it is still possible to get the database at Version 2.0.0 (Release Date: 04/13/2016)? I think this must have been the data that was used for our benchmark model. I would need the structures and the “formation energy per atom”.

Best regards,
Stefaan

That version from 2016 is pretty far back for us, and I’m not sure I have it to hand. All the tasks from back then should still be available via the API if you wanted to reconstruct the formation energies, but it’d be a bit of an effort. I’ll check, however.
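For reference, the basic arithmetic behind a formation energy per atom is the total energy minus the composition-weighted elemental reference energies, divided by the number of atoms. The sketch below shows only that textbook formula; it does not reproduce the Materials Project compatibility corrections, the function name is hypothetical, and the numbers are made up.

```python
from typing import Mapping


def formation_energy_per_atom(
    total_energy: float,
    composition: Mapping[str, float],
    reference_energies: Mapping[str, float],
) -> float:
    """(E_total - sum_i n_i * E_ref_i) / N_atoms, with energies in eV."""
    n_atoms = sum(composition.values())
    if n_atoms <= 0:
        raise ValueError("composition must contain at least one atom")
    # Energy of the same atoms in their elemental reference states.
    reference_total = sum(
        count * reference_energies[element] for element, count in composition.items()
    )
    return (total_energy - reference_total) / n_atoms


# Made-up numbers, purely to show the arithmetic.
e_form = formation_energy_per_atom(
    total_energy=-10.0,
    composition={"A": 1, "B": 1},
    reference_energies={"A": -2.0, "B": -4.0},
)
print(e_form)
```

The effort mentioned above lies mostly in choosing consistent reference energies and corrections for every entry, not in this final step.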

We have a new release coming soon with an updated compatibility scheme, meaning better formation energies, so you may prefer to wait for that and re-train with that data.

There is also a data dump from 2018, made available by some collaborators as “Graphs of materials project”, where they archived data for training their own ML models. I’d definitely encourage creating an archive of training data prior to publishing a model, because the up-front values reported by the Materials Project do change as new calculations come in. Even if data for individual calculations remains available, it can be an effort to reconstruct derived information like formation energies.
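Archiving a training set can be as simple as writing the records to a file together with the database version and a checksum. This is a minimal sketch using only the standard library; the function names and field names are hypothetical, the records are made up, and it assumes the records are already JSON-serializable (a pymatgen structure would first need to be converted to a plain dict).

```python
import hashlib
import json
from datetime import datetime, timezone


def build_snapshot(records: list, db_version: str) -> dict:
    """Bundle training records with provenance metadata and a checksum."""
    # sort_keys makes the checksum independent of dict key order.
    payload = json.dumps(records, sort_keys=True)
    return {
        "db_version": db_version,
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "n_records": len(records),
        "sha256": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
        "records": records,
    }


def write_snapshot(snapshot: dict, path: str) -> None:
    """Write the snapshot as human-readable JSON."""
    with open(path, "w") as f:
        json.dump(snapshot, f, indent=2, sort_keys=True)


# Made-up records, standing in for downloaded training data.
records = [{"material_id": "id-1", "formation_energy_per_atom": -1.5}]
snapshot = build_snapshot(records, db_version="example-version")
print(snapshot["n_records"], snapshot["sha256"][:8])
```

Storing the checksum alongside the model makes it possible to confirm later that a retraining run used exactly the same data.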

Best,

Matt

Hi @mkhorton ,

thanks for your reply! I started a training session based on the 2018 dataset and this seems to reproduce the benchmark pretty well. So there is no need for the old 2016 data. Thanks a lot for your help!

Best, Stefaan

Glad to hear it! Our new release went live yesterday too, so you may find our new data better as well. Also coming soon is the release of our new formation energy scheme, which will hopefully bring our predictions that much closer to experiment.