How do I do a time-split of Materials Project entries? e.g. pre-2018 vs. post-2018

It’s been ~4 years since How to query date of entry addition to the Materials Project?, and this is pretty critical to an upcoming manuscript on generative models, so I’m hoping to get a bit more visibility. In particular, I’m hoping to unconditionally generate some number of candidate structures (e.g. 10 million) and see how many match the second half of a time-split of Materials Project entries, i.e. all entries that were added after YYYY/MM/DD (whatever date approximately splits the database in half). Since this is more in a materials discovery context, I’m interested in probing the first date a material ever appeared on Materials Project, so created_at looks promising (similar to what was mentioned in the linked matsci post above). It would be great to get confirmation that that’s the case.

Any recommendations on going about this? Feedback and alternative suggestions at a higher level are welcome too.

Hi @sgbaird, one of the easiest ways to do this would be to look at the mp-id. We assign those sequentially, so higher mp-ids correspond to newer materials. You could use that to figure out where to split the materials. I’m not as familiar with the specifics of how to determine when a material was first added, though.
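If it helps, here’s a minimal sketch of that idea, assuming the numeric part of the mp-id is monotonically increasing with addition date (the example IDs and threshold below are made up for illustration):

```python
# Sketch: split materials by the numeric part of their mp-id,
# assuming higher IDs were assigned later, per the suggestion above.
# The IDs and cutoff here are hypothetical.

def mpid_number(mpid: str) -> int:
    """Extract the integer part of an ID like 'mp-1234'."""
    return int(mpid.rsplit("-", 1)[-1])

material_ids = ["mp-149", "mp-1234", "mp-770632", "mp-1245001"]
threshold = 1_000_000  # hypothetical cutoff; pick one that halves your dataset

older = [m for m in material_ids if mpid_number(m) < threshold]
newer = [m for m in material_ids if mpid_number(m) >= threshold]

print(older)  # ['mp-149', 'mp-1234', 'mp-770632']
print(newer)  # ['mp-1245001']
```

To find a cutoff that splits the database roughly in half, you could sort all IDs by `mpid_number` and take the median.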

As an FYI, you’ll probably want to make sure to pull data from our new API rather than the legacy API in mapidoc. You might be able to look at the dates of the individual tasks listed in origins to see the first appearance of a material.
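A hedged sketch of that origins idea: assuming each origins entry carries a timestamp (the `last_updated` field name and the sample data below are assumptions, so check the actual schema in the new API docs), you could take the earliest timestamp as a proxy for first appearance:

```python
from datetime import datetime

# Hypothetical origins entries for one material; in practice these would
# come from the provenance/origins data returned by the new API.
origins = [
    {"name": "structure", "last_updated": datetime(2019, 5, 2)},
    {"name": "structure", "last_updated": datetime(2016, 11, 20)},
]

# Earliest task date as a proxy for when the material first appeared.
first_seen = min(o["last_updated"] for o in origins)
print(first_seen.year)  # 2016
```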

Good luck!

Hey Sterling,

FYI, I frequently generate time splits using the ICSD information. You can do this via the references attribute of the snl object that can be returned by the legacy MPRester.query, then parse the references with pybtex and get the years. I have a script somewhere to do this - I’ll take a look and see if I can post it later today.

Using the legacy API. This hasn’t been rigorously tested, so use with caution.

import re

from pymatgen.ext.matproj import MPRester
from pybtex.database.input import bibtex
from tqdm import tqdm
import pybtex.errors

pybtex.errors.set_strict_mode(False)  # tolerate malformed bibtex entries

with MPRester(api_key="YOUR_API_KEY") as mpr:
    data = mpr.query(
        criteria={
            "elements": "Ti",  # remove for a full production run
            "icsd_ids": {"$ne": []},  # only materials with ICSD entries
        },
        properties=["pretty_formula", "snl", "material_id", "icsd_ids", "e_above_hull"],
    )

for datum in tqdm(data):
    parser = bibtex.Parser()
    refs = parser.parse_string(datum["snl"].references)
    # Keep only entries with a parseable four-digit year
    entries_by_year = [
        (int(entry.fields["year"]), entry)
        for name, entry in refs.entries.items()
        if "year" in entry.fields and re.match(r"\d{4}", entry.fields["year"])
    ]
    if entries_by_year:
        # Earliest reference is taken as the "discovery" report
        entries_by_year = sorted(entries_by_year, key=lambda x: x[0])
        first_year, first_entry = entries_by_year[0]
        authors = [str(auth) for auth in first_entry.persons["author"]]
        datum.update(
            {
                "discovery": {
                    "year": first_year,
                    "authors": authors,
                    "num_authors": len(authors),
                }
            }
        )

I imagine there’s a better way to do this with the new API, but it wasn’t immediately obvious to me with the snl, etc.

Both of these ideas are great, thank you @rkingsbury and @Joseph_Montoya!

@rkingsbury it makes a lot of sense that mp-ids would be assigned sequentially; that would certainly be the most straightforward approach. Thanks! Also a great suggestion about converting over to the new API. The directions in the "Accessing Data" section seem pretty straightforward.

@Joseph_Montoya thanks for sharing the script! It looks like the DOIs section might do the trick, as it has a bibtex field.

Also, just noticed the Execute button on the new API Docs. Very cool! Makes it easy to know exactly what to expect from each field.

I’m having some trouble getting the bibtex entries that contain ICSD tags using the new API.

For now, I’m thinking I’ll use the theoretical field to filter down to experimental materials and then use the corresponding material_ids, as in:

from tqdm import tqdm
with MPRester(api_key) as mpr:
    provenance_results = [mpr.provenance.get_data_by_id(mid) for mid in tqdm(material_id)]

I’m doing it this way since mpr.provenance.query() doesn’t seem to be functioning, though get_data_by_id() will be much slower and clogs up the terminal output.

@rkingsbury any suggestions?

EDIT: resolved this in `ProvenanceRester`: query parameters which cannot be used: `nsites`, `elements` even though in `mpr.provenance.available_fields` · Issue #613 · materialsproject/api · GitHub by retrieving all ProvenanceDoc-s and cross-referencing against the experimental materials IDs.
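For anyone curious, the cross-referencing step can be sketched roughly like this (the docs and IDs below are placeholders, not real query results):

```python
# Sketch of cross-referencing all provenance docs against the set of
# experimental material IDs (e.g. those where `theoretical` is False).
# The data here is illustrative; substitute your actual query results.

experimental_ids = {"mp-149", "mp-1234"}

all_provenance = [
    {"material_id": "mp-149", "references": "..."},
    {"material_id": "mp-9999", "references": "..."},
    {"material_id": "mp-1234", "references": "..."},
]

# Keep only provenance docs for experimental materials
experimental_provenance = [
    doc for doc in all_provenance if doc["material_id"] in experimental_ids
]
print(len(experimental_provenance))  # 2
```

Using a set for the membership test keeps this O(1) per lookup, which matters when cross-referencing the full database.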

Any REFs I should cite that use these kinds of time-splits?

We used this kind of time split in the final case in our recent paper on multi-fidelity active learning.

https://www.nature.com/articles/s41598-022-08413-8

I’ve been reading through this. Very nice work, and very timely for me as I was planning to do a brief follow-up study to Effect of reducible and irreducible search space representations on adaptive design efficiency: a case study on maximizing packing fraction for solid rocket fuel propellant simulations | Materials Science | ChemRxiv | Cambridge Open Engage. Will be sure to cite.