How do I do a time-split of Materials Project entries? e.g. pre-2018 vs. post-2018

It’s been ~4 years since How to query date of entry addition to the Materials Project?, and this is pretty critical to an upcoming manuscript on generative models, so I’m hoping to get a bit more visibility. In particular, I’m hoping to unconditionally generate some number of candidate structures (e.g. 10 million) and see how many match the second half of a time-split of Materials Project entries, i.e. all entries that were added after YYYY/MM/DD (whatever date approximately splits the database in half). Since this is more in a materials discovery context, I’m interested in probing the first date a material ever appeared on Materials Project, so created_at looks promising (similar to what was mentioned in the linked matsci post above). It would be great to get confirmation that that’s the case.

Any recommendations on going about this? Feedback and alternative suggestions at a higher level are welcome too.

Hi @sgbaird, one of the easiest ways to do this would be to look at the mp-id. We assign those sequentially, so higher mp-ids correspond to newer materials. You could use that to figure out where to split the materials. I’m not as familiar with the specifics of how to determine when a material was first added, though.
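If it helps, here’s a minimal sketch of that idea, assuming the numeric part of the mp-id is monotonically increasing with addition date (the example IDs and threshold below are made up for illustration):

```python
# Sketch: split materials by the numeric part of their mp-id,
# assuming higher IDs were assigned later, per the suggestion above.
# The IDs and cutoff here are hypothetical.

def mpid_number(mpid: str) -> int:
    """Extract the integer part of an ID like 'mp-1234'."""
    return int(mpid.rsplit("-", 1)[-1])

material_ids = ["mp-149", "mp-1234", "mp-770632", "mp-1245001"]
threshold = 1_000_000  # hypothetical cutoff; pick one that halves your dataset

older = [m for m in material_ids if mpid_number(m) < threshold]
newer = [m for m in material_ids if mpid_number(m) >= threshold]

print(older)  # ['mp-149', 'mp-1234', 'mp-770632']
print(newer)  # ['mp-1245001']
```

To find a cutoff that splits the database roughly in half, you could sort all IDs by `mpid_number` and take the median.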

As an FYI, you’ll probably want to make sure to pull data from our new API rather than the legacy API in mapidoc. You might be able to look at the dates of the individual tasks listed in origins to see the first appearance of a material.
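A hedged sketch of that origins idea: assuming each origins entry carries a timestamp (the `last_updated` field name and the sample data below are assumptions, so check the actual schema in the new API docs), you could take the earliest timestamp as a proxy for first appearance:

```python
from datetime import datetime

# Hypothetical origins entries for one material; in practice these would
# come from the provenance/origins data returned by the new API.
origins = [
    {"name": "structure", "last_updated": datetime(2019, 5, 2)},
    {"name": "structure", "last_updated": datetime(2016, 11, 20)},
]

# Earliest task date as a proxy for when the material first appeared.
first_seen = min(o["last_updated"] for o in origins)
print(first_seen.year)  # 2016
```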

Good luck!

Hey Sterling,

FYI, I frequently generate time splits using the ICSD information. You can do this via the references attribute of the snl object that can be returned by the legacy MPRester.query, then parse the references with pybtex and get the years. I have a script somewhere to do this - I’ll take a look and see if I can post it later today.

Using the legacy API. This hasn’t been rigorously tested, so use with caution.

import re

from pymatgen.ext.matproj import MPRester
from pybtex.database.input import bibtex
from tqdm import tqdm
import pybtex.errors

pybtex.errors.set_strict_mode(False)  # tolerate malformed bibtex entries

with MPRester(api_key="YOUR_API_KEY") as mpr:
    data = mpr.query(
        criteria={
            "elements": "Ti",  # remove for a full production run
            "icsd_ids": {"$ne": []},  # only materials with ICSD entries
        },
        properties=["pretty_formula", "snl", "material_id", "icsd_ids", "e_above_hull"],
    )

for datum in tqdm(data):
    parser = bibtex.Parser()
    refs = parser.parse_string(datum["snl"].references)
    # Keep only entries with a parseable four-digit year
    entries_by_year = [
        (int(entry.fields["year"]), entry)
        for name, entry in refs.entries.items()
        if "year" in entry.fields and re.match(r"\d{4}", entry.fields["year"])
    ]
    if entries_by_year:
        # Earliest reference is taken as the "discovery" report
        entries_by_year = sorted(entries_by_year, key=lambda x: x[0])
        first_year, first_entry = entries_by_year[0]
        authors = [str(auth) for auth in first_entry.persons["author"]]
        datum.update(
            {
                "discovery": {
                    "year": first_year,
                    "authors": authors,
                    "num_authors": len(authors),
                }
            }
        )

I imagine there’s a better way to do this with the new API, but it wasn’t immediately obvious to me with the snl, etc.

Both of these ideas are great, thank you @rkingsbury and @Joseph_Montoya!

@rkingsbury it makes a lot of sense that mp-ids would be assigned sequentially; that would certainly be the most straightforward approach. Thanks! Also a great suggestion about converting over to the new API. The directions in the "Accessing Data" section seem pretty straightforward.

@Joseph_Montoya thanks for sharing the script! It looks like the DOIs section might do the trick, as it has a bibtex field.

Also, just noticed the Execute button on the new API Docs. Very cool! Makes it easy to know exactly what to expect from each field.

I’m having some trouble getting the bibtex entries that contain ICSD tags using the new API.

For now, I’m thinking I’ll use the theoretical field to filter down to experimental materials and then use the corresponding material_ids, as in:

from tqdm import tqdm
with MPRester(api_key) as mpr:
    provenance_results = [mpr.provenance.get_data_by_id(mid) for mid in tqdm(material_id)]

I’m doing it this way since mpr.provenance.query() doesn’t seem to be functioning, though get_data_by_id() will be much slower and clogs up the terminal output.

@rkingsbury any suggestions?

EDIT: resolved this in `ProvenanceRester`: query parameters which cannot be used: `nsites`, `elements` even though in `mpr.provenance.available_fields` · Issue #613 · materialsproject/api · GitHub by retrieving all ProvenanceDoc-s and cross-referencing against the experimental materials IDs.
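For anyone curious, the cross-referencing step can be sketched roughly like this (the docs and IDs below are placeholders, not real query results):

```python
# Sketch of cross-referencing all provenance docs against the set of
# experimental material IDs (e.g. those where `theoretical` is False).
# The data here is illustrative; substitute your actual query results.

experimental_ids = {"mp-149", "mp-1234"}

all_provenance = [
    {"material_id": "mp-149", "references": "..."},
    {"material_id": "mp-9999", "references": "..."},
    {"material_id": "mp-1234", "references": "..."},
]

# Keep only provenance docs for experimental materials
experimental_provenance = [
    doc for doc in all_provenance if doc["material_id"] in experimental_ids
]
print(len(experimental_provenance))  # 2
```

Using a set for the membership test keeps this O(1) per lookup, which matters when cross-referencing the full database.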

Any REFs I should cite that use these kinds of time-splits?

We used this kind of time split in the final case in our recent paper on multi-fidelity active learning.

https://www.nature.com/articles/s41598-022-08413-8

I’ve been reading through this. Very nice work, and very timely for me as I was planning to do a brief follow-up study to Effect of reducible and irreducible search space representations on adaptive design efficiency: a case study on maximizing packing fraction for solid rocket fuel propellant simulations | Materials Science | ChemRxiv | Cambridge Open Engage. Will be sure to cite.