I’d like to generate a list of formulas from MP and their associated band gaps. I am not specific about the material type or domain. I suspect I need a substantial number of (material, property) pairs. What is the best way of doing this?
So let’s say that all I need is a list of 1000 records of the form ["material_id", "formula_pretty", "band_gap"].
Is there a way of doing this efficiently?
Efficient ways to go about it depend on how often you need to retrieve the (same or a different) chunk of 1000 materials. See our docs for details on how to set up the mp-api Python client, and also consult this page for tips and tricks to be aware of. If you only need a list of any 1000 materials once, you can use the num_chunks and chunk_size arguments:
from mp_api.client import MPRester

fields = ["material_id", "formula_pretty", "band_gap"]

with MPRester(APIKEY) as mpr:
    # retrieve a single chunk of 1000 summary documents, requesting only the needed fields
    docs = mpr.materials.summary.search(
        fields=fields, chunk_size=1000, num_chunks=1
    )
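A query can also be added to narrow the results. As a minimal sketch, assuming you only wanted materials with a nonzero band gap (band_gap is one of the range filters accepted by summary.search; the 0.1–5.0 eV window here is purely illustrative):

from mp_api.client import MPRester

fields = ["material_id", "formula_pretty", "band_gap"]

with MPRester(APIKEY) as mpr:
    # same chunked retrieval, restricted to band gaps between 0.1 and 5 eV
    docs = mpr.materials.summary.search(
        band_gap=(0.1, 5.0),
        fields=fields,
        chunk_size=1000,
        num_chunks=1,
    )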
Rerunning the first snippet will return the same 1000 materials unless a query is added. This quickly becomes inefficient. Assuming that you’d like to repeatedly generate random chunks of 1000 materials from MP, I’d suggest retrieving the fields you need for all materials once and saving them to a local file (make sure to update the file when new MP data releases come out). You can subsequently reuse the file to generate a randomized list of 1000 materials as often as needed. For instance:
import gzip

import orjson
from mp_api.client import MPRester

fields = ["material_id", "formula_pretty", "band_gap"]

# use_document_model=False and monty_decode=False return plain dicts,
# which serialize directly with orjson
with MPRester(APIKEY, use_document_model=False, monty_decode=False) as mpr:
    docs = mpr.materials.summary.search(fields=fields)

# dump the full list of documents to a compressed JSON file
option = orjson.OPT_NAIVE_UTC | orjson.OPT_SERIALIZE_NUMPY
dumped = orjson.dumps(docs, option=option)

fn = "mp_docs.json.gz"
with gzip.open(fn, "wb") as f:
    f.write(dumped)

# later: reload the cached documents from disk
with gzip.open(fn, "rb") as f:
    docs = orjson.loads(f.read())

# use the list of materials in `docs` to randomly select 1000
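For that last step, a minimal sketch using random.sample from the standard library (the seed is optional and only there for reproducibility):

import random

# reproducible random subset of 1000 (material_id, formula_pretty, band_gap) records
random.seed(42)
subset = random.sample(docs, k=1000)

Rerunning this with a different seed (or no seed) gives a fresh random chunk of 1000 without hitting the API again.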
HTH