If you are only looking for the list of valid/invalid task ids (for a given db version), just pulling down the manifest.parquet
using pandas and filtering the dataframe will definitely be quickest.
If you would rather actually pull all of those docs for further inspection, you can use the following:
import boto3
from botocore import UNSIGNED
from botocore.config import Config
from bson import json_util
from smart_open import open  # shadows the builtin open; reads s3:// URLs

# Unsigned (anonymous) requests, so no AWS credentials are needed.
client = boto3.client("s3", config=Config(signature_version=UNSIGNED))
paginator = client.get_paginator("list_objects_v2")

bucket = "materialsproject-build"
iterator = paginator.paginate(
    Bucket=bucket, Prefix="collections/2024-11-14/task-validation/"
)

# Collect every object key under the prefix, skipping the manifest.
document_object_prefixes = []
for page in iterator:
    document_object_prefixes.extend(
        [
            entry["Key"]
            for entry in page.get("Contents", [])  # an empty page has no "Contents"
            if "manifest" not in entry["Key"]
        ]
    )

# Read each object and parse one document per line.
validation_docs = []
for prefix in document_object_prefixes:
    with open(
        f"s3://{bucket}/{prefix}", encoding="utf-8", transport_params={"client": client}
    ) as file:
        validation_docs.extend(
            [json_util.loads(jline) for jline in file.read().splitlines()]
        )
This should generally work with all the jsonl.gz objects you might find while browsing MP’s Open Data repository.
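For anyone curious what a jsonl.gz object looks like under the hood, here is a minimal, standard-library-only sketch. It builds a tiny gzip-compressed JSON Lines payload in memory and parses it one line at a time. The helper name iter_jsonl_gz and the sample records are made up for illustration and are not real MP data, and plain json.loads is used only to keep the example dependency-free.

```python
import gzip
import io
import json


def iter_jsonl_gz(raw_bytes):
    """Yield one parsed dict per non-empty line of a gzip-compressed JSON Lines payload."""
    with gzip.open(io.BytesIO(raw_bytes), mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Hypothetical records for illustration only, not real Materials Project data.
sample_records = [
    {"task_id": "example-1", "valid": True},
    {"task_id": "example-2", "valid": False},
]

# Serialize one JSON document per line, then gzip the result.
payload = gzip.compress(
    "\n".join(json.dumps(record) for record in sample_records).encode("utf-8")
)

parsed = list(iter_jsonl_gz(payload))
print(parsed)
```

Iterating line by line like this also avoids holding the whole decompressed text in memory at once, which may matter for the larger objects.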
It is also important to note that you do need json_util from bson to correctly deserialize these documents into their Python primitives, because of the extended JSON types that come with us having to interoperate object storage with MongoDB.
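To make the extended JSON point concrete, here is a rough, standard-library-only sketch of the difference. The sample line and the decode_relaxed_date hook are hypothetical; the hook is a hand-rolled stand-in that handles only the relaxed {"$date": "<ISO string>"} form, whereas json_util covers this and the other extended types for you.

```python
import json
from datetime import datetime, timezone


def decode_relaxed_date(obj):
    """Turn {"$date": "<ISO-8601 string>"} into a datetime; leave other dicts alone."""
    if set(obj) == {"$date"} and isinstance(obj["$date"], str):
        # Swap a trailing "Z" for an explicit offset so fromisoformat
        # also works on older Python versions.
        return datetime.fromisoformat(obj["$date"].replace("Z", "+00:00"))
    return obj


# Hypothetical line for illustration only, not a real document.
jline = '{"task_id": "example-1", "last_updated": {"$date": "2024-11-14T00:00:00Z"}}'

plain = json.loads(jline)  # the {"$date": ...} wrapper stays a dict
decoded = json.loads(jline, object_hook=decode_relaxed_date)  # wrapper becomes a datetime

expected = datetime(2024, 11, 14, tzinfo=timezone.utc)
print(type(plain["last_updated"]).__name__)
print(decoded["last_updated"] == expected)
```

With plain json.loads you would have to unwrap every such field yourself, which is the work json_util.loads saves you.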
So that code will just get you a big list of all the dict representations of the documents. If you would rather have the pydantic document models, with dot access to the attributes, you could run something very similar to Aaron’s snippet from above after you’ve retrieved everything:
from emmet.core.vasp.validation import ValidationDoc
pydantic_serialized_validation_docs = [ValidationDoc(**doc) for doc in validation_docs]
The MPRester client does all of this behind the scenes for you when you, say, run an empty search to get all the docs for a given collection, e.g., mpr.materials.tasks.search() to get all the tasks.
When using the client, we do recommend setting monty_decode and use_document_model to False to speed up large downloads; see: Tips for Large Downloads | Materials Project Documentation. But this would again get you back to the dict representation.
However, one last important note: the client only supports downloading datasets that currently correspond to the client’s endpoints, listed here: MP API endpoints. For datasets like the task-validation collection, you would have to use the above code or your own workaround.