Mptrj contains material id mp-531102 but entries and web site do not

noam.bernstein · December 19, 2024, 7:40pm

As suggested in the thread “Mp-505260 missing from all tasks download” (#6 by noam.bernstein), I’m opening a new thread for the mptrj missing data issue.

mp-531102 is in mptrj, containing Fe-Bi-O, but it’s not on the web site or in the complete list of entries I downloaded. It is in the complete list of tasks, though, tagged with tasks_from_old_prod_2019. Is there any way to find out why it appears to have been removed from the current data?

Aaron_Kaplan · December 19, 2024, 8:32pm

Hi @noam.bernstein, that task is not valid because it uses Hubbard U’s that are inconsistent with what MP uses generally. You can check the validity of any task with the following code:

from emmet.core.vasp.validation import ValidationDoc

valid_doc = ValidationDoc.from_task_doc(task_doc) # task_doc should be a task you have

valid_doc.valid is True for any valid task, which would then go into a material. I can still pull task mp-531102 from the API to perform this check.

While the full task collection includes both valid and invalid tasks, we generally serve up only the valid ones.

Also please keep in mind that MPtrj is not an official Materials Project product, and AFAIK, does not check whether calculations are valid within MP’s internal validation scheme. We are considering distributing an MP-produced version of this dataset if there’s sufficient community interest.

noam.bernstein · December 19, 2024, 8:47pm

Thanks, that’s very helpful. I’ll use this validation function in the future.

tsmathis · December 19, 2024, 10:18pm

To follow up on Aaron’s point, @noam.bernstein: if you want to quickly get a list of all the (in)valid task_ids in MP, you can also use this:

import pandas as pd

task_validation_df = pd.read_parquet("s3://materialsproject-build/collections/{DB_VERSION}/task-validation/manifest.parquet")

valid_ids = task_validation_df[task_validation_df.valid == True].task_id

# proceed as you like

Starting from db version 2024-11-14, we are also releasing our task validation collection. This collection won’t be available via the API, but will generally be available via S3 and Open Data going forward for each db release.

noam.bernstein · December 19, 2024, 10:31pm

How do I get a “task_doc” from the downloaded s3 tasks data? The downloaded data appears to be dicts stored as jsonl.

noam.bernstein · December 20, 2024, 1:58pm

[edited] never mind. I see that both of these previously unrelated issues are, in fact, related. When I put in the latest version, it works and doesn’t complain about credentials. The initial error misled me.

@tsmathis unfortunately, your code snippet fails for me with

botocore.exceptions.NoCredentialsError: Unable to locate credentials

also, is {DB_VERSION} meant to be literal, or am I supposed to be substituting a specific DB version?

Aaron_Kaplan · December 20, 2024, 6:24pm

Hey @noam.bernstein, this may be a quirk of AWS’s open data. Try this:

from botocore import UNSIGNED

task_validation_df = pd.read_parquet(
    "s3://materialsproject-build/collections/2024-11-14/task-validation/manifest.parquet",
    storage_options={"config_kwargs": {"signature_version": UNSIGNED}},
)

In @tsmathis’s example, DB_VERSION is one of the collections here
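In other words, {DB_VERSION} is a placeholder rather than literal text. A minimal sketch of the substitution, where the helper name manifest_url is made up for illustration and the path layout is simply copied from the snippets in this thread:

```python
# Path layout copied from the snippets above; only the version segment changes.
MANIFEST_TEMPLATE = (
    "s3://materialsproject-build/collections/{db_version}"
    "/task-validation/manifest.parquet"
)


def manifest_url(db_version: str) -> str:
    """Hypothetical helper: fill in the database version, e.g. "2024-11-14"."""
    return MANIFEST_TEMPLATE.format(db_version=db_version)


url = manifest_url("2024-11-14")
print(url)
```

The resulting string can be passed to pd.read_parquet together with the unsigned storage_options shown above.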

Each line of the jsonl file is an individual document; you would read them in line by line.
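As a rough sketch of reading a downloaded file line by line: the helper name iter_jsonl is hypothetical, and this assumes the standard-library json parser is good enough for the fields you care about.

```python
import gzip
import json
from typing import Iterator


def iter_jsonl(path: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of a .jsonl or .jsonl.gz file."""
    # Pick the gzip opener for compressed files, the builtin one otherwise.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines
                yield json.loads(line)
```

Because it is a generator, you can loop over a large file without holding every document in memory, e.g. for doc in iter_jsonl("tasks.jsonl.gz"): ... (the file name here is only an example).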

noam.bernstein · December 20, 2024, 7:01pm

Each line of the jsonl file is an individual document; you would read them in line by line.

Do you mean for the tasks? If so, I know how to read the dicts from the jsonl. But I don’t know how to convert them into the format that ValidationDoc.from_task_doc() wants. Although it’s moot, assuming I can get the same info by checking if the task id is present in the valid_ids list that @tsmathis’s code snippet gives.

tsmathis · December 20, 2024, 8:03pm

If you are only looking for the list of valid/invalid task ids (for a given db version), just pulling down the manifest.parquet using pandas and filtering the dataframe will definitely be quickest.
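For illustration, here is roughly what that filtering and membership check could look like. The DataFrame below is a toy stand-in for the real manifest: the task_id and valid column names come from the snippet above, the "mp-example-*" rows are made up, and mp-531102 is marked invalid only because that is what was reported earlier in this thread.

```python
import pandas as pd

# Toy stand-in for the manifest; in practice this comes from pd.read_parquet(...).
task_validation_df = pd.DataFrame(
    {
        "task_id": ["mp-531102", "mp-example-1", "mp-example-2"],  # last two are hypothetical
        "valid": [False, True, True],
    }
)

# Split the task ids by the boolean "valid" column.
is_valid = task_validation_df["valid"]
valid_ids = set(task_validation_df.loc[is_valid, "task_id"])
invalid_ids = set(task_validation_df.loc[~is_valid, "task_id"])

# Set membership answers "is this task valid?" without touching the task docs.
print("mp-531102" in valid_ids)
```

Converting to a set keeps repeated lookups cheap if you are checking many ids, for example every id that appears in MPtrj.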

If you would rather actually pull all of those docs for further inspection, you can use the following:

import boto3
from botocore import UNSIGNED
from botocore.config import Config
from bson import json_util
from smart_open import open

client = boto3.client("s3", config=Config(signature_version=UNSIGNED))
paginator = client.get_paginator("list_objects_v2")
bucket = "materialsproject-build"
iterator = paginator.paginate(
    Bucket=bucket, Prefix="collections/2024-11-14/task-validation/"
)

document_object_prefixes = []
for page in iterator:
    # "Contents" is absent from a page when no objects match the prefix
    document_object_prefixes.extend(
        [
            entry["Key"]
            for entry in page.get("Contents", [])
            if "manifest" not in entry["Key"]
        ]
    )

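validation_docs = []
for prefix in document_object_prefixes:
    # smart_open decompresses .gz objects based on the key's extension;
    # the with-block closes each stream once it has been read
    with open(
        f"s3://{bucket}/{prefix}", encoding="utf-8", transport_params={"client": client}
    ) as file:
        validation_docs.extend(
            [json_util.loads(jline) for jline in file.read().splitlines()]
        )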

This should generally work with all the jsonl.gz objects you might find while browsing MP’s Open Data repository.

It is also important to note that you need json_util from bson to correctly deserialize these documents into Python primitives, because of the extended JSON types that come with us having to interoperate object storage with Mongo.
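To make the extended JSON point concrete, here is a small sketch using only the standard library. The last_updated field and its value are invented for illustration and are not taken from a real document.

```python
import json

# A made-up line in MongoDB extended JSON style: the date is wrapped in {"$date": ...}.
line = '{"task_id": "mp-531102", "last_updated": {"$date": "2024-11-14T00:00:00Z"}}'

doc = json.loads(line)

# The plain json module leaves the wrapper as a nested dict instead of a datetime.
print(type(doc["last_updated"]).__name__, doc["last_updated"])
```

Parsing the same line with json_util.loads from bson would give a datetime object for that field, which is why the download snippet above uses it.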

That code will just get you a big list of the dict representations of the documents. If you would rather have the pydantic document models, with dot access to the attributes, you can run something very similar to Aaron’s snippet from above after you’ve retrieved everything:

from emmet.core.vasp.validation import ValidationDoc
pydantic_serialized_validation_docs = [ValidationDoc(**doc) for doc in validation_docs]

The MPRester client does all of this behind the scenes when you, say, run an empty search to get all the docs for a given collection, e.g., mpr.materials.tasks.search() to get all the tasks.

When using the client for large downloads, we recommend setting monty_decode and use_document_model to False to speed things up; see: Tips for Large Downloads | Materials Project Documentation. But this would again get you back to the dict representation.

One last important note: the client only supports downloading datasets that currently correspond to the client’s endpoints, listed here: MP API endpoints. For datasets like the task-validation collection, you would have to use the above code or your own workaround.