Trouble downloading all possible molecule_ids

Hello,

I am trying to use the API to download all of the possible molecule_ids, so I can better chunk my API requests. I use the following code:

from mp_api.client import MPRester

with MPRester(MP_API_KEY) as mpr:
    summary_docs = mpr.molecules.summary.search(fields=["molecule_id"])

However, there are two problems with this.

  1. It returns more fields than the requested molecule_id.
  2. It only downloads a subset of the data. For me, it only downloaded 155361/577813 documents.

My question is: what is the easiest way to download the rest?

One solution on my end would be to save the molecule_ids it does give me, then re-download with a filter that excludes the molecule_ids I already have.

Any suggestions would be appreciated!

Best Regards,

Logan Lang

Additional info:
mp-api 0.43.0

Thanks for reaching out. As for #2, that was an issue with the underlying data release. We've just released a new version and updated the mp-api library. Also see here. Please upgrade to mp-api==0.44.0.

As for #1, the client will download all documents directly from our OpenData repositories when no query is provided. The fields argument is ignored in this case, and a warning is printed in the latest mp-api version. Please save the results of a full download to a file and re-use it to extract what you need. See #4 here. HTH
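A minimal sketch of that re-use step (assuming the full download was saved as one JSON list of document dicts; the path here is hypothetical):

import json

# Load the previously saved full download (hypothetical path)
with open("molecules_summary.json") as f:
    summary_docs = json.load(f)

# Re-use the saved copy to extract just the field you need
molecule_ids = [doc["molecule_id"] for doc in summary_docs]
print(len(molecule_ids))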

Hey,

Thanks for the quick response!

I tried your solution; it works, and I am now downloading past the previous point. However, I am running into new issues.

  1. The progress bar misreports how much data has been downloaded (unsure if this has to do with my system). This is not a big issue; it might just be slow to update given the amount of data being downloaded.

  2. The main issue is that I run out of RAM before I get the chance to store the data (for reference, I have ~24 GB of free RAM for this). I may be doing something wrong, though. I looked at the links you gave me and, I apologize, I still can't piece together a solution. What would help is restricting the download to just molecule_id; then I could handle the chunking on my end, but as you said, the fields keyword argument is ignored.

import json
import os
from mp_api.client import MPRester

with MPRester(MP_API_KEY, monty_decode=False, use_document_model=False) as mpr:
    summary_docs = mpr.molecules.summary.search()
    with open(os.path.join(save_dir, 'molecules_summary.json'), 'w') as f:
        json.dump(summary_docs, f)

Again, thank you for the quick response and help!

Logan Lang

More information:

OS: Microsoft Windows 10 Pro
Processor: AMD Ryzen 7 3700X 8-Core Processor, 3600 Mhz, 8 Core(s), 16 Logical Processor(s)
python: Python 3.10.15

Could you explain what your original goal was? Do you eventually need the full molecules data, or are you interested only in a subset of chemical systems, symmetry groups, number of elements, etc.?

Running the data retrieval separately

with MPRester(MP_API_KEY, monty_decode=False, use_document_model=False) as mpr:
    summary_docs = mpr.molecules.summary.search()

took less than 8 min for me and indeed used 35 GB of RAM to decompress and load all the data into memory. I can see your RAM usage going well past that when you then try to save all the data into one JSON file.
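One way to soften that final spike, if you do keep everything in memory, is to write one document per line (JSON Lines) instead of a single json.dump call. A minimal sketch, assuming summary_docs is the list of plain dicts returned with use_document_model=False:

import json

# Write one document per line (JSON Lines) so the encoder never has to
# build the entire serialized dataset as one string in memory.
with open("molecules_summary.jsonl", "w") as f:
    for doc in summary_docs:
        f.write(json.dumps(doc) + "\n")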

The full compressed size of the 86 files in the OpenData repo is 1.9 GB. If memory usage is an issue, you could use the AWS CLI to retrieve those files directly via

aws s3 cp --no-sign-request --recursive s3://materialsproject-build/collections/2024-11-14/molecules/ mp_molecules/

This will save the files to the mp_molecules directory on your system, and you can subsequently decompress and read parts of the data into memory as needed. HTH
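As a minimal sketch of that last step, assuming the downloaded files are gzipped JSON Lines (one document per line; adjust the glob pattern if the layout differs):

import gzip
import json
from pathlib import Path

molecule_ids = []

# Process one compressed file at a time so only the ids stay in memory.
# Assumes gzipped JSON Lines files; adjust the pattern if the layout differs.
for path in Path("mp_molecules").rglob("*.jsonl.gz"):
    with gzip.open(path, "rt") as f:
        for line in f:
            molecule_ids.append(json.loads(line)["molecule_id"])

print(f"collected {len(molecule_ids)} molecule_ids")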

The original goal is to run high-throughput calculations on all the molecules. I am also generally curious about what kind of data there is, so I would like to download the entire dataset and explore it myself.

The full compressed size of the 86 files in the OpenData repo is 1.9 GB. If memory usage is an issue, you could use the AWS CLI to retrieve those files directly via

This is exactly what I wanted, and I got the data! I feel dumb; out of the many times I've looked at the docs, I missed this part. I have never used the AWS CLI before, so thanks for mentioning it!

Thank you for taking the time to answer my questions!!

Logan Lang
