Hello everyone,
I’m a PhD student developing a robust, multi-functional, and user-friendly Bayesian Optimization framework. For benchmarking tests, I need a large volume of crystal structures along with their associated materials properties (no preference for specific properties, since I want to test robustness).
I’m currently retrieving such data using the OPTIMADE API. Below is the code I’m using (querying the AFLOW provider):
```python
from optimade.client import OptimadeClient
from optimade.adapters import Structure
from ase.io import write
import os
import csv

# Initialize the client to query only the AFLOW provider
client = OptimadeClient(include_providers=["aflow"])

# Filter for records where _aflow_agl_heat_capacity_cp_300k is known
filter_query = "_aflow_agl_heat_capacity_cp_300k IS KNOWN"
result = client.get(filter=filter_query)

# Results are keyed by endpoint, then filter string, then provider base URL
data_entries = result.get("structures", {}).get(filter_query, {}).get(client.base_urls[0], {})

# Set up the output directory and CSV file
output_dir = "aflow_cif_files"
os.makedirs(output_dir, exist_ok=True)
csv_filename = "aflow_cp300.csv"

with open(csv_filename, "w", newline="") as csvfile:
    csv_writer = csv.writer(csvfile)
    csv_writer.writerow(["Index", "Filename", "_aflow_agl_heat_capacity_cp_300k"])

    structures = data_entries.get("data", [])
    for i, record in enumerate(structures, start=1):
        # Convert the OPTIMADE structure entry to an ASE Atoms object
        try:
            atoms = Structure(record).as_ase
        except Exception as e:
            print(f"Record {i} conversion failed: {e}")
            continue

        # Write the structure out as a CIF file
        filename = f"structure_{i}.cif"
        filepath = os.path.join(output_dir, filename)
        try:
            write(filepath, atoms, format="cif")
        except Exception as e:
            print(f"Record {i} writing failed: {e}")
            continue

        # Record the property value alongside the CIF filename
        prop_value = record.get("attributes", {}).get("_aflow_agl_heat_capacity_cp_300k", "N/A")
        csv_writer.writerow([i, filename, prop_value])
```
While this works well initially, I quickly hit the download limit of 1,000 records, after which subsequent structure downloads seem to fail (or at least are not written). Based on my previous experience with the Materials Project API (where using an API key helped avoid such limits), I’m wondering:
- Which part of the code or tutorial should I focus on to properly implement bulk downloads via the OPTIMADE API?
- Are there any key concepts or important details I might have missed (e.g., handling pagination, authentication, or API keys) that could help overcome the 1,000-record limit? (A sketch of what I have been experimenting with follows this list.)
- Any general best practices for bulk downloading data from OPTIMADE providers?
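In case it helps clarify what I mean, here is roughly how I imagine raising the limit would look, assuming the `max_results_per_provider` argument of `OptimadeClient` is the relevant knob and that the client then simply pages through the results for me; I have not confirmed this, and the value 10,000 below is only a placeholder:

```python
from optimade.client import OptimadeClient

# Assumption: max_results_per_provider controls the per-provider download cap
# (the default appears to be 1000), and raising it lets the client keep
# following pagination links until that many records have been fetched.
# 10_000 is just a placeholder value for illustration.
client = OptimadeClient(
    include_providers=["aflow"],
    max_results_per_provider=10_000,
)

filter_query = "_aflow_agl_heat_capacity_cp_300k IS KNOWN"
result = client.get(filter=filter_query)
```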
I truly appreciate any guidance, references to relevant tutorials, or advice on how to adjust my workflow for handling large-scale downloads. Thank you very much for your time and for all the hard work the OPTIMADE team has put into this project!