Hello everyone,
I’m a PhD student developing a robust, multi-functional, and user-friendly Bayesian Optimization framework. For benchmarking tests, I need a large volume of crystal structures along with their associated materials properties (no preference for specific properties, since I want to test robustness).
I’m currently retrieving such data using the OPTIMADE API. Below is the code I’m using (querying the AFLOW provider):
```python
from optimade.client import OptimadeClient
from optimade.adapters import Structure
from ase.io import write
import os
import csv

# Initialize the client to query only the AFLOW provider
client = OptimadeClient(include_providers=["aflow"])

# Filter for records where _aflow_agl_heat_capacity_cp_300k is known
filter_query = "_aflow_agl_heat_capacity_cp_300k IS KNOWN"
result = client.get(filter=filter_query)

# Results are keyed by endpoint, then filter string, then provider base URL
data_entries = result.get("structures", {}).get(filter_query, {}).get(client.base_urls[0], {})

# Set up the output directory and CSV file
output_dir = "aflow_cif_files"
os.makedirs(output_dir, exist_ok=True)
csv_filename = "aflow_cp300.csv"

with open(csv_filename, "w", newline="") as csvfile:
    csv_writer = csv.writer(csvfile)
    csv_writer.writerow(["Index", "Filename", "_aflow_agl_heat_capacity_cp_300k"])

    structures = data_entries.get("data", [])
    for i, record in enumerate(structures, start=1):
        # Convert the OPTIMADE structure entry to an ASE Atoms object
        try:
            atoms = Structure(record).as_ase
        except Exception as e:
            print(f"Record {i} conversion failed: {e}")
            continue

        # Write the structure out as a CIF file
        filename = f"structure_{i}.cif"
        filepath = os.path.join(output_dir, filename)
        try:
            write(filepath, atoms, format="cif")
        except Exception as e:
            print(f"Record {i} writing failed: {e}")
            continue

        # Record the property value alongside the CIF filename
        prop_value = record.get("attributes", {}).get("_aflow_agl_heat_capacity_cp_300k", "N/A")
        csv_writer.writerow([i, filename, prop_value])
```
While this works well initially, I quickly hit the download limit of 1,000 records, after which subsequent structure downloads seem to fail (or at least are not written). Based on my previous experience with the Materials Project API (where using an API key helped avoid such limits), I’m wondering:
- Which part of the code or tutorial should I focus on to properly implement bulk downloads via the OPTIMADE API?
- Are there any key concepts or important details I might have missed (e.g., handling pagination, authentication, or API keys) that could help overcome the 1,000-record limit? (A sketch of what I have been experimenting with follows this list.)
- Any general best practices for bulk downloading data from OPTIMADE providers?
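In case it helps clarify what I mean, here is roughly how I imagine raising the limit would look, assuming the `max_results_per_provider` argument of `OptimadeClient` is the relevant knob and that the client then simply pages through the results for me; I have not confirmed this, and the value 10,000 below is only a placeholder:

```python
from optimade.client import OptimadeClient

# Assumption: max_results_per_provider controls the per-provider download cap
# (the default appears to be 1000), and raising it lets the client keep
# following pagination links until that many records have been fetched.
# 10_000 is just a placeholder value for illustration.
client = OptimadeClient(
    include_providers=["aflow"],
    max_results_per_provider=10_000,
)

filter_query = "_aflow_agl_heat_capacity_cp_300k IS KNOWN"
result = client.get(filter=filter_query)
```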
I truly appreciate any guidance, references to relevant tutorials, or advice on how to adjust my workflow for handling large-scale downloads. Thank you very much for your time and for all the hard work the OPTIMADE team has put into this project!