Questions Regarding Molecules Dataset Fields and API Usage

hmqwjf · December 3, 2024, 6:49am

I am working on downloading the complete dataset for molecules using the Materials Project API and encountered some discrepancies that I hope you can help clarify.

I used the following code (Code 1) to download the dataset:

import pandas as pd

from mp_api.client import MPRester  

with MPRester(api_key=API_key, monty_decode=False, use_document_model=False) as mpr:  
    docs = mpr.molecules.summary.search()  
    df = pd.DataFrame(docs)  
    df.to_csv('molecules_data.csv')

The downloaded CSV file contains the following fields in the header:
_id, builder_meta, nsites, elements, nelements, composition, composition_reduced, formula_pretty, formula_anonymous, chemsys, volume, density, density_atomic, symmetry, property_name, material_id, deprecated, deprecation_reasons, last_updated, origins, warnings, structure, task_ids, uncorrected_energy_per_atom, energy_per_atom, formation_energy_per_atom, energy_above_hull, is_stable, equilibrium_reaction_energy_per_atom, decomposes_to, xas, grain_boundaries, band_gap, cbm, vbm, efermi, is_gap_direct, is_metal, es_source_calc_id, bandstructure, dos, dos_energy_up, dos_energy_down, is_magnetic, ordering, total_magnetization, total_magnetization_normalized_vol, total_magnetization_normalized_formula_units, num_magnetic_sites, num_unique_magnetic_sites, types_of_magnetic_species, bulk_modulus, shear_modulus, universal_anisotropy, homogeneous_poisson, e_total, e_ionic, e_electronic, n, e_ij_max, weighted_surface_energy_EV_PER_ANG2, weighted_surface_energy, weighted_work_function, surface_anisotropy, shape_factor, has_reconstructed, possible_species, has_props, theoretical, database_IDs.

However, I then used the following code (Code 2) to check the available fields:

from mp_api.client import MPRester

with MPRester(api_key=API_key, monty_decode=False, use_document_model=False) as mpr:  
    docs = mpr.molecules.summary.available_fields  
    print(docs)

The output fields were:
'builder_meta', 'charge', 'spin_multiplicity', 'natoms', 'elements', 'nelements', 'nelectrons', 'composition', 'composition_reduced', 'formula_alphabetical', 'formula_pretty', 'formula_anonymous', 'chemsys', 'symmetry', 'species_hash', 'coord_hash', 'property_name', 'property_id', 'molecule_id', 'deprecated', 'deprecation_reasons', 'level_of_theory', 'solvent', 'lot_solvent', 'last_updated', 'origins', 'warnings', 'molecules', 'molecule_levels_of_theory', 'inchi', 'inchi_key', 'task_ids', 'similar_molecules', 'constituent_molecules', 'unique_calc_types', 'unique_task_types', 'unique_levels_of_theory', 'unique_solvents', 'unique_lot_solvents', 'thermo', 'vibration', 'orbitals', 'partial_charges', 'partial_spins', 'bonding', 'multipole_moments', 'redox', 'metal_binding', 'has_props'.

Moreover, the Molecules Explorer API example on the website, mpr.molecules.summary.search(molecule_ids=["042b6da7a6eb790fd5038f3729ef715c-C5H8O3-m1-2"]), passes molecule_ids as an argument. However, neither of the outputs above contains a molecule_ids field.
Above all, how can I download the complete molecules dataset with the API?
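The mismatch between the two field lists can be made explicit with a set comparison. The following is a minimal sketch that uses abbreviated copies of the two lists quoted in this post; `diff_fields` is a hypothetical helper name, not part of mp-api.

```python
# Abbreviated copies of the two field lists quoted above (not the full lists).
csv_header = ["_id", "builder_meta", "nsites", "elements",
              "material_id", "band_gap", "has_props"]
available_fields = ["builder_meta", "charge", "spin_multiplicity", "natoms",
                    "elements", "molecule_id", "has_props"]


def diff_fields(returned, expected):
    """Hypothetical helper: compare returned column names with expected ones."""
    returned_set, expected_set = set(returned), set(expected)
    return {
        "unexpected": sorted(returned_set - expected_set),  # only in the CSV
        "missing": sorted(expected_set - returned_set),     # only in available_fields
        "shared": sorted(returned_set & expected_set),      # present in both
    }


report = diff_fields(csv_header, available_fields)
for key, names in report.items():
    print(f"{key}: {names}")
```

With the full lists, `unexpected` would hold the columns that appear only in the CSV header, and `missing` would hold the molecule fields that never reached the CSV.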

tschaume · December 3, 2024, 7:42am

Thanks for reporting this. We are aware of the underlying issue. Please see my comment in the mp-api repo and follow that GitHub issue for updates. For now, try upgrading your mp-api client to v0.44.0rc0.

hmqwjf · December 3, 2024, 7:59am

After running pip install mp-api==v0.44.0rc0, I get a new error:
mp_api.client.core.client.MPRestError: HTTPSConnectionPool(host='api.materialsproject.org', port=443): Max retries exceeded with url: /molecules/summary/?_all_fields=True&_limit=1000&_skip=123000 (Caused by ProtocolError('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')))
Retrieving MoleculeSummaryDoc documents: 21%|██ | 122000/577813 [06:18<23:33, 322.57it/s]
How can I solve this? Thanks!

hmqwjf · December 3, 2024, 8:01am

This is my code:
import pandas as pd

from mp_api.client import MPRester

with MPRester(api_key=API_key, monty_decode=False, use_document_model=False) as mpr:
    docs = mpr.molecules.summary.search()
    df = pd.DataFrame(docs)
    df.to_csv('molecules_summary.csv')
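A dropped connection partway through a long download is often handled by retrying with exponential backoff. The sketch below shows that generic pattern only: `retry_with_backoff` and `flaky_download` are hypothetical names, and the stand-in function simulates the failure instead of calling the API. In real use one might pass `lambda: mpr.molecules.summary.search()` as `fetch` and `retry_on=(MPRestError,)`, using the exception class named in the traceback above. That is untested here, and each retry would restart the download from the beginning.

```python
import time


def retry_with_backoff(fetch, attempts=5, base_delay=2.0,
                       retry_on=(ConnectionError,), sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Double the wait after every failure: 2 s, 4 s, 8 s, ...
            sleep(base_delay * 2 ** (attempt - 1))


# Stand-in for a flaky download: fails twice, then succeeds.
calls = {"count": 0}


def flaky_download():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("Remote end closed connection without response")
    return ["doc-1", "doc-2"]


waits = []  # record the requested delays instead of actually sleeping
docs = retry_with_backoff(flaky_download, sleep=waits.append)
print(docs, waits)
```

The `sleep` parameter is injectable so the pattern can be exercised without real delays. Retrying does not address a server-side cause, so upgrading the client remains the first thing to try.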

tschaume · December 4, 2024, 6:37pm

Yeah, that’s a side effect of the temporary fix. See my comment here. Stay tuned.

tschaume · December 12, 2024, 9:16pm

We’ve just released a new database version and updated mp-api library that should fix this issue. Please upgrade to mp-api==0.44.0 and try again. Thanks!
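Before retrying, it can help to confirm which mp-api version is actually installed in the active environment. This is a small sketch using the standard library's `importlib.metadata`; `installed_version` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("mp-api")
if found is None:
    print("mp-api is not installed in this environment")
else:
    print(f"mp-api {found} is installed")
```

If the printed version is older than 0.44.0, running `pip install --upgrade mp-api==0.44.0` in the same environment should bring it up to the release mentioned above.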

hmqwjf · December 19, 2024, 6:32am

Thanks a lot! The data is now the same as the data on your website.