Creating an accessible dataframe from MP-API query results

J0chin · August 25, 2022, 10:16pm

I am trying to convert the resulting query into a pandas dataframe as shown below, but format of the resulting dataframe looks off with the field name for fields I queried being in the tuples on each row instead of a being the title for each column Is there a simple way to get a dataframe like the last picture I have attached programmatically?

thanks in advance,

docs = mpr.summary.search(elements=["B"], exclude_elements=["N","C","Li","Ca","Mg","Na","K","Rb","Cs","Fr","Be","Sr","Ba","Ra","H","O","P","F","Cl"]     
                           ,fields=["formula_pretty", "density", "volume", "symmetry","material_id"])

df = pd.DataFrame(docs)
display(df)

Result of the displayed dataframe:

tschaume · August 25, 2022, 10:33pm

I think Dataframe.from_records() should work.

J0chin · August 25, 2022, 11:42pm

Hey @tschaume ,I tried this and got the same result, do I need us use some of the method parameters or combine this with another method?

tschaume · September 21, 2022, 7:46pm

Hi @J0chin,

here’s one way to go about it. It probably isn’t the most efficient and @munrojm might have better suggestions.

from flatten_dict import flatten

flattened = [{
    k: v for k, v in flatten(doc.dict(), reducer="dot").items() if k != "fields_not_requested"
} for doc in docs]

DataFrame.from_records(flattened)

HTH

kikin · May 18, 2023, 3:53pm

Hello,

I am having the exact same issue. I have tried a variety of pandas methods (str.replace, str.rstrip, str.lstrip lambda: x, …) but they completely take out all data in that column (resulting in NaN), including the numerical values I want. I’m wondering if this isn’t working because the data in the columns aren’t strings?

kikin · May 18, 2023, 4:09pm

Additionally, @tschaume 's solution does not work for my dataset

tschaume · May 18, 2023, 4:31pm

@kikin you’re trying to flatten a Dataframe whereas flatten takes a dictionary. Loop over the results from mpr.summary.search (docs in my example) and not the rows of a dataframe.

peterschindler · May 22, 2023, 4:50pm

Thanks for the useful code snippet!
I usually save the query results to a csv file and in that case I add the following line before exporting to csv. This makes it easier to load the structures later with Structure.from_dict or Structure.from_str.

data_df = DataFrame.from_records(flattened)

data_df['structure'] = data_df['structure'].apply(lambda x: x.to(fmt='json'))
data_df.to_csv('data.csv')

-Peter

Rudra_Banerjee · November 4, 2023, 10:07am

I have solved this issue by:

fields = [
    "material_id", "band_gap", "volume", 'formula_pretty', 'is_stable',
    'spacegroup'
]
with MPRester(apiid) as mpr:
    docs = mpr.materials.summary.search(chemsys="Si-O",
                                        band_gap=(0, 10),
                                        fields=fields)
 flattened = [{
      k: v
      for k, v in flatten(doc.dict(), reducer="dot").items()
      if k != "fields_not_requested"
  } for doc in docs]
 df = pd.DataFrame.from_records(flattened)
 df = df.drop(columns=[col for col in df if col not in fields])

But, IMHO, the problem can be easily solved if upstream considers not to write fileds_not_requested every time.

hakanfizik · December 25, 2023, 9:41pm

i solved my issue like this:

shortC=mpr.materials.summary.search(chemsys="C", fields=["material_id","energy_per_atom","structure"])

data = []
for doc in shortC:
    data.append({
        'material_id': doc.material_id,
        'energy_per_atom': doc.energy_per_atom,
        'structure': doc.structure
    })

dfC = pd.DataFrame(data)