Question regarding the correct energy field to reproduce energy above hull from the summary dataset

I downloaded the 2022-10-28 summary from AWS and attempted to reproduce the energy above hull values. However, using energy_per_atom, uncorrected_energy_per_atom, or formation_energy_per_atom does not give consistent results.
Could you clarify which energy field should be used to correctly reproduce the energy above hull?

Take a look at this post: the phase diagrams in MP are built from the thermo data, and thus from the energy_per_atom field, which includes corrections where any apply. You’ll also need to make sure the data you use is consistent in its “thermo type” (whether it’s from the GGA / GGA+U mixed hull, the r2SCAN-only hull, or the combined GGA / GGA+U / r2SCAN hull).
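To illustrate why the hull is taken over energy_per_atom values from one consistent set of entries, here is a minimal toy sketch for a two-element system. It is not the MP or pymatgen implementation (pymatgen's PhaseDiagram handles the general multi-element case); the A-B system, its compositions, and its energies are made up for illustration. Because the distance to the hull does not change when a linear reference is subtracted, the hull can be taken directly over (composition, energy_per_atom) points.

```python
def lower_hull(points):
    """Lower convex hull of (x, energy) points via Andrew's monotone chain."""
    best = {}
    for x, e in points:
        # Keep only the lowest-energy entry at each composition
        best[x] = min(e, best.get(x, e))
    hull = []
    for p in sorted(best.items()):
        while len(hull) >= 2:
            (x1, e1), (x2, e2) = hull[-2], hull[-1]
            cross = (x2 - x1) * (p[1] - e1) - (e2 - e1) * (p[0] - x1)
            if cross <= 0:  # hull[-1] lies on or above the chord, so drop it
                hull.pop()
            else:
                break
        hull.append(p)
    return hull


def energy_above_hull(x, energy, hull):
    """Vertical distance (eV/atom) from the point (x, energy) to the hull."""
    for (x1, e1), (x2, e2) in zip(hull, hull[1:]):
        if x1 <= x <= x2:
            hull_energy = e1 + (e2 - e1) * (x - x1) / (x2 - x1)
            return energy - hull_energy
    raise ValueError("x lies outside the hull's composition range")


# Hypothetical A-B system: name -> (fraction of B, energy_per_atom in eV/atom)
entries = {
    "A": (0.0, -1.0),
    "A3B": (0.25, -1.4),
    "AB": (0.5, -2.0),
    "B": (1.0, -2.0),
}
hull = lower_hull(entries.values())
e_hull = {name: energy_above_hull(x, e, hull) for name, (x, e) in entries.items()}
# A, AB and B lie on the hull; A3B sits about 0.1 eV/atom above it
```

If entries from different thermo types were mixed into `entries`, the energies would not share a common reference and the resulting distances would not be meaningful.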

Thanks for your quick response. I downloaded the data using
aws s3 cp --recursive --no-sign-request s3://materialsproject-build/collections/2022-10-28/summary mp_2022_summary, but I noticed that none of the entries contain a field indicating which exchange–correlation functional was used in the DFT calculations.

Further, I collected the thermo data with
aws s3 cp --recursive --no-sign-request s3://materialsproject-build/collections/2022-10-28/thermo/thermo_type=GGA_GGA+U/ 2022_10_28_thermo_GGA_GGA+U
and analysed the thermo_type field with the following code:

```python
import os
import gzip
import json
import pandas as pd

from pymatgen.core import Structure
from pymatgen.io.vasp import Poscar

count = 0
mp_ids = []
energy_per_atom = []
energy_per_atom_corrected = []
formation_energy = []
composition = []
e_hull = []
energy_type = []
# Walk every gzipped JSON-lines file under the downloaded prefix
for n_elements in os.listdir("2022_10_28_thermo_GGA_GGA+U"):
    for json_file in os.listdir(os.path.join("2022_10_28_thermo_GGA_GGA+U", n_elements)):
        with gzip.open(
            os.path.join("2022_10_28_thermo_GGA_GGA+U", n_elements, json_file), "rt"
        ) as f:
            for line in f:
                data = json.loads(line)
                count += 1
                # Record the thermo_type stored in each document
                energy_type.append(data["thermo_type"])
```
However, set(energy_type) gives me {'GGA_GGA+U', 'GGA_GGA+U_R2SCAN', 'R2SCAN'}. Does this mean that even though I downloaded only /thermo/thermo_type=GGA_GGA+U, some entries were still calculated with functionals other than GGA/GGA+U?
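One way to see how many documents of each thermo_type are present, and to keep only one type before building a hull, is to group the parsed records by that field. This is a minimal sketch: the sample lines are hypothetical stand-ins for the lines read from the gzipped files, and only the thermo_type and energy_per_atom field names come from this thread.

```python
import json


def split_by_thermo_type(lines):
    """Group parsed JSON-lines documents by their thermo_type field."""
    groups = {}
    for line in lines:
        doc = json.loads(line)
        groups.setdefault(doc["thermo_type"], []).append(doc)
    return groups


# Hypothetical records standing in for lines read from the downloaded files
sample_lines = [
    '{"thermo_type": "GGA_GGA+U", "energy_per_atom": -1.0}',
    '{"thermo_type": "R2SCAN", "energy_per_atom": -1.2}',
    '{"thermo_type": "GGA_GGA+U", "energy_per_atom": -2.0}',
]
groups = split_by_thermo_type(sample_lines)
counts = {thermo_type: len(docs) for thermo_type, docs in groups.items()}
# Keep a single thermo type so every energy refers to the same hull
gga_docs = groups.get("GGA_GGA+U", [])
```

Counting first shows whether the other types are a handful of stray records or a large share of the download.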

> but I noticed that none of the entries contain a field indicating which exchange–correlation functional was used in the DFT calculations.

Correct: the summary docs do not record the XC functional. You can see the data schemas we use for summary here. The thermo docs do contain it.

Alternatively, though this is not recommended, you can use the origins field of the SummaryDoc to find the task that was used for the energy, and then look up the XC functional associated with that task.
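A rough sketch of that lookup over plain dictionaries follows. Everything about the layout is an assumption to verify against the schema: that each origins entry is a mapping with "name" and "task_id" keys, that the energy entry is labelled "energy", and that a separate task-to-functional table (here the hypothetical `task_run_types`) is available. The IDs are placeholders.

```python
def energy_task_id(summary_doc, origin_name="energy"):
    """Return the task_id listed in `origins` for origin_name, or None."""
    # ASSUMPTION: origins is a list of mappings with "name" and "task_id" keys
    for origin in summary_doc.get("origins") or []:
        if origin.get("name") == origin_name:
            return origin.get("task_id")
    return None


# Hypothetical summary document and hypothetical task -> functional table
doc = {
    "material_id": "mp-0000",
    "origins": [
        {"name": "structure", "task_id": "mp-1111"},
        {"name": "energy", "task_id": "mp-2222"},
    ],
}
task_run_types = {"mp-2222": "GGA+U"}

task_id = energy_task_id(doc)
functional = task_run_types.get(task_id)
```

Since this needs a second lookup per material, reading the thermo docs directly is the simpler route.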

The 2022-10-28 collection is very out of date, and it may not be as well curated as our current data. Overall, I’d strongly recommend using a recent collection like 2025-09-25 or our Python API client, mp_api, to retrieve the most recent data. Obviously, if you need a specific past dataset, that advice doesn’t apply.
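For the API route, a query restricted to one thermo type might look roughly like the sketch below. The wrapper name `fetch_thermo_docs` is hypothetical, and the search arguments and field names should be checked against the installed mp_api version; the call needs a valid API key and network access.

```python
def fetch_thermo_docs(api_key, material_ids, thermo_type="GGA_GGA+U"):
    """Fetch thermo documents of a single thermo type for the given materials."""
    # Imported lazily so this module still loads where mp_api is not installed
    from mp_api.client import MPRester

    with MPRester(api_key) as mpr:
        # ASSUMPTION: argument and field names follow the current mp_api client
        return mpr.materials.thermo.search(
            material_ids=material_ids,
            thermo_types=[thermo_type],
            fields=[
                "material_id",
                "thermo_type",
                "energy_per_atom",
                "energy_above_hull",
            ],
        )
```

Requesting a single thermo type keeps every returned energy on the same hull, which is the consistency condition mentioned earlier in the thread.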

Thank you, I will try the latest version of the dataset.