Why are values such as band gap and formation energy different between task and summary?

Hello,

I don't understand why, when I query with

properties = ["uncorrected_energy_per_atom", "band_gap", "formation_energy_per_atom",
              "total_magnetization", "nsites", "energy_above_hull", "material_id",
              "structure", "grain_boundaries", "is_metal", "deprecated"]
docs = mpr.materials.summary.search(elements=["Li", "P", "Cl", "S"], fields=properties)

the band gap, formation energy, and energy above hull differ from what I get with

opt_task_types = [
    "GGA Static",
    "GGA Structure Optimization",
    # "GGA+U Static",
    # "GGA+U Structure Optimization",
]
task_doc = mpr.materials.tasks.search(task_ids=[tid], fields=["input", "output", "calcs_reversed", "task_id", "run_type"])
task = task_doc[-1]
output = task.output  # final results of the calculation
struct = output.structure
forces = output.forces
energy = output.energy
stress_kbar = output.stress
bandgap = output.bandgap

Could someone explain what is going on? The values should be the same. task_doc[1] is the optimization, and its output shows the final geometry/forces/stress. I want to understand how the data used to train CHGNet was generated - I am extracting the raw VASP energies.

Hey @Asif-Iqbal-Bhatti, multiple tasks are matched to a material, but not all of those tasks are used to construct the material's properties.

We often redo calculations for better data quality, but we keep all of the older tasks for users to access (for data provenance, stewardship, etc.).

For one of the materials you were querying for, mp-1040450, you can see on the task detail page that while there are two statics and a relaxation, only one of the statics is used to build materials properties. You can access the same task ID to material ID mapping from the API by querying mpr.materials.search(elements = ['Li','P','Cl','S'])
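
For instance, a minimal sketch of building that mapping (assuming an open MPRester session mpr as in the snippets above; task_ids is a field on the materials endpoint):

mat_docs = mpr.materials.search(
    elements=["Li", "P", "Cl", "S"],
    fields=["material_id", "task_ids"],
)

# Map each task ID to the material it was matched to
task_to_material = {
    str(tid): str(doc.material_id) for doc in mat_docs for tid in doc.task_ids
}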

In the case of CHGNet / MPtrj:

  • The input sets used are the MPRelaxSet and MPStaticSet in pymatgen (see the sketch after this list)
  • The CHGNet training data includes task IDs - you can directly download the tasks you need from those, or download all of the tasks from AWS S3 and then filter down to the ones you need
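
As a rough illustration of the first point, both input sets can be inspected directly in pymatgen (a minimal sketch; the structure below is a placeholder purely for demonstration):

from pymatgen.core import Lattice, Structure
from pymatgen.io.vasp.sets import MPRelaxSet, MPStaticSet

# Placeholder structure just to generate the input sets
structure = Structure(Lattice.cubic(4.2), ["Li", "Cl"], [[0, 0, 0], [0.5, 0.5, 0.5]])

relax_set = MPRelaxSet(structure)    # settings used for MP structure optimizations
static_set = MPStaticSet(structure)  # settings used for MP static calculations
print(relax_set.incar)
print(static_set.incar)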

Thank you for the explanation. So after relaxing the structure, MP runs a static calculation to compute the properties. That means that to train CHGNet/M3GNet, SevenNet, etc. on MPtrj data, the energy corrections come from the static calculations (the GGA and GGA+U corrections), while the trajectories come from the structure optimization calculations (raw forces/energies/stresses at each ionic step). That is why we have

task_doc = mpr.materials.tasks.search(task_ids=[tid], fields=["input", "output", "calcs_reversed", "task_id", "run_type"])
task = task_doc[-1]

with task_doc[1] instead of task_doc[0] (index 0 being the static and index 1 the optimization). In all cases, such as mpr.materials.(tasks/thermo/summary).search, the final raw energy is shown by default, without any processing.
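
For the trajectories, a hedged sketch of pulling raw per-ionic-step data out of a task document (field names follow the emmet TaskDoc schema; verify them against the documents you actually retrieve):

calc = task.calcs_reversed[0]  # calculations are stored most-recent-first
trajectory = [
    {
        "structure": step.structure,
        "energy": step.e_fr_energy,  # raw VASP free energy, no corrections
        "forces": step.forces,
        "stress": step.stress,
    }
    for step in calc.output.ionic_steps
]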

For now, I want to download the data by querying the API - I want to learn how to do it this way. People also often check the convergence and energy criteria across the many static runs.

Now for AWS S3 - this is new to me, as is extracting data from JSONL files. Is there an automated way to download all the files from a webpage? I see there are a lot of them, and clicking one by one would take a long time. Also, is there a way in pymatgen to extract data from JSONL files?

So after relaxing the structure, MP runs a static calculation to compute the properties.

That's often the case - for electronic structure calculations in VASP, properties like the density of states are not well converged, and require a subsequent static calculation for better resolution.

task_doc[1] instead of task_doc[0] (index 0 being the static and index 1 the optimization).

The tasks are not necessarily linked like that. At a minimum, you’d have to check that the output.structure of one task reasonably matches the input.structure of another.
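
A minimal sketch of that check with pymatgen's StructureMatcher (default tolerances; relax_task and static_task are assumed to be task documents fetched as above):

from pymatgen.analysis.structure_matcher import StructureMatcher

matcher = StructureMatcher()  # default ltol/stol/angle_tol
if matcher.fit(relax_task.output.structure, static_task.input.structure):
    print("This static calculation starts from the relaxed geometry")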

Again, sometimes we spot-check calculations and realize that something was done sub-optimally. In those cases, we might just rerun the final static.

Now for AWS S3 - this is new to me, as is extracting data from JSONL files. Is there an automated way to download all the files from a webpage?

Take a look at our documentation - there are many ways to automate downloads from S3.
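
For example, a sketch with boto3 using anonymous access (the MP buckets are public; the prefix here is illustrative - point it at whatever subset you need):

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Unsigned client, since the bucket allows anonymous reads
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="materialsproject-parsed", Prefix="tasks_atomate2/"):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        s3.download_file("materialsproject-parsed", key, key.replace("/", "_"))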

Personally, I prefer using the manifest.parquet file to determine which objects need to be retrieved. You can then use pandas read_json with the lines=True option to parse the jsonl.gz files:

import pandas as pd
import numpy as np

task_section = pd.read_json(
    "s3://materialsproject-parsed/tasks_atomate2/format=jsonl/nelements=9/symmetry_number=40/dt=2018-08-15-00-51-34-408170.jsonl.gz",
    lines=True,
)
# replace() returns a new DataFrame, so assign the result
task_section = task_section.replace({np.nan: None})
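
And a sketch of the manifest-driven selection (the manifest key below is an assumption based on the bucket layout - check the docs for the exact path; reading s3:// URLs with pandas requires s3fs):

import pandas as pd

# Assumed manifest location; anonymous access to the public bucket
manifest = pd.read_parquet(
    "s3://materialsproject-parsed/manifest.parquet",
    storage_options={"anon": True},
)

print(manifest.columns)  # inspect the metadata, then filter to the objects you need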