Force data of crystal and solid state materials in geometry optimization

Python script

import requests
import json

base_url = 'http://nomad-lab.eu/prod/v1/api/v1'

# The response from the API is stored in the response variable
response1 = requests.post(
    f'{base_url}/entries/query',
    json={
        'query': {
            'results.properties.geometry_optimization.final_force_maximum': {
                'gt': 1.2e-10
            }
        },
        'required': {
            'run': {"calculation[-1]": {"forces": "*"}}
        }
    })

response1_json = response1.json()

response1_json['data'][-1]['quantities']

Explanation

The geometry_optimization and forces related data are stored in the variable above. However, it seems that there are only numerical values of 'cell_volume' and 'lattice_parameters'. I'm wondering how to retrieve the numerical values of the forces data, which should be a vector with one row and three columns (for example (a, b, c)), indicating the three directions x, y, z for each atom in the crystal (e.g. FeO), at each equilibrium and for each type of calculation.

There are two different APIs targeting different data. The entries API (entries/query) will only return a thin top layer of metadata. This is basically what is stored in the sections metadata and results when you look at any entry in the "data" tab (e.g. here). Here you will only find which quantities exist in the entry (under the quantities key), but not the values of these quantities.
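For instance (just a sketch building on the entries/query response above; it assumes the quantities list holds dotted section paths), you can check which force-related quantities an entry contains, but you cannot read their values from this endpoint:

entry = response1_json['data'][-1]
# Only the names/paths of existing quantities are listed here, no numerical values
force_quantities = [q for q in entry['quantities'] if 'forces' in q]
print(force_quantities)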

To target the actual values (or anything that is stored in section run, like run/calculation/forces/total/value), you have to use the archive API (entries/archive/query). Here the required works the way you intended, using a dictionary with the * placeholder.

This would be your example, extracting the force value vector:

import requests

base_url = 'http://nomad-lab.eu/prod/v1/api/v1'

response = requests.post(
    f'{base_url}/entries/archive/query',
    json={
        'query': {
            'results.properties.geometry_optimization.final_force_maximum': {
                'gt': 1.2e-10
            }
        },
        'required': {
            'run': {"calculation[-1]": {"forces": "*"}}
        }

    })

archive = response.json()['data'][-1]['archive']

print(
    archive['run'][-1]['calculation'][-1]['forces']['total']['value'])

Hi Markus, thanks a lot! I have two other questions:

  1. In the 'required' part, we specify the last calculation, i.e. 'calculation[-1]'. So it is easy to check that the length of archive['run'][-1]['calculation'] is 1:
len(archive['run'][-1]['calculation'])==1

If we don't specify the last calculation in the 'required' part and replace 'calculation[-1]' with 'calculation', it returns 12 results in this example:

len(archive['run'][-1]['calculation'])==12

Meanwhile, when I explore data on NOMAD, I navigate in this way: Explore, then Entries, then I click a certain result and 'GO TO THE ENTRY PAGE', where I find a 'Geometry optimization' figure in the card on the right. On the x-axis of the figure there is a 'Step number'.

So my question is: is 'calculation' exactly what the 'Step number' in the 'Geometry optimization' figure means?

My second question:

In the above example, there is only one element in the list archive['run']:

len(archive['run'])==1

However, we didn't specify the last 'run' as 'run[-1]' in the 'required' part, like we can do with 'calculation' vs. 'calculation[-1]'. My question is: how many 'run' sections can be retrieved? Or is there actually only one 'run' for each 'entry_id'?

By the way, could you please clarify the difference among 'entry_id', 'upload_id', and 'material_id'? Any help would be appreciated!

Entries with more than 1 run are a very rare exception. Therefore, asking for run or run[-1] practically makes no difference.

Typically the calculation array comprises the optimization steps. Those entries also have a workflow and/or workflow2 section (we are currently transitioning between the two models). The workflow section gives you more details and contains information about the optimization (if an optimization was performed). Here is an example for workflow: https://nomad-lab.eu/prod/v1/staging/gui/search/entries/entry/id/zlbgPUoF2Ll2jiZxwvldBdYSRaga/data/workflow/0/geometry_optimization

The required part allows you to specify what you want to download from the API. Therefore, you get fewer or more calculations depending on your required.
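For example (a small sketch of the two variants discussed above), the difference is only in the required you send along with the query:

# Download only the last calculation of each run (1 element in the example above)
required_last = {'run': {'calculation[-1]': {'forces': '*'}}}

# Download all calculations (all 12 optimization steps in the example above)
required_all = {'run': {'calculation': {'forces': '*'}}}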

Hi Markus, thanks for your explanation. When I applied the code to retrieve the desired data, it only returned 10 results by default, which means page_size=10 is the maximum number of items contained in one response:

len(response.json()['data'])==10

My first question is how to change the default setting page_size=10 in 'pagination', so that it returns ALL available data that meet the above criteria.

I guess ALL available entries with 'geometry_optimization' number approximately 10 million.

When I print out the response, I notice there is another parameter, 'order_by'. My second question is: how do I set the output to be ordered by time, for example by changing 'order_by' from 'entry_id' to 'date', so that it returns data chronologically?

The request is structured analogously to the response, and your request object can contain a pagination key with an object that specifies page_size and order_by. Of course, you cannot retrieve all data with one request, and you will need to go through a loop. You can read about pagination here: Using the APIs - Documentation. You can play with the page_size, but usually you should not go higher than 1000. If you go too high, there either is a maximum, or the thread answering your request will run out of memory or run into a timeout.

The key that the search interface in the UI uses to order by time is upload_create_time. Also, on the search page of the UI there is a <> button on top of the filters that shows you the API requests the UI is performing. The pagination there works the same as the pagination on the archive endpoints. This is copied from there as an example:

  "pagination": {
    "page_size": 20,
    "order_by": "upload_create_time",
    "order": "desc",
    "page_after_value": "1687860023173:zWYrEriwMXtlamNIZZmN0ym771ab",
  }

The page_after_value is a combination of the order_by value (here upload_create_time) and the entry_id. It basically gives you the next 20 entries after the entry with these values.
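Put together, a minimal pagination loop could look roughly like this (just a sketch; the query and required are the ones from earlier in this thread, and the page_size of 20 matches the UI example above):

import requests

base_url = 'http://nomad-lab.eu/prod/v1/api/v1'
page_after_value = None

while True:
    response = requests.post(
        f'{base_url}/entries/archive/query',
        json={
            'query': {
                'results.properties.geometry_optimization.final_force_maximum': {'gt': 1.2e-10}
            },
            'required': {'run': {'calculation[-1]': {'forces': '*'}}},
            'pagination': {
                'page_size': 20,
                'order_by': 'upload_create_time',
                'page_after_value': page_after_value
            }
        })
    response_json = response.json()
    if not response_json['data']:
        break
    # ... process response_json['data'] here ...
    page_after_value = response_json['pagination'].get('next_page_after_value')
    if page_after_value is None:
        break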

Hi Markus, I really appreciate your patience. I looked through the NOMAD documentation and haven't found an efficient way to retrieve ALL the data. In my case, a POST request is continuously sent to the NOMAD API with the query, required fields, and pagination settings (e.g. page_size=10), and the retrieved data is stored in a JSON file in batches (e.g. every 100 entries). To make the program resumable, I also store the current 'page_after_value' and 'a' to a JSON file every time a batch of data is stored. If the program is interrupted and restarted, it will first look for the most recent file and start from the saved 'page_after_value' and 'a'. HOWEVER, this still seems to take endless time to retrieve ALL the data from the NOMAD database.

import requests
import json
from tqdm.auto import tqdm
import time
import os

base_url = 'http://nomad-lab.eu/prod/v1/api/v1'
query = {
    'results.properties.geometry_optimization.final_force_maximum': {
        'gt': 1.2e-10 # where the value is greater than 1.2e-10
    }
}

required = {
    'run': {
        "calculation": {"forces": "*", "stress": "*", "energy": "*"}, 
        "system": {"atoms": {"species": "*", "positions": "*", "lattice_vectors": "*"}}
    }
}

page_size = 10
order_by = 'upload_create_time'

data_store = []


# Loading Saved Progress: checks for a saved progress file. 

# Check if there's a saved progress file
# If it exists, it loads the page_after_value and a (the request counter) from this file
if os.path.exists('progress.json'):
    with open('progress.json', 'r') as f:
        progress = json.load(f)
        page_after_value = progress['page_after_value']
        print("page_after_value =", page_after_value )
        print('\n')
        a = progress['a']
        print("a=", a)
        
# If the file doesn't exist, it sets page_after_value to None and a to 0.
else:
    page_after_value = None
    a = 0


# progress bar and loop now start from the saved value of a, instead of always starting from 0, if the script was interrupted and restarted
m = 10109108 
for i in range(a, m):  # starts from 'a'
    a += page_size
    
    with tqdm(total=m, initial=a) as pbar:
        for i in range(a, m):
            a += 1
            pbar.update(1)

            try:
                response = requests.post(
                    f'{base_url}/entries/archive/query',
                    json=dict(
                        query=query,
                        required=required,
                        pagination=dict(page_after_value=page_after_value, page_size=page_size, order_by=order_by)
                ))

                response_json = response.json()

                if len(response_json['data']) == 0:
                    print("response_json['data']=0")
                    break

                data_store.append(response_json['data'])
                page_after_value = response_json['pagination']['next_page_after_value']


                # save the current progress to a file every time it saves the retrieved data
                if a % 100 == 0:
                    with open(f'data_store_{a}_by_date.json', 'w') as f:
                        json.dump(data_store, f)

                    #  saves the current page_after_value and a to the progress file
                    # if the script is interrupted, the most recent page_after_value and a will be saved.
                    with open('progress.json', 'w') as f:
                        json.dump({'page_after_value': page_after_value, 'a': a, 'entry_id':  response_json["data"][-1]["entry_id"]}, f)

                    data_store = []


            except Exception as e:
                print(f"error occurred: {e}")
                time.sleep(2)
                continue

        # retrieve the remaining data that hasn't been written to file in one batch(m%100)
        if len(data_store) > 0:
            with open(f'data_store_{a}.json', 'w') as f:
                json.dump(data_store, f)

Is there any way to speed up retrieving the data, or do you know where the bottleneck limiting the retrieval speed might be?

The NOMAD user interface shows that there are around 12 million entries in total, and the total number of my desired entries is around 10 million.

I believe there might be more direct ways to retrieve ALL desired data, with the Python 'requests' library or others, or with administrator access?

Although we can download the raw files from NOMAD, I don't think that is an elegant and smart way to retrieve ALL data that meet the above criteria.

I am afraid it is not a super fast process. There are a few things that might improve the process:

  • You are requiring all calculations and all systems, not just the optimization results. Is this really what you want? These are ~100 million.
  • If you basically want the whole archives, you can try this endpoint: /uploads/{upload_id}/archive/{entry_id}. You have to do a search first to get the entry and upload ids; see the sketch after this list.
  • We are currently testing a new implementation of the underlying file format that hopefully resolves some of the performance problems. But this requires that we reprocess all the data first. I don’t think this will be usable for another 3 months.
  • There are plans for an API that exports uploads as a whole. The intention is to allow mirrors and use cases like yours. There is some implementation that you can test: /uploads/{upload_id}/bundle. You are only interested in include_archive_files. This will provide zip files that contain the archive data in message pack format.
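A rough sketch of the second point (first search for the ids, then download the full archive of each entry; this assumes the /uploads/{upload_id}/archive/{entry_id} endpoint accepts a plain GET and returns JSON, so check the API docs for the exact response structure):

import requests

base_url = 'http://nomad-lab.eu/prod/v1/api/v1'

# Step 1: use the (fast) entries/query endpoint to collect entry and upload ids
search = requests.post(
    f'{base_url}/entries/query',
    json={
        'query': {
            'results.properties.geometry_optimization.final_force_maximum': {'gt': 1.2e-10}
        },
        'required': {'include': ['entry_id', 'upload_id']},
        'pagination': {'page_size': 10}
    }).json()

# Step 2: download the whole archive of each entry individually
for entry in search['data']:
    archive_response = requests.get(
        f"{base_url}/uploads/{entry['upload_id']}/archive/{entry['entry_id']}"
    ).json()
    # ... process the returned archive here ...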

There might be other ways to give you the data. It would be great to learn a bit about the background and what you are trying to do. Please write to [email protected].

Hi @JayLiu1!

I think there might be a few problems with this script, which also make the time estimate unreasonable:

  1. You might be looping too much: there are two nested loops that both go from a to m. I don’t think this is supposed to happen.
  2. You are making an API request for each of the 10 109 108 results. Make batched API calls instead.
  3. Your page size is quite small: this will result in more disk writes and API calls. Try increasing it.
  4. Are you sure you want entries with final_force_maximum greater than 1.2e-10 N? Typically you would be interested in the ones where this optimization force is smaller than some value.

Here is a modified version (does not address issue #4). Maybe this will help with getting the results faster?

import requests
import math
import json
from tqdm.auto import tqdm
import time
import os

base_url = 'http://nomad-lab.eu/prod/v1/api/v1'
query = {
    'results.properties.geometry_optimization.final_force_maximum': {
        'gt': 1.2e-10 # where the value is greater than 1.2e-10
    }
}

required = {
    'run': {
        "calculation": {"forces": "*", "stress": "*", "energy": "*"}, 
        "system": {"atoms": {"species": "*", "positions": "*", "lattice_vectors": "*"}}
    }
}

order_by = 'upload_create_time'

data_store = []


# Loading Saved Progress: checks for a saved progress file. 

# Check if there's a saved progress file
# If it exists, it loads the page_after_value and a (the request counter) from this file
if os.path.exists('progress.json'):
    with open('progress.json', 'r') as f:
        progress = json.load(f)
        page_after_value = progress['page_after_value']
        print("page_after_value =", page_after_value )
        print('\n')
        a = progress['a']
        print("a=", a)
# If the file doesn't exist, it sets page_after_value to None and a to 0.
else:
    page_after_value = None
    a = 0


# progress bar and loop now start from the saved value of a, instead of always
# starting from 0, if the script was interrupted and restarted
n_total = requests.post(
    f'{base_url}/entries/archive/query',
    json=dict(
        query=query,
        required={},
        pagination=dict(page_size=0, order_by=order_by)
)).json()['pagination']['total']
batch_size = 250
save_temp_size = 10 * batch_size
n_batches = math.ceil(n_total / batch_size)
initial_batch = math.ceil(a / batch_size)

with tqdm(total=n_total, initial=a) as pbar:
    for i_batch in range(initial_batch, n_batches):
        try:
            response = requests.post(
                f'{base_url}/entries/archive/query',
                json=dict(
                    query=query,
                    required=required,
                    pagination=dict(page_after_value=page_after_value, page_size=batch_size, order_by=order_by)
            ))
            response_json = response.json()

            n_data = len(response_json['data'])
            if n_data == 0:
                print("response_json['data']=0")
                break
            a += n_data
            pbar.update(n_data)

            data_store.append(response_json['data'])
            page_after_value = response_json['pagination'].get('next_page_after_value')

            # Save the current progress to a file once in a while or when no
            # new results are available
            if a % save_temp_size == 0 or not page_after_value:
                with open(f'data_store_{a}_by_date.json', 'w') as f:
                    json.dump(data_store, f)

                # saves the current page_after_value and a to the progress file
                # if the script is interrupted, the most recent page_after_value and a will be saved.
                with open('progress.json', 'w') as f:
                    json.dump({
                            'page_after_value': page_after_value,
                            'a': a,
                            'entry_id': response_json["data"][-1]["entry_id"]
                        },
                        f)

                data_store = []

        except Exception as e:
            print(f"error occurred: {e}")
            time.sleep(2)
            continue


Also, as @mscheidgen mentioned earlier, it should be noted that the entries/query API endpoint is much faster at retrieving data, but it can only serve a subset of the data, mostly things stored under results. Here is a script that uses that endpoint to retrieve results.properties.geometry_optimization.final_force_maximum:

import os
import math
import json
import requests
from tqdm.auto import tqdm

base_url = 'http://nomad-lab.eu/prod/v1/api/v1'
query = {
    'results.properties.geometry_optimization.final_force_maximum': {
        'gt': 1.2e-10
    }
}
required = {
    'include': [
        'results.properties.geometry_optimization.final_force_maximum'
    ]
}

order_by = 'upload_create_time'

n_total = requests.post(
    f'{base_url}/entries/query',
    json=dict(
        query=query,
        required={},
        pagination={'page_size': 0}
)).json()['pagination']['total']

a = 0
batch_size = 10000
save_temp_size = 10 * batch_size
n_batches = math.ceil(n_total / batch_size)
initial_batch = math.ceil(a / batch_size)
data_store = []
page_after_value = None

with tqdm(total=n_total, initial=a) as pbar:
    for i_batch in range(initial_batch, n_batches):
        response = requests.post(
            f'{base_url}/entries/query',
            json=dict(
                query=query,
                required=required,
                pagination={
                    'page_after_value': page_after_value,
                    'page_size': batch_size,
                    'order_by': order_by
                }
            )
        )
        response_json = response.json()
        data = response_json['data']
        n_data = len(data)
        a += n_data
        pbar.update(n_data)
        data_store.append(data)
        page_after_value = response_json['pagination'].get('next_page_after_value')

with open(f'data_store_{a}_by_date.json', 'w') as f:
    json.dump(data_store, f)

This executes quite a bit faster (tens of minutes vs. days), but as mentioned it cannot run very flexible queries that target arbitrary archive contents.


Hi Lauri, thanks so much for your support! I'm also wondering how to get access to the chemical formula and the different simulation methods of each data point. I would also like to retrieve all desired data together with the corresponding chemical formula and all the simulation methods used.

I looked through the documentation and tried to write the script in this way:

query = {
    'results.properties.available_properties:all': ["geometry_optimization"],
    "results.method.simulation.program_name": "*"
}

required = {
    'run': {
        "calculation": {"forces": "*", "stress": "*", "energy": "*"}, 
        "system": {
            "atoms": {"species": "*", "positions": "*", "lattice_vectors": "*"},
            "chemical_formula": "*"
        }
    }
}

but it doesn't seem to work. Any help would be greatly appreciated!

Hi @JayLiu1!

One problem is that the chemical formulas are not stored under system.chemical_formula, but under the key system.chemical_composition_hill. You can browse our metainfo definitions to inspect which values are defined. I do realize that it is a bit inconvenient that the names of quantities we use across the data do not always match; we will try to improve on this.

The full details about the simulation method used are stored under run.method: this information may be a bit overwhelming in some cases, so a summary of the used method is stored under results.method. This does not include all of the details, but it might be more useful and is also faster to access if you use the entries/query endpoint.

Here is an example of a required setup that returns the Hill formula for each system along with a summary of the simulation method used for the calculation:

required = {
    'run': {
        "calculation": {"forces": "*", "stress": "*", "energy": "*"}, 
        "system": {
            "atoms": {"species": "*", "positions": "*", "lattice_vectors": "*"},
            "chemical_composition_hill": "*"
        }
    },
    "results": {
        "method": "*"
    }
}
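Once the archive is downloaded with this required (for example via the entries/archive/query calls shown earlier in this thread), accessing the values could look roughly like this (a sketch; the indexing follows the run/system/results layout used above):

archive = response.json()['data'][-1]['archive']

# Hill formula of the last system of the last run
print(archive['run'][-1]['system'][-1]['chemical_composition_hill'])

# Summary of the simulation method used (results.method)
print(archive['results']['method'])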