import requests
import json
base_url = 'http://nomad-lab.eu/prod/v1/api/v1'
# The response from the API is stored in the response variable
response1 = requests.post(
    f'{base_url}/entries/query',
    json={
        'query': {
            'results.properties.geometry_optimization.final_force_maximum': {
                'gt': 1.2e-10
            }
        },
        'required': {
            'run': {"calculation[-1]": {"forces": "*"}}
        }
    })
response1_json = response1.json()
response1_json['data'][-1]['quantities']
Explanation
The geometry_optimization and forces related data are stored in the variable above. However, it seems that there are only numerical values of 'cell_volume' and 'lattice_parameters'. I'm wondering how to retrieve the numerical values of the forces, which should be a vector with one row and three columns (for example (a, b, c)), indicating the three directions x, y, z for each atom in the crystal (e.g. FeO), at each equilibrium step of each type of calculation.
There are two different APIs targeting different data. The entries API (entries/query) will only return a thin top layer of metadata. This is basically what is stored in the sections metadata and results when you look at any entry in the 'data' tab (e.g. here). Here you will only find what quantities exist in the entry (under the quantities key), but not the values of these quantities.
To target the actual values (or anything that is stored in section run, like run/calculation/forces/total/value), you have to use the archive API (entries/archive/query). Here, required works the way you wanted to use it, with a dictionary containing the * placeholder.
This would be your example, extracting the force value vector:
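Roughly, that could look like the following sketch, assuming the archive of each result is exposed under an 'archive' key in the response data (adjust to the actual response structure you get back):

response2 = requests.post(
    f'{base_url}/entries/archive/query',
    json={
        'query': {
            'results.properties.geometry_optimization.final_force_maximum': {
                'gt': 1.2e-10
            }
        },
        'required': {
            'run': {"calculation[-1]": {"forces": "*"}}
        }
    })
response2_json = response2.json()
# The forces sit under run/calculation/forces/total/value: one (x, y, z)
# vector per atom, here for the last calculation of the first returned entry.
archive = response2_json['data'][0]['archive']
forces = archive['run'][-1]['calculation'][-1]['forces']['total']['value']
print(forces)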
Hi Markus, thanks a lot! I have two other questions:
In the required part, we specify the last calculation, i.e. calculation[-1]. So it is easy to check that the length of archive['run'][-1]['calculation'] is 1:
len(archive['run'][-1]['calculation'])==1
If we don't specify the last calculation in the required part, and replace calculation[-1] with calculation, it would return 12 results in this example:
len(archive['run'][-1]['calculation'])==12
Meanwhile, when I explore data on NOMAD, I navigate in this way: Explore → Entries, then I click a certain result and 'GO TO THE ENTRY PAGE', and I find that there is a 'Geometry optimization' figure in the card on the right, with 'Step number' on its X-axis.
So my question is: is 'calculation' exactly what the 'Step number' in the 'Geometry optimization' figure means?
In the above example, there is only one element in the list archive['run']:
len(archive['run'])==1
However, we didn't specify the last run (like run[-1]) in the required part, as we can do with calculation or calculation[-1]. My question is: how many runs can be retrieved? Or is there actually only one run for each entry_id?
By the way, could you please clarify the difference among entry_id, upload_id, and material_id? Any help would be appreciated!
The required part allows you to specify what you want to download from the API. Therefore, you get more or fewer calculations depending on your required.
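For instance, a small illustration of the two required dictionaries discussed above:

# Requiring only the last calculation returns a 'calculation' list of length 1:
required_last = {'run': {'calculation[-1]': {'forces': '*'}}}
# Requiring all calculations returns every stored calculation (12 in the example above):
required_all = {'run': {'calculation': {'forces': '*'}}}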
Hi Markus, thanks for your explanation. When I applied the code to retrieve the desired data, it only returned 10 results by default, which means page_size = 10, the maximum number of items contained in one response:
len(response.json()['data'])==10
My first question is how to change the default setting page_size=10 in pagination, so that it returns ALL available data that meet the above criteria.
When I print out the response, I notice that there is another parameter, order_by. My second question is how to have the output ordered by time, for example by changing order_by from entry_id to a date, so that the data is returned chronologically.
The request is structured analogously to the response, and your request object can contain a pagination key with an object that specifies page_size and order_by. Of course, you cannot retrieve all data with one request and you will need to go through a loop. You can read about pagination here: Using the APIs - Documentation. You can play with the page_size, but usually you should not go higher than 1000. If you go too high, there either is a maximum, or the thread answering your request will run out of memory or run into a timeout.
The key that the search interface on the UI is using to order by time is upload_create_time. Also on the search page of the UI there is this <> button on top of the filter that shows you the API requests that the UI is performing. The pagination here works the same as the pagination on the archive endpoints. This is copied from there as an example:
The page_after_value is a combination of the entry_id and the order_by: upload_create_time value. Basically gives you the next 20 values after the entry with these values.
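As an illustration only (the page_after_value below is a placeholder for the string the API returns as next_page_after_value, not a real value), such a request body might look like:

illustrative_request = {
    'query': {},
    'pagination': {
        'page_size': 20,
        'order_by': 'upload_create_time',
        # placeholder: in practice, copy next_page_after_value from the previous response
        'page_after_value': '<upload_create_time of last entry>:<entry_id of last entry>'
    }
}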
Hi Markus, I really appreciate your patience. I looked through the NOMAD documentation and haven't found an efficient way to retrieve ALL data. In my case, a POST request is continuously sent to the NOMAD API with the query, required fields, and pagination settings (e.g. page_size=10), and the retrieved data is stored into a JSON file in batches (e.g. every 100 entries). To make the program resumable, I also store the current page_after_value and a to a JSON file every time a batch of data is stored. If the program is interrupted and restarted, it will first look for the most recent file and start from the saved page_after_value and a. HOWEVER, this still seems to take endless time to retrieve ALL the data from the NOMAD database.
import requests
import json
from tqdm.auto import tqdm
import time
import os
base_url = 'http://nomad-lab.eu/prod/v1/api/v1'
query = {
    'results.properties.geometry_optimization.final_force_maximum': {
        'gt': 1.2e-10  # where the value is greater than 1.2e-10
    }
}
required = {
    'run': {
        "calculation": {"forces": "*", "stress": "*", "energy": "*"},
        "system": {"atoms": {"species": "*", "positions": "*", "lattice_vectors": "*"}}
    }
}
page_size = 10
order_by = 'upload_create_time'
data_store = []

# Loading saved progress: check if there's a saved progress file.
# If it exists, it loads the page_after_value and a (the request counter) from this file.
if os.path.exists('progress.json'):
    with open('progress.json', 'r') as f:
        progress = json.load(f)
    page_after_value = progress['page_after_value']
    print("page_after_value =", page_after_value)
    print('\n')
    a = progress['a']
    print("a =", a)
# If the file doesn't exist, it sets page_after_value to None and a to 0.
else:
    page_after_value = None
    a = 0

# The progress bar and loop now start from the saved value of a, instead of always
# starting from 0, if the script was interrupted and restarted.
m = 10109108
for i in range(a, m):  # starts from 'a'
    a += page_size
    with tqdm(total=m, initial=a) as pbar:
        for i in range(a, m):
            a += 1
            pbar.update(1)
            try:
                response = requests.post(
                    f'{base_url}/entries/archive/query',
                    json=dict(
                        query=query,
                        required=required,
                        pagination=dict(page_after_value=page_after_value, page_size=page_size, order_by=order_by)
                    ))
                response_json = response.json()
                if len(response_json['data']) == 0:
                    print("response_json['data']=0")
                    break
                data_store.append(response_json['data'])
                page_after_value = response_json['pagination']['next_page_after_value']
                # Save the current progress to a file every time the retrieved data is saved.
                if a % 100 == 0:
                    with open(f'data_store_{a}_by_date.json', 'w') as f:
                        json.dump(data_store, f)
                    # Save the current page_after_value and a to the progress file.
                    # If the script is interrupted, the most recent page_after_value and a will be saved.
                    with open('progress.json', 'w') as f:
                        json.dump({'page_after_value': page_after_value, 'a': a, 'entry_id': response_json["data"][-1]["entry_id"]}, f)
                    data_store = []
            except Exception as e:
                print(f"error occurred: {e}")
                time.sleep(2)
                continue

# Retrieve the remaining data that hasn't been written to file in one batch (m % 100).
if len(data_store) > 0:
    with open(f'data_store_{a}.json', 'w') as f:
        json.dump(data_store, f)
I am afraid it is not a super fast process. There are a few things that might speed it up:
1. You are requiring all calculations and all systems, not just the optimization results. Is this really what you want? These are ~100 million.
2. If you basically want the whole archives, you can try this endpoint: /uploads/{upload_id}/archive/{entry_id}. You have to do a search first to get the entry and upload ids (see the sketch after this list).
3. We are currently testing a new implementation of the underlying file format that hopefully resolves some of the performance problems. But this requires that we reprocess all the data first. I don't think this will be usable for another 3 months.
4. There are plans for an API that exports uploads as a whole. The intention is to allow mirrors and things like what you are doing. There is some implementation that you can test: /uploads/{upload_id}/bundle. You are only interested in include_archive_files. This will provide zip files that contain the archive data in MessagePack format.
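For suggestion 2, a minimal sketch of the two-step search-then-download flow could look like this (the exact shape of the returned archive JSON should be checked against the API documentation):

import requests

base_url = 'http://nomad-lab.eu/prod/v1/api/v1'

# Step 1: a fast metadata search to collect entry and upload ids.
search = requests.post(
    f'{base_url}/entries/query',
    json={
        'query': {
            'results.properties.geometry_optimization.final_force_maximum': {'gt': 1.2e-10}
        },
        'pagination': {'page_size': 10}
    }).json()

# Step 2: download the full archive of each entry individually.
for entry in search['data']:
    upload_id = entry['upload_id']
    entry_id = entry['entry_id']
    archive_json = requests.get(
        f'{base_url}/uploads/{upload_id}/archive/{entry_id}').json()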
There might be other ways to give you the data. It would be great to learn a bit about the background and what you are trying to do. Please write to [email protected].
I think there might be a few problems with this script, which also make the time estimate unreasonable:
1. You might be looping too much: there are two nested loops that both run from a to m. I don't think this is supposed to happen.
2. You are making an API request for each of the 10 109 108 results. Make batched API calls instead.
3. Your page size is quite small: this results in more disk writes and API calls. Try increasing it.
4. Are you sure you want entries with final_force_maximum greater than 1.2e-10 N? Typically you would be interested in the ones where this optimization force is smaller than some value.
Here is a modified version (does not address issue #4). Maybe this will help with getting the results faster?
import requests
import math
import json
from tqdm.auto import tqdm
import time
import os
base_url = 'http://nomad-lab.eu/prod/v1/api/v1'
query = {
    'results.properties.geometry_optimization.final_force_maximum': {
        'gt': 1.2e-10  # where the value is greater than 1.2e-10
    }
}
required = {
    'run': {
        "calculation": {"forces": "*", "stress": "*", "energy": "*"},
        "system": {"atoms": {"species": "*", "positions": "*", "lattice_vectors": "*"}}
    }
}
order_by = 'upload_create_time'
data_store = []

# Loading saved progress: check if there's a saved progress file.
# If it exists, it loads the page_after_value and a (the request counter) from this file.
if os.path.exists('progress.json'):
    with open('progress.json', 'r') as f:
        progress = json.load(f)
    page_after_value = progress['page_after_value']
    print("page_after_value =", page_after_value)
    print('\n')
    a = progress['a']
    print("a =", a)
# If the file doesn't exist, it sets page_after_value to None and a to 0.
else:
    page_after_value = None
    a = 0

# The progress bar and loop now start from the saved value of a, instead of always
# starting from 0, if the script was interrupted and restarted.
n_total = requests.post(
    f'{base_url}/entries/archive/query',
    json=dict(
        query=query,
        required={},
        pagination=dict(page_size=0, order_by=order_by)
    )).json()['pagination']['total']

batch_size = 250
save_temp_size = 10 * batch_size
n_batches = math.ceil(n_total / batch_size)
initial_batch = math.ceil(a / batch_size)

with tqdm(total=n_total, initial=a) as pbar:
    for i_batch in range(initial_batch, n_batches):
        try:
            response = requests.post(
                f'{base_url}/entries/archive/query',
                json=dict(
                    query=query,
                    required=required,
                    pagination=dict(page_after_value=page_after_value, page_size=batch_size, order_by=order_by)
                ))
            response_json = response.json()
            n_data = len(response_json['data'])
            if n_data == 0:
                print("response_json['data']=0")
                break
            a += n_data
            pbar.update(n_data)
            data_store.append(response_json['data'])
            page_after_value = response_json['pagination'].get('next_page_after_value')
            # Save the current progress to a file once in a while or when no
            # new results are available.
            if a % save_temp_size == 0 or not page_after_value:
                with open(f'data_store_{a}_by_date.json', 'w') as f:
                    json.dump(data_store, f)
                # Save the current page_after_value and a to the progress file.
                # If the script is interrupted, the most recent page_after_value and a will be saved.
                with open('progress.json', 'w') as f:
                    json.dump({
                        'page_after_value': page_after_value,
                        'a': a,
                        'entry_id': response_json["data"][-1]["entry_id"]
                    },
                    f)
                data_store = []
        except Exception as e:
            print(f"error occurred: {e}")
            time.sleep(2)
            continue
Also, as @mscheidgen mentioned earlier, it should be noted that the entries/query API endpoint will be much faster in retrieving data, but it can only serve a subset of the data, mostly things stored under results. Here is a script that uses that endpoint to retrieve results.properties.geometry_optimization.final_force_maximum:
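A minimal sketch of such a script, reusing the pagination pattern from above, might look like this (the 'include' form of required for the entries endpoint is an assumption here; check the documentation for the exact options):

import requests

base_url = 'http://nomad-lab.eu/prod/v1/api/v1'
query = {
    'results.properties.geometry_optimization.final_force_maximum': {'gt': 1.2e-10}
}
# Assumed 'include' form for the entries endpoint: only fetch the listed metadata keys.
required = {
    'include': ['entry_id', 'results.properties.geometry_optimization.final_force_maximum']
}

values = []
page_after_value = None
while True:
    response = requests.post(
        f'{base_url}/entries/query',
        json=dict(
            query=query,
            required=required,
            pagination=dict(
                page_size=1000,
                order_by='upload_create_time',
                page_after_value=page_after_value
            )
        )).json()
    data = response['data']
    if not data:
        break
    for entry in data:
        values.append(
            entry['results']['properties']['geometry_optimization']['final_force_maximum'])
    page_after_value = response['pagination'].get('next_page_after_value')
    if not page_after_value:
        break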
This executes quite a bit faster (tens of minutes vs. days), but as mentioned it cannot run very flexible queries that target arbitrary archive contents.
Hi Lauri, thanks so much for your support! I'm also wondering how to get access to the chemical formula and the different simulation methods for each data point. I would like to find all desired data together with the corresponding chemical formula and all used simulation methods.
I looked through the documentation and tried to write a script in this way:
One problem is that the chemical formulas are not stored under system.chemical_formula, but under the key system.chemical_composition_hill. You can browse our metainfo definitions to inspect which values are defined. I do realize that it is a bit inconvenient that the names of the quantities we use across the data do not always match; we will try to improve on this.
The full details about the used simulation method are stored under run.method. This information may be a bit overwhelming in some cases, so a summary of the used method is stored under results.method. The summary does not include all of the details, but it might be more useful and is also faster to access if you use the entries/query endpoint.
Here is an example of a required setup that returns the Hill formula for each system along with a summary of the simulation method used for the calculation:
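A sketch of such a required block, based only on the keys mentioned above (run.system.chemical_composition_hill and the results.method summary) together with the quantities requested earlier in the thread, might be:

required = {
    'run': {
        'calculation': {'forces': '*', 'stress': '*', 'energy': '*'},
        'system': {
            'chemical_composition_hill': '*',  # Hill formula of each system
            'atoms': {'species': '*', 'positions': '*', 'lattice_vectors': '*'}
        }
    },
    'results': {
        'method': '*'  # summary of the used simulation method
    }
}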