You are doing the right things and have already uncovered a lot!
I assume that NOMAD is sending some invalid JSON for one (or some) of the entries. That wouldn't be so bad if ArchiveQuery handled it more gracefully. We will try to investigate this further and see what we can do.
In the meantime, you can try to go beyond ArchiveQuery. The ArchiveQuery is very specialised: it tries to parallelise multiple API calls for the specific purpose of downloading archive information, but it has its bugs and limitations. If you run into limitations with ArchiveQuery, you can access the NOMAD APIs more directly. For example with:
import requests
import json

# query VASP entries that do not contain any of the listed elements; only the
# last calculation and the last system of each entry are requested from the archive
response = requests.post(
    'http://nomad-lab.eu/prod/rae/api/v1/entries/archive/query', json={
        'query': {
            'and': [
                {
                    'dft.code_name': 'VASP',
                },
                {
                    'not': {
                        'atoms': {
                            'any': ["H", "C", "Li", "Na", "K", "Rb", "Cs"]
                        }
                    }
                }
            ]
        },
        'pagination': {
            'page_size': 10,
            'page_after_value': '----9KNOtIZc9bDFEWxgjeSRsJrC'
        },
        'required': {
            'section_run': {
                'section_single_configuration_calculation[-1]': {
                    'energy_total': '*'
                },
                'section_system[-1]': {
                    'chemical_composition_bulk_reduced': '*'
                }
            }
        }
    })
print(json.dumps(response.json(), indent=2))
This uses our new v1 API and might work more reliably. But here you have to paginate yourself. This example will give you page_size=10 results. The results will contain a next_page_after_value, which you need to populate page_after_value with in your next request. With a loop like this you can download lots of data. You can increase page_size, but you have to be careful not to run into timeouts if the requests become too large.
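A rough sketch of such a loop could look like the following. I am writing this from memory, so the exact place of the entries in the response (I assume a top-level data list) might need adjusting; print the raw JSON of a single request first if in doubt:

import requests

base_url = 'http://nomad-lab.eu/prod/rae/api/v1/entries/archive/query'

query = {
    'and': [
        {'dft.code_name': 'VASP'},
        {'not': {'atoms': {'any': ["H", "C", "Li", "Na", "K", "Rb", "Cs"]}}}
    ]
}
required = {
    'section_run': {
        'section_single_configuration_calculation[-1]': {'energy_total': '*'},
        'section_system[-1]': {'chemical_composition_bulk_reduced': '*'}
    }
}

page_after_value = None
while True:
    pagination = {'page_size': 100}
    if page_after_value is not None:
        pagination['page_after_value'] = page_after_value

    response = requests.post(base_url, json={
        'query': query,
        'pagination': pagination,
        'required': required
    })
    page = response.json()

    # do something with the entries of this page; the per-entry layout is an
    # assumption here, inspect one raw response to see what it actually contains
    for entry in page.get('data', []):
        print(entry)

    # the cursor for the next page; stop when there is none
    page_after_value = page.get('pagination', {}).get('next_page_after_value')
    if page_after_value is None:
        break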
Another thing: you can use section_single_configuration_calculation[-1] to only get the last instance of a repeated section. Many entries contain a lot of systems and calculations (with VASP those usually form a geometry optimisation).
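To illustrate the difference, compare these two required specs (the first is the one used in the request above; as far as I recall, leaving out the index returns every repetition of the section):

# only the last calculation of each run (as in the request above)
required_last_only = {
    'section_run': {
        'section_single_configuration_calculation[-1]': {'energy_total': '*'}
    }
}

# without the index every calculation should be included, which can be a lot
# of data for long geometry optimisations
required_all = {
    'section_run': {
        'section_single_configuration_calculation': {'energy_total': '*'}
    }
}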
If you are only interested in the formulas and don't care about energies, etc., you can also skip the archive and just access the basic metadata (formulas are part of this, energies are not).
import requests
import json

# same query as above, but against the plain entries endpoint; only a few
# metadata fields are requested via 'include'
response = requests.post(
    'http://nomad-lab.eu/prod/rae/api/v1/entries/query', json={
        'query': {
            'and': [
                {
                    'dft.code_name': 'VASP',
                },
                {
                    'not': {
                        'atoms': {
                            'any': ["H", "C", "Li", "Na", "K", "Rb", "Cs"]
                        }
                    }
                }
            ]
        },
        'required': {
            'include': [
                'formula',
                'encyclopedia.material.formula',
                'encyclopedia.material.formula_reduced'
            ]
        },
        'pagination': {
            'page_size': 10,
            'page_after_value': '----9KNOtIZc9bDFEWxgjeSRsJrC'
        }
    })
print(json.dumps(response.json(), indent=2))
This does not go through our archive files and only works on top of our search index. It is much faster, but it only gives you the tip of the available information.
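If you go that route and only want the formulas, a minimal sketch for pulling them out of the response could look like this (again assuming the matched entries come back under a top-level data key):

results = response.json()
for entry in results.get('data', []):
    material = entry.get('encyclopedia', {}).get('material', {})
    # formula from the top-level metadata, reduced formula from the encyclopedia section
    print(entry.get('formula'), material.get('formula_reduced'))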