You are doing the right things and have already uncovered a lot!
I assume that NOMAD is sending some invalid JSON for one (or some) of the entries. That wouldn't be so bad if ArchiveQuery handled it more gracefully. We will try to investigate this further and see what we can do.
In the meantime, you can try to go beyond ArchiveQuery. The ArchiveQuery is very specialised: it tries to parallelise multiple API calls for the specific purpose of downloading archive information, but it has its bugs and limitations. If you run into limitations with ArchiveQuery, you can access the NOMAD APIs more directly. For example with:
import requests
import json

# query VASP entries that do not contain any of the listed elements; only the
# last calculation and the last system of each entry are requested from the archive
response = requests.post(
    'http://nomad-lab.eu/prod/rae/api/v1/entries/archive/query', json={
        'query': {
            'and': [
                {
                    'dft.code_name': 'VASP',
                },
                {
                    'not': {
                        'atoms': {
                            'any': ["H", "C", "Li", "Na", "K", "Rb", "Cs"]
                        }
                    }
                }
            ]
        },
        'pagination': {
            'page_size': 10,
            'page_after_value': '----9KNOtIZc9bDFEWxgjeSRsJrC'
        },
        'required': {
            'section_run': {
                'section_single_configuration_calculation[-1]': {
                    'energy_total': '*'
                },
                'section_system[-1]': {
                    'chemical_composition_bulk_reduced': '*'
                }
            }
        }
    })
print(json.dumps(response.json(), indent=2))
This uses our new v1 API and might work more reliably. But here you have to paginate yourself. This example will give you page_size=10 results. The results will contain a next_page_after_value, which you need to populate page_after_value with in your next request. With a loop like this you can download lots of data. You can increase page_size, but you have to be careful not to run into timeouts if the requests become too large.
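A rough sketch of such a loop could look like the following. I am writing this from memory, so the exact place of the entries in the response (I assume a top-level data list) might need adjusting; print the raw JSON of a single request first if in doubt:

import requests

base_url = 'http://nomad-lab.eu/prod/rae/api/v1/entries/archive/query'

query = {
    'and': [
        {'dft.code_name': 'VASP'},
        {'not': {'atoms': {'any': ["H", "C", "Li", "Na", "K", "Rb", "Cs"]}}}
    ]
}
required = {
    'section_run': {
        'section_single_configuration_calculation[-1]': {'energy_total': '*'},
        'section_system[-1]': {'chemical_composition_bulk_reduced': '*'}
    }
}

page_after_value = None
while True:
    pagination = {'page_size': 100}
    if page_after_value is not None:
        pagination['page_after_value'] = page_after_value

    response = requests.post(base_url, json={
        'query': query,
        'pagination': pagination,
        'required': required
    })
    page = response.json()

    # do something with the entries of this page; the per-entry layout is an
    # assumption here, inspect one raw response to see what it actually contains
    for entry in page.get('data', []):
        print(entry)

    # the cursor for the next page; stop when there is none
    page_after_value = page.get('pagination', {}).get('next_page_after_value')
    if page_after_value is None:
        break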
Another thing: you can use section_single_configuration_calculation[-1] to only get the last instance of a repeated section. Many entries contain a lot of systems and calculations (with VASP those usually form a geometry optimisation).
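To illustrate the difference, compare these two required specs (the first is the one used in the request above; as far as I recall, leaving out the index returns every repetition of the section):

# only the last calculation of each run (as in the request above)
required_last_only = {
    'section_run': {
        'section_single_configuration_calculation[-1]': {'energy_total': '*'}
    }
}

# without the index every calculation should be included, which can be a lot
# of data for long geometry optimisations
required_all = {
    'section_run': {
        'section_single_configuration_calculation': {'energy_total': '*'}
    }
}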
If you are only interested in the formulas and don't care about energies, etc., you can also skip the archive and just access the basic metadata (formulas are part of this, energies are not).
import requests
import json

# same query as above, but against the plain entries endpoint; only a few
# metadata fields are requested via 'include'
response = requests.post(
    'http://nomad-lab.eu/prod/rae/api/v1/entries/query', json={
        'query': {
            'and': [
                {
                    'dft.code_name': 'VASP',
                },
                {
                    'not': {
                        'atoms': {
                            'any': ["H", "C", "Li", "Na", "K", "Rb", "Cs"]
                        }
                    }
                }
            ]
        },
        'required': {
            'include': [
                'formula',
                'encyclopedia.material.formula',
                'encyclopedia.material.formula_reduced'
            ]
        },
        'pagination': {
            'page_size': 10,
            'page_after_value': '----9KNOtIZc9bDFEWxgjeSRsJrC'
        }
    })
print(json.dumps(response.json(), indent=2))
This does not go through our archive files and only works on top of our search index. It is much faster, but it only gives you the tip of the available information.
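If you go that route and only want the formulas, a minimal sketch for pulling them out of the response could look like this (again assuming the matched entries come back under a top-level data key):

results = response.json()
for entry in results.get('data', []):
    material = entry.get('encyclopedia', {}).get('material', {})
    # formula from the top-level metadata, reduced formula from the encyclopedia section
    print(entry.get('formula'), material.get('formula_reduced'))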