AttributeError: 'NoneType' object has no attribute 'encyclopedia'

What I have been doing

Hi!
I am trying to download part of the DB information through the API via the nomad python package: namely, some basic pieces of information for every bulk material present, using the following code:

from nomad.client import ArchiveQuery

query = ArchiveQuery(
    query={
        'domain': 'dft',
        'dft.system': 'bulk',
    },
    required={
        'section_run': {
            'section_system': {
                'atom_species': '*'
            }
        },
        'section_metadata': {
            'encyclopedia': {
                'material': {
                    'formula': '*',
                    'bulk': {
                        'bravais_lattice': '*',
                        'crystal_system': '*',
                        'point_group': '*',
                        'space_group_number': '*',
                        'space_group_international_short_symbol': '*',
                        'structure_prototype': '*',
                        'structure_type': '*',
                    },
                    'idealized_structure': {
                        'atom_labels': '*',
                        'atom_positions': '*',
                        'number_of_atoms': '*',
                        'cell_volume': '*',
                        'lattice_parameters': {
                            'a': '*',
                            'b': '*',
                            'c': '*',
                            'alpha': '*',
                            'beta': '*',
                            'gamma': '*',
                        },
                    },
                },
                'properties': {
                    'atomic_density': '*',
                    'mass_density': '*',
                    'energies': '*',
                },
                'method': {
                    'functional_type': '*',
                    'functional_long_name': '*',
                },
            },
        },
    },
    per_page=100,
    max=None,
)

Printing the query reports 935760 queried entries. (Q1)
Now, in order to double-check the number of effectively fetched materials, I appended the following simple code to the previous one:

l = []
for result in query:
    formula = result.section_metadata.encyclopedia.material.formula
    l.append(formula)
    print(len(l))

expecting the final printed value to correspond to the number of queried entries. (Q2)
This was not the case, since the script encountered the error reported in the title (AttributeError: 'NoneType' object has no attribute 'encyclopedia') (Q3).

Questions

Q1: Why is the number of queried entries so "low"? I guess it is because of the numerous constraints I put in the required attribute of ArchiveQuery, but I could not verify this with the NOMAD GUI search, since I could not find a way to express such an attribute in the query.

Q2: Is the number of queried entries equal to the number of material phases present in the DB? I expect the list l to also contain duplicates of the same formula, corresponding to different phases: is this correct?

Q3: It appears that ArchiveQuery fetches objects without the attribute 'encyclopedia': why is that? Do some materials have an encyclopedia entry that is not completed yet, or is it something deeper?

Q4: I am planning to save the fetched info locally in two structures:

  • a simple dataframe containing all the "one-dimensional" pieces of info, like formula, point group, total energy, lattice vectors and angles, etc.
  • a python dictionary with keys of the form formula_crystal_system (hopefully unique for every DB entry and easy for the user to identify) and values that are pandas dataframes structured as follows:

atom    x y z
A       (atom position from the idealized structure, i.e. in multiples of the lattice vectors)
A       (same)
B       (same)
C       (same)

so as to allow an arbitrary number of atoms per key, and hence flexibility in storing the data (a minimal sketch is below). Does that sound reasonable, or would you suggest anything else? The final application for this will be training generative ML algorithms like GANs.
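Concretely, something like this sketch is what I have in mind (pandas-based; the results variable and the overall flow are placeholders for the query code above, not actual NOMAD identifiers):

import pandas as pd

summary_rows = []   # one row of scalar ("one-dimensional") info per entry
structures = {}     # formula_crystal_system -> dataframe of atom positions

# 'results' stands for the entries already fetched by the ArchiveQuery above
for result in results:
    mat = result.section_metadata.encyclopedia.material
    key = f'{mat.formula}_{mat.bulk.crystal_system}'
    summary_rows.append({
        'formula': mat.formula,
        'point_group': mat.bulk.point_group,
        'space_group_number': mat.bulk.space_group_number,
    })
    ideal = mat.idealized_structure
    structures[key] = pd.DataFrame(
        list(ideal.atom_positions), columns=['x', 'y', 'z'],
        index=list(ideal.atom_labels),
    ).rename_axis('atom')

summary = pd.DataFrame(summary_rows)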

Thanks a lot,
Antonio

Q1: Yes, you are right. The search is restricted by the quantities contained in required. If you use the GUI, you can type quantities=structure_prototype into the search bar. It's not perfect, but it might give you an idea. ~1M really sounds very low; please experiment to see which required quantity is causing this.

Q2: For most DFT data, "entry" refers to a single code run. We produce only one encyclopedia section per entry, typically based on the "last" calculation of the code run. If the code run performed a geometry optimization, for example, the encyclopedia will reflect the relaxed structure. The material_id in encyclopedia is a hash based on elements and crystal symmetry, so there can be entries that share the same material_id (12M entries vs. 3M materials).
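If you want to count distinct materials rather than entries, you can collect that hash; a sketch (it assumes you also add material_id to the required encyclopedia.material section of your query):

material_ids = set()
for result in query:
    enc = result.section_metadata.encyclopedia
    if enc is not None and enc.material is not None:
        material_ids.add(enc.material.material_id)

print(len(material_ids), 'distinct materials among the fetched entries')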

Q3: I am not sure what you are referring to. Practically all entries should have an encyclopedia section; "only" a couple of tens of thousands are missing one, probably due to processing problems.

Q4: It makes sense to cache the results; you'll see that it can take quite some time to collect all the data. I would also suggest restricting to certain codes (e.g. dft.code_name: 'VASP'). You might also break your queries down and loop through all elements or something similar, as sketched below.
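For the element loop, something along these lines could work (a sketch; I use the atoms search quantity here, and the small required dict is just an example, so substitute your full one):

from nomad.client import ArchiveQuery

for element in ['Si', 'O', 'Fe']:  # placeholder list, extend as needed
    q = ArchiveQuery(
        query={
            'domain': 'dft',
            'dft.system': 'bulk',
            'dft.code_name': 'VASP',
            'atoms': [element],
        },
        # example required dict; reuse the full one from your query above
        required={'section_metadata': {'encyclopedia': {'material': {'formula': '*'}}}},
        per_page=100,
        max=None,
    )
    for result in q:
        ...  # process and cache as before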

More tips

The ArchiveQuery is very specialised: it tries to parallelise multiple API calls for the specific purpose of downloading archive information, but it has its bugs and limitations. It also seems that you are only interested in encyclopedia data (see the last example below)?

If you run into limitations with ArchiveQuery, you can access NOMAD APIs more directly. For example with:

import requests
import json

response = requests.post(
    'http://nomad-lab.eu/prod/rae/api/v1/entries/archive/query', json={
        # all VASP entries that do not contain any of the listed elements
        'query': {
            'and': [
                {
                    'dft.code_name': 'VASP',
                },
                {
                    'not': {
                        'atoms': {
                            'any': ["H", "C", "Li", "Na", "K", "Rb", "Cs"]
                        }
                    }
                }
            ]
        },
        'pagination': {
            'page_size': 10,
            # example cursor taken from a previous response's next_page_after_value
            'page_after_value': '----9KNOtIZc9bDFEWxgjeSRsJrC'
        },
        # only download these archive quantities; [-1] selects the last repetition
        'required': {
            'section_run': {
                'section_single_configuration_calculation[-1]': {
                    'energy_total': '*'
                },
                'section_system[-1]': {
                    'chemical_composition_bulk_reduced': '*'
                }
            }
        }
    })

print(json.dumps(response.json(), indent=2))

This uses our new v1 API and might work more reliably, but here you have to paginate yourself. This example will give you page_size=10 results. The response will contain a next_page_after_value, which you need to put into page_after_value in your next request. With a loop like this you can download lots of data. You can increase page_size, but you have to be careful not to run into timeouts if the requests become too large.
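Such a loop could look like this (a sketch reusing a simplified version of the request above; it assumes the response carries the results under data and the cursor under pagination.next_page_after_value, as just described):

import requests

url = 'http://nomad-lab.eu/prod/rae/api/v1/entries/archive/query'
body = {
    'query': {'dft.code_name': 'VASP'},
    'pagination': {'page_size': 100},
    'required': {
        'section_run': {
            'section_system[-1]': {'chemical_composition_bulk_reduced': '*'}
        }
    }
}

while True:
    data = requests.post(url, json=body).json()
    for result in data['data']:
        ...  # process each archive result here
    next_value = data['pagination'].get('next_page_after_value')
    if next_value is None:
        break  # no more pages
    body['pagination']['page_after_value'] = next_value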

Another thing: you can use section_system[-1] to only get the last instance of a section. Many entries contain a lot of systems and calculations (with VASP those usually form a geometry optimisation).

If you are only interested in the encyclopedia section, you can also skip the archive and just access the basic metadata (formulas are part of this, energies are not).

import requests
import json

response = requests.post(
    'http://nomad-lab.eu/prod/rae/api/v1/entries/query', json={
        'query': {
            'dft.code_name': 'VASP',
        },
        'required': {
            'include': [
                'encyclopedia.*',
            ]
        }
    })

print(json.dumps(response.json(), indent=2))

This does not go through our archive files and only works on top of our search index. Muuuuch faster!

Thank you for your help!
Q3) I solved the issue by simply using a try/except pass inside the for result in query: loop. It ends up skipping a reasonably small number of entries.
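i.e. something like:

l = []
for result in query:
    try:
        formula = result.section_metadata.encyclopedia.material.formula
        l.append(formula)
    except AttributeError:
        pass  # skip entries without an encyclopedia section
print(len(l))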
More tips) This question may be trivial, but I am not sure how to paginate a requests.post request properly. What other values should I add, together with page_size, so that the request is not limited to a single page but instead fetches all the data matched by the query section?