LOBSTER parser

This question concerns a potential parser for LOBSTER: http://schmeling.ac.rwth-aachen.de/cohp

LOBSTER is a package that calculates crystal orbital Hamilton populations (COHP). It is not a DFT package, but rather a program that calculates an extra property on top of calculations from other DFT codes: VASP, abinit, and Quantum ESPRESSO.

We would like to have it for our Oasis (as we have a lot of such calculations around), but we would preferably like to get it upstream later, so I want to check early whether this would be an option and what would be needed.

The question is how to actually structure the parser. As far as I can see, there are already some other parsers (like phonopy and elastic) that work on top of other DFT calculations, so it might be acceptable upstream. However, they are slightly different in the sense that they generate the input (displacements, etc.) for DFT calculations and then parse the results later; LOBSTER more or less just reads the wave function and structure and calculates some properties.

I wanted to check whether I have thought correctly about how to populate all the metadata sections (a rough sketch follows the list):

  • system: LOBSTER itself doesn’t have this info (in its specific output and input files), but it technically needs the structure from the underlying DFT calculation as an input, so that structure should be present in the same directory for LOBSTER to work. The question is whether we implement the reading on our own (it should not be that hard, as all the DFT packages supported by LOBSTER are supported by ASE as well) or whether we somehow call the corresponding DFT parser and get it from there.
  • single_configuration_calculation: here would reside all the outputs under x_lobster_* program-specific keys/sections, and section_calculation_to_calculation_refs should also link to the underlying DFT calculation, right?
  • method: also LOBSTER-specific stuff
  • workflow: here I have no idea; probably no workflow at all?
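
To illustrate, here is a rough sketch of the layout I have in mind, with the section classes used as I see them in other current parsers (assuming they live in nomad.datamodel.metainfo.public); the x_lobster_* comments are hypothetical placeholders:

from nomad.datamodel.metainfo.public import (
    section_run, section_system, section_method,
    section_single_configuration_calculation)

def parse(archive):
    run = archive.m_create(section_run)
    run.program_name = 'LOBSTER'

    system = run.m_create(section_system)
    # structure read from the underlying DFT calculation (e.g. CONTCAR),
    # populating atom_labels, atom_positions, lattice_vectors, ...

    method = run.m_create(section_method)
    # LOBSTER-specific settings as x_lobster_* quantities

    scc = run.m_create(section_single_configuration_calculation)
    scc.single_configuration_calculation_to_system_ref = system
    scc.single_configuration_to_calculation_method_ref = method
    # COHP results and other outputs as x_lobster_* quantities/sections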

So in general we don’t touch the general metadata scheme at all; we just populate what we can (the system section), and all the rest will go into the x_lobster_* code-specific sections.

If this is acceptable for upstream, the question is how we should proceed. Should we just make it work a bit, let’s say until it can parse the system, some method details, and a few properties, then upstream as soon as possible and continue working on it there? Or would it be preferable to develop it separately and only upstream when the parser is more complete?

We gladly include contributions for new codes (into our code base and also the central NOMAD) and are happy to assist you in the process.

You are right, there are other examples of codes that use calculations run with other codes. Ideally our processing would only run the LOBSTER parser after all the other calculations have been parsed, allowing you to access the parsed information of the calculations it depends on. Currently, we achieve this by running an additional custom code-specific “normaliser” that is only run after all parsing and normalising is done on everything else. This is hard-coded into the processing (bad), but we plan to refactor this anyway. In any case, it can be done, and there will be a way to augment your parsing results with data and references from and to the other calculations. I moved the relevant issue to GitHub: Workflow engine parsers · Issue #11 · nomad-coe/nomad · GitHub

The metainfo/archive super-structure should be used like this:

  • system: either your own system info or system info taken from one of the depending calcs
  • method: LOBSTER-specific non-system input information, code settings, computational parameters, etc. Everything that describes how calculations interact might be better placed in workflow.
  • single configuration calculation (SCC): your results, e.g. the LOBSTER specific resulting properties
  • workflow: provide references to all calculations and LOBSTER specifics. Workflows can be specialized for calculated properties (e.g. we currently have geometry optimization, phonon, elastic constants, molecular dynamics trajectories). The general idea is that the workflow points to all involved calculations, e.g. VASP, VASP, …, VASP, LOBSTER, and the last and final calculation gets an extra, easy-to-access reference because it usually contains the workflow results, i.e., the wanted property (a rough sketch follows the list).
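
To make the last point concrete, a rough sketch; the names below (section_workflow, workflow_type, calculations_ref, calculation_result_ref) reflect the description above and may differ from the released metainfo:

def link_workflow(archive, dft_sccs, lobster_scc):
    # the attribute and quantity names here are assumptions, not final API
    workflow = archive.section_workflow
    workflow.workflow_type = 'single_point'          # hypothetical type for a LOBSTER run
    # references to all involved calculations, in order:
    workflow.calculations_ref = dft_sccs + [lobster_scc]
    # extra, easy-to-access reference to the final calculation with the results:
    workflow.calculation_result_ref = lobster_scc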

The x_code_ prefix is used to denote niche specifics that are not super relevant for interpreting the results. Of course, it is always debatable and depends on subjective developer decisions. I feel that the final properties that you calculate should be more general. Anyhow, simply start doing what you see as necessary, and we will moderate your metainfo contributions together in due time.

We just had an internal discussion. Putting results of codes like LOBSTER, phonopy, or elastic into single configuration calculations is not ideal and should be deprecated. The results should be placed into the workflow sections. Specific workflows/codes should have their own subsection (e.g. as we do with geometry_optimization, phonons, etc.). Single configuration calculations should only be used for calculations on a single system; the workflow section should be used for data acquired from many single configuration calculations.

I’m not sure if LOBSTER should be treated the same as elastic and phonopy in this regard. It really just works on top of one single configuration calculation, while phonopy and elastic use (generate) multiple. In this regard, LOBSTER is similar to a DFT program submodule that calculates DOS, optics, or some other property, except that it is its own package and supports multiple DFT codes.

LOBSTER more or less needs a converged DFT run in the same folder as an input. Could the parser in theory work as some sort of normalizer that checks, for an existing DFT calculation (of the supported codes), whether some LOBSTER stuff was done on top, and then just augments the single configuration calculations with the corresponding LOBSTER-specific property sections? Or would this be exactly the type of hack that you are now trying to get rid of?

Post-processing codes with a similar workflow would be, for example, BoltzTraP and critic2, so there are definitely more cases like LOBSTER.

Pavel is right in the sense that our current workflows (MD, geometry optimization, phonons) always deal with processing multiple single configuration calculations to arrive at some new properties, whereas LOBSTER looks at the electronic configuration of a single calculation and derives new properties.

I’m sure there are also other tools that do a very similar thing, as Pavel mentioned, so we should definitely come up with a concept for them. I could imagine two choices (a sketch of the first option follows the list):

  1. Same logic as with phonopy: the LOBSTER parser is run as the final step, upon which it fetches the necessary system and methodology from the section_system and section_method of the original calculation. The parser should create a new section_method with the LOBSTER methodology, and it should create a new section_single_configuration_calculation containing the COHP data and the required references. The “workflow” is essentially stored using section_calculation_to_calculation_refs.
  2. We make our workflow metadata more flexible so that it can meaningfully describe these kinds of “post-processing” steps that can add additional data to a single configuration calculation.
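
For option 1, a minimal sketch of the references, assuming the current section_calculation_to_calculation_refs definition and import path (the kind string is a guess):

from nomad.datamodel.metainfo.public import section_calculation_to_calculation_refs

def reference_dft_calculation(scc, dft_scc):
    ref = scc.m_create(section_calculation_to_calculation_refs)
    ref.calculation_to_calculation_kind = 'source_calculation'  # guessed kind label
    ref.calculation_to_calculation_ref = dft_scc  # SCC of the original DFT calculation
    # or, if the DFT entry lives in another archive:
    # ref.calculation_to_calculation_external_url = '...'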

Not sure if I got Lauri correctly, but we cannot break the mainfile-entry relationship that is created via parsing. If there is a VASP and a LOBSTER file, we will have two entries. I don’t think that one parser should modify a foreign entry that was created by another parser. @laurih / @ladinesa is phonopy really doing this?

I think we should still treat this as a workflow. It is still two steps. If we are even treating “static” geometry optimizations (1 calc) as a workflow, then this is also a workflow.

Even if you want to treat LOBSTER as an SCC, we still need (see the GitHub issue):

  • a way to reference other entries (reference the converged VASP SCC from a LOBSTER workflow or SCC)
  • a way to make sure that (some part of) the LOBSTER parser is run after the VASP entry has been processed

Sorry, maybe my wording was confusing: the LOBSTER parser should only have read access to the referenced calculation archive. It will of course create its own archive and write all the information there.

I think the properties calculated by LOBSTER should go into an SCC, since there is nothing that really distinguishes them from e.g. a DOS. Also, there may be some DFT code that internally produces the same quantities as LOBSTER, in which case they would go into an SCC as well.

The information about the workflow logic could then go into section_workflow. We should maybe then also reconsider how we handle e.g. GW and other perturbation calculations: for now they do not produce a section_workflow either. This is all pretty close to what we did with MD/geometry optimization: we used to store the workflow information inside section_run (section_frame_sequence, section_sampling_method), which we are now instead storing in section_workflow. Now we should perhaps try to do the same for all calculations that build on top of existing SCCs (the SCC can be stored in the same archive or referenced from another archive).

OK guys, I’m fine with whatever you decide.

I’m working now on getting some basic parsing working (right now still using the scheme from the second comment, though with no workflow yet); then I’ll push to my public repo on GitHub and we can discuss further steps there.

I think LOBSTER is still the same as elastic, phonopy, or FHI-vibes. It is just that it is a workflow engine for a single-point calculation. It should have its own parser, and hence it should produce a section_run which is independent from the underlying code.

I have a simple lobster parser in GitHub - ondracka/nomad_parser_lobster

It can parse some basic settings and a few quantities, but I’m not so sure about some things. I would be grateful for a review, so I can get the architecture right before I add more features. There is no integration with nomad-FAIR yet, and no testing related to it has been done.

General questions and issues:

  • How do you want to discuss this: here, or on GitHub?
  • Right now I put calculated quantities into section_run/section_scc; this is where the same quantities belong for DFT codes, and even though LOBSTER is not a DFT code, it calculates DOS, partial DOS, and Mulliken charges, which all belong there. Is this OK?
  • How to parse the structure: there is no structure information in the LOBSTER output itself, but it needs CONTCAR, OUTCAR, and vasprun.xml (and others) as input. Right now I’m parsing it myself from CONTCAR and populating section_system (a rough sketch follows this list), but I’m open to other options (like somehow linking to the VASP calculation parsed by the VASP parser), as long as the LOBSTER data is searchable by composition and structure.
  • If structure parsing is to be integrated into the LOBSTER parser, support might be needed for the other codes supported by LOBSTER (abinit, Quantum ESPRESSO); however, I have no experience there.
  • I have no idea how to handle the workflow stuff; was there actually some consensus on how I should approach this?
  • Tests are failing; they work locally with pytest, so this must be some Python/library version nuance. It looks related to datetime?
  • I’m not sure about the correct handling of units. I’ve looked at some parsers and the correct way seems to be from nomad.units import ureg as units; x = y * units.angstrom. However, I have no idea how to make this work with numpy arrays and also in the tests, so I’m possibly using some hacks there.
  • I have a DOSCAR.lobster which is similar to the VASP DOSCAR; should I just copy the code, or is there some way to share code between parsers?
  • pytest.approx is not working (always passing) for too-small values (of the order of the elementary charge).
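
For the structure question above, a minimal sketch of the ase-based fallback I am using; the section_system import path and quantity names follow my reading of other parsers and may be off:

from ase.io import read
from nomad.units import ureg
from nomad.datamodel.metainfo.public import section_system

def parse_system(run, contcar_path):
    # fallback: read the structure from the VASP CONTCAR next to the LOBSTER output
    atoms = read(contcar_path, format='vasp')
    system = run.m_create(section_system)
    system.atom_labels = atoms.get_chemical_symbols()
    system.atom_positions = atoms.get_positions() * ureg.angstrom
    system.lattice_vectors = atoms.get_cell().array * ureg.angstrom
    system.configuration_periodic_dimensions = atoms.get_pbc()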

Most of the problems are marked in the code by FIXME comments.

First of all, thank you for the work and contribution.

  • I think it is fine to have the broader discussion here.
  • Yes, using the existing structure is fine at the moment. In principle, section_dos and similar can also be used in other places (e.g. workflows), but we are still discussing the overall workflow design.
  • I think having a fallback structure parser (especially in the beginning) is good. Once we have figured out how to do more complex processing (Workflow engine parsers · Issue #11 · nomad-coe/nomad · GitHub) that can depend on other entries, you can start accessing the archives of the underlying calculations.
  • Using NOMAD’s unit registry is right. You should be able to simply multiply numpy arrays with units as well (see the sketch after this list). If the quantity you set has a unit attached, it will automatically assume this unit.
  • To build/test parsers you have to use at least nomad-lab[parsing]; the VASP code is part of this. Maybe you can use the respective functions as is (e.g. from vaspparser import ...); maybe we need to expose them better for outside use? You can open a PR on the VASP parser project with suggestions.
  • You can pass abs and rel keyword arguments to approx to set the tolerance. The default is a relative tolerance of 1e-6.
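
Regarding units and numpy: pint quantities wrap whole arrays directly, so something like this should work (the energy_reference_fermi assignment in the comment is just an illustrative example):

import numpy as np
from nomad.units import ureg

energies = np.linspace(-10.0, 5.0, 301) * ureg.eV  # pint wraps the whole array
print(energies.to(ureg.joule).magnitude[:3])       # back to a plain numpy array
# assigning such a quantity to a metainfo quantity converts it to the declared
# unit, e.g. scc.energy_reference_fermi = np.array([0.0]) * ureg.eV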

We really have to start working on the workflow and depending-on-other-calculations topic.

@ladinesa Can you have a look at the parser and help Pavel with his #FIXME issues? I guess Alvin could fork the repo and use pull requests to suggest fixes?

How should we proceed? It is your call if/when you want to have the parser linked in our official builds. Of course, it is also your decision whether you keep the parser under your account or want to contribute it to the nomad-coe organization.

Thanks, I think that as soon as I fix the test issues and the unit issues, it might be best to just move the parser to the nomad-coe organization for better long-term support. (I can’t really say how long I will contribute; I’m now tasked with getting our internal Oasis working for our usual stuff, and once that’s done I can’t say whether I will keep an eye on the parser or not.)

I think I have mostly figured out the units; it’s just that for some reason the default approx behavior seems to be an absolute tolerance of 1e-12, not the relative 1e-6?

I must be doing something stupid…

print(pytest.approx(1))
print(pytest.approx(1e-3))
print(pytest.approx(1e-5))
print(pytest.approx(1e-6))
print(pytest.approx(1e-9))
print(pytest.approx(1e-12))
print(pytest.approx(1e-15))

1 ± 1.0e-06
0.001 ± 1.0e-09
1e-05 ± 1.0e-11
1e-06 ± 1.0e-12
1e-09 ± 1.0e-12
1e-12 ± 1.0e-12
1e-15 ± 1.0e-12

print(pytest.approx(1, rel=1e-6))
print(pytest.approx(1e-3, rel=1e-6))
print(pytest.approx(1e-5, rel=1e-6))
print(pytest.approx(1e-6, rel=1e-6))
print(pytest.approx(1e-9, rel=1e-6))
print(pytest.approx(1e-12, rel=1e-6))
print(pytest.approx(1e-15, rel=1e-6))

1 ± 1.0e-06
0.001 ± 1.0e-09
1e-05 ± 1.0e-11
1e-06 ± 1.0e-12
1e-09 ± 1.0e-12
1e-12 ± 1.0e-12
1e-15 ± 1.0e-12

In order to understand the pytest.approx behaviour you really have to read the documentation: API Reference — pytest documentation

Essentially (a short demonstration follows this list):

  • By default, approx considers numbers within a relative tolerance of 1e-6 (i.e. one part in a million) of its expected value to be equal. This treatment would lead to surprising results if the expected value was 0.0, because nothing but 0.0 itself is relatively close to 0.0. To handle this case less surprisingly, approx also considers numbers within an absolute tolerance of 1e-12 of its expected value to be equal.
  • If you specify abs but not rel, the comparison will not consider the relative tolerance at all.
  • If you specify both abs and rel, the numbers will be considered equal if either tolerance is met.
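
The practical consequence for very small expected values can be demonstrated directly:

import pytest

# with the defaults, the absolute tolerance of 1e-12 dominates for tiny values:
print(2e-15 == pytest.approx(1e-15))         # True: |2e-15 - 1e-15| < 1e-12
# disabling the absolute check makes the comparison strictly relative:
print(2e-15 == pytest.approx(1e-15, abs=0))  # False: 100% relative error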

I did read it, but the question was more like how I should fix this. If I read the docs correctly, one can’t really use approx for small numbers, because then the default 1e-12 abs check is used.

However, current parsers have code like this (from the VASP parser):

sec_sccs[2].energy_reference_lowest_unoccupied[0].magnitude == pytest.approx(7.93718304e-19)

so am I missing something, or is this just plain broken?

I could of course do:

def nomad_approx(value):
    return pytest.approx(value, abs=value / 1e6)

and use it; would this be OK?

Ah, I see. That test is definitely broken and should be fixed. In order to test smaller numbers, you need to override the default abs value; probably setting it to zero is the easiest choice. Thanks for bringing this up!

@ladinesa: The pytest.approx tests need to be checked in all of the parsers for this issue.

OK, thanks for the tip with zero, so I guess:

def nomad_approx(value):
    return pytest.approx(value, abs=0, rel=1e-6)

is the more elegant solution (one just needs to use the standard approx when comparing with zero).
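
For illustration, in a test this would look like (the values are made up):

charge = 1.602176634e-19                      # hypothetical charge-like value in SI units
assert charge == nomad_approx(1.6021766e-19)  # the relative check now actually applies
assert 0.0 == pytest.approx(0.0)              # plain approx when the expected value is zero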

One thing that doesn’t work right now is that the LOBSTER-specific metainfo is not recognized/loaded by NOMAD. Any ideas how to solve this?

I’ll try to rephrase my last question: what is needed to properly integrate the parser into NOMAD? My integration commit looks like this right now:

commit dc2bbb144e20335c4b907d53120df96d4e8236da (HEAD -> master)
Author: Pavel Ondracka <[email protected]>
Date:   Fri Apr 30 12:00:28 2021 +0000

    LOBSTER parser

diff --git a/.gitmodules b/.gitmodules
index aa9080e3..5cb8c279 100644
--- a/.gitmodules
+++ b/.gitmodules
@@ -205,3 +205,6 @@
 [submodule "dependencies/parsers/fhi-vibes"]
        path = dependencies/parsers/fhi-vibes
        url = https://github.com/nomad-coe/nomad-parser-fhi-vibes.git
+[submodule "dependencies/parsers/lobster"]
+       path = dependencies/parsers/lobster
+       url = https://github.com/ondracka/nomad_parser_lobster
diff --git a/dependencies/parsers/lobster b/dependencies/parsers/lobster
new file mode 160000
index 00000000..aae6ec7a
--- /dev/null
+++ b/dependencies/parsers/lobster
@@ -0,0 +1 @@
+Subproject commit aae6ec7a455776fde4301b0214941faf94c2a006
diff --git a/gui/src/parserMetadata.json b/gui/src/parserMetadata.json
index eda6feb5..229a2899 100644
--- a/gui/src/parserMetadata.json
+++ b/gui/src/parserMetadata.json
@@ -258,6 +258,14 @@
     "parserSpecific": "",
     "tableOfFiles": ""
   },
+  "LOBSTER": {
+    "codeLabel": "LOBSTER",
+    "codeLabelStyle": "All in capitals",
+    "codeUrl": "http://schmeling.ac.rwth-aachen.de/cohp/",
+    "parserDirName": "dependencies/parsers/lobster/",
+    "parserGitUrl": "https://github.com/nomad-coe/nomad-parser-lobster",
+    "tableOfFiles": "|Input Filename| Description|\n|--- | --- |\n|`lobsterout` | **Mainfile** in LOBSTER specific plain-text |"
+  },
   "MOLCAS": {
     "codeLabel": "Molcas",
     "codeLabelStyle": "Capitals: M; also seen all in capitals",
diff --git a/nomad/datamodel/material.py b/nomad/datamodel/material.py
index f9fb8d4d..5eaf83ee 100644
--- a/nomad/datamodel/material.py
+++ b/nomad/datamodel/material.py
@@ -174,7 +174,7 @@ class Method(MSection):
         """
     )
     program_name = Quantity(
-        type=MEnum("ABINIT", "Amber", "ASAP", "ATK", "BAND", "BigDFT", "CASTEP", "Charmm", "CP2K", "CPMD", "Crystal", "DFTB+", "DL_POLY", "DMol3", "elastic", "elk", "exciting", "FHI-aims", "fleur", "fplo", "GAMESS", "Gaussian", "GPAW", "Gromacs", "Gromos", "gulp", "LAMMPS", "libAtoms", "MOLCAS", "MOPAC", "Namd", "NWChem", "Octopus", "ONETEP", "OpenKIM", "ORCA", "Phonopy", "qbox", "Quantum Espresso", "Siesta", "TINKER", "turbomole", "VASP", "WIEN2k"),
+        type=MEnum("ABINIT", "Amber", "ASAP", "ATK", "BAND", "BigDFT", "CASTEP", "Charmm", "CP2K", "CPMD", "Crystal", "DFTB+", "DL_POLY", "DMol3", "elastic", "elk", "exciting", "FHI-aims", "fleur", "fplo", "GAMESS", "Gaussian", "GPAW", "Gromacs", "Gromos", "gulp", "LAMMPS", "libAtoms", "MOLCAS", "MOPAC", "Namd", "NWChem", "Octopus", "ONETEP", "OpenKIM", "ORCA", "Phonopy", "qbox", "Quantum Espresso", "Siesta", "TINKER", "turbomole", "VASP", "WIEN2k", "LOBSTER"),
         a_search=Search(),
         description="""
         Name of the program used for this calculation.
diff --git a/nomad/parsing/parsers.py b/nomad/parsing/parsers.py
index 1b9dc1c3..1cca6161 100644
--- a/nomad/parsing/parsers.py
+++ b/nomad/parsing/parsers.py
@@ -47,6 +47,7 @@ from turbomoleparser import TurbomoleParser
 from castepparser import CastepParser
 from wien2kparser import Wien2kParser
 from nwchemparser import NWChemParser
+from lobsterparser import LobsterParser
 
 try:
     # these packages are not available without parsing extra, which is ok, if the
@@ -143,6 +144,7 @@ parsers = [
     ExcitingParser(),
     FHIAimsParser(),
     FHIVibesParser(),
+    LobsterParser(),
     CP2KParser(),
     CrystalParser(),
     # The main contents regex of CPMD was causing a catostrophic backtracking issue

This is enough to make the parsing work; however, the LOBSTER-specific metainfo is not recognized, i.e., when I check the “code specific” checkbox in the archive viewer, I don’t see any. Am I missing some integration step in NOMAD, or do I have an error in my metainfo definition on the LOBSTER side? The LOBSTER-specific metainfo is there: if I download the archive, I can see it in the text file; it’s just the GUI that doesn’t show it.