How to start re-processing of undetected entries

Now that I have the stub lobster parser, I would like to go over all the uploads in our oasis, find and process all lobster calculations to get more test cases, and check what needs fixing. However, nomad admin uploads re-process does not do what I need. It re-processes all already detected entries, but it does not seem to do anything for the previously undetected files. If I understand the file scheme correctly, it reprocesses just the files in “.volumes/fs/public/upload_id/raw-public.plain.zip” but not those in “.volumes/fs/public/upload_id/raw-restricted.plain.zip”, which is where the files from directories with no previously detected mainfile reside. So how can I trigger a full reprocess, including a new mainfile detection?

That use case is currently not covered, but it is not difficult to add. I added an additional option reprocess_match (top-level in your nomad.yaml). If set to true, each reprocess will run parser matching on all files and add new entries for matched mainfiles that do not yet exist. After this, all entries are (re)processed as normal.
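Just to illustrate the intended flow (a sketch only, not the actual implementation; match_mainfiles is the name of the real matching method, while has_entry, create_entry and re_process are hypothetical helpers used purely for illustration):

    def re_process_upload(upload, reprocess_match: bool):
        # Sketch only, not the actual NOMAD implementation.
        if reprocess_match:  # value of the new top-level nomad.yaml option
            for filename, parser in upload.match_mainfiles():  # real method name
                if not upload.has_entry(filename):             # hypothetical helper
                    upload.create_entry(filename, parser)      # hypothetical helper
        for entry in upload.entries:  # afterwards, all entries are (re)processed as before
            entry.re_process()        # hypothetical helper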

I’ll give you a ping when this change is merged into v0.10.4 later today.

Thank you; for some reason I assumed this was already supported. I probably misunderstood the comment here: Archiving whole uploads - #4 by mscheidgen

Not your fault, I was just writing untrue things. I assumed it would already do a rematch, but it did not.

Anyhow, it’s in the v0.10.4 branch now. You have to enable it, though, with nomad.yaml::reprocess_match: true.

I recompiled the latest v0.10.4 and added reprocess_match: true to nomad.yaml, but it doesn’t seem to work. Could it be related to this part of the logs?

ERROR    nomad.processing     2021-05-03T10:27:04 task failed with exception
  - exception: Traceback (most recent call last):
      File "/app/nomad/processing/base.py", line 600, in proc_task
        deleted = func(self)
      File "/app/nomad/processing/data.py", line 1107, in re_process_upload
        raise e
      File "/app/nomad/processing/data.py", line 1064, in re_process_upload
        for filename, parser in self.match_mainfiles():
      File "/app/nomad/processing/data.py", line 1243, in match_mainfiles
        upload_files = self.staging_upload_files
      File "/app/nomad/processing/data.py", line 1175, in staging_upload_files
        assert not self.published
    AssertionError

This is the only error I see right now, but I can turn on more logging if needed.

I was lazy and only tested staging uploads and got an instant penalty for it. I am sorry that you had to be involved.

I already merged a fix to v0.10.4. I hope it works now.

ERROR    nomad.processing     2021-05-03T12:58:42 task failed with exception
  - exception: Traceback (most recent call last):
      File "/app/nomad/processing/base.py", line 600, in proc_task
        deleted = func(self)
      File "/app/nomad/processing/data.py", line 1107, in re_process_upload
        raise e
      File "/app/nomad/processing/data.py", line 1064, in re_process_upload
        for filename, parser in self.match_mainfiles():
      File "/app/nomad/processing/data.py", line 1245, in match_mainfiles
        self._preprocess_files(filename)
      File "/app/nomad/processing/data.py", line 1213, in _preprocess_files
        with open(self.staging_upload_files.raw_file_object(path).os_path, 'rb') as orig_f:
      File "/app/nomad/processing/data.py", line 1175, in staging_upload_files
        assert not self.published
    AssertionError

It looks like it just fails a bit later now?

This time, the test case did not include a POTCAR (which is what is causing this). I now tested with a couple of VASP calculations. I hope this was the last incident. The fix is merged into v0.10.4, but the CI/CD is still running.

It doesn’t crash anymore, but I still don’t see any new entries after re-processing.
The log is full of:

.....
sh: 2: cannot open .volumes/fs/staging/8X/8XBfvqjXSf60VuoJ03_QEw/raw/PdAlYAu/shear: No such file
sh: 2: cannot open .volumes/fs/staging/8X/8XBfvqjXSf60VuoJ03_QEw/raw/PdAlYAu/shear: No such file
sh: 2: cannot open .volumes/fs/staging/8X/8XBfvqjXSf60VuoJ03_QEw/raw/PdAlYAu/shear: No such file
sh: 2: Syntax error: "(" unexpected
.....
sh: 2: cannot open .volumes/fs/staging/8X/8XBfvqjXSf60VuoJ03_QEw/raw/PdAlYNi/large: No such file
sh: 2: cannot open .volumes/fs/staging/8X/8XBfvqjXSf60VuoJ03_QEw/raw/PdAlYNi/large: No such file
sh: 2: cannot open .volumes/fs/staging/8X/8XBfvqjXSf60VuoJ03_QEw/raw/PdAlYNi/large: No such file
sh: 2: Syntax error: "(" unexpected
sh: 2: Syntax error: "(" unexpected
sh: 2: Syntax error: "(" unexpected
sh: 2: Syntax error: "(" unexpected
sh: 2: Syntax error: "(" unexpected
.....

especially the sh: 2: Syntax error: "(" unexpected line repeats a lot.

Hm, this is strange. At the end of the logs I see:

WARNING  nomad.processing     2021-05-03T14:20:18 Unable to parse structure info, no CONTCAR detected
  - nomad.calc_id: moHi0lZPbNjzSxs0HoWk46YrDs2u
  - nomad.commit: 
  - nomad.deployment: standard
  - nomad.mainfile: PdAlYCo/lobster/PBE/lobsterout
  - nomad.processing.logger: nomad.processing
  - nomad.processing.parser: parsers/lobster
  - nomad.processing.proc: Calc
  - nomad.processing.process: re_process_calc
  - nomad.processing.process_status: RUNNING
  - nomad.processing.step: parsers/lobster
  - nomad.processing.task: parsing
  - nomad.processing.tasks_status: RUNNING
  - nomad.release: devel
  - nomad.service: unknown nomad service
  - nomad.upload_id: 8XBfvqjXSf60VuoJ03_QEw
  - nomad.version: 0.10.4
ERROR    nomad.processing     2021-05-03T14:20:18 no "representative" section system found
  - nomad.calc_id: moHi0lZPbNjzSxs0HoWk46YrDs2u
  - nomad.commit: 
  - nomad.deployment: standard
  - nomad.mainfile: PdAlYCo/lobster/PBE/lobsterout
  - nomad.processing.logger: nomad.processing
  - nomad.processing.normalizer: SystemNormalizer
  - nomad.processing.proc: Calc
  - nomad.processing.process: re_process_calc
  - nomad.processing.process_status: RUNNING
  - nomad.processing.step: SystemNormalizer
  - nomad.processing.task: normalizing
  - nomad.processing.tasks_status: RUNNING
  - nomad.release: devel
  - nomad.service: unknown nomad service
  - nomad.upload_id: 8XBfvqjXSf60VuoJ03_QEw
  - nomad.version: 0.10.4
ERROR    nomad.processing     2021-05-03T14:20:18 no "representative" section system found
  - nomad.calc_id: moHi0lZPbNjzSxs0HoWk46YrDs2u
  - nomad.commit: 
  - nomad.deployment: standard
  - nomad.mainfile: PdAlYCo/lobster/PBE/lobsterout
  - nomad.processing.logger: nomad.processing
  - nomad.processing.normalizer: OptimadeNormalizer
  - nomad.processing.proc: Calc
  - nomad.processing.process: re_process_calc
  - nomad.processing.process_status: RUNNING
  - nomad.processing.step: OptimadeNormalizer
  - nomad.processing.task: normalizing
  - nomad.processing.tasks_status: RUNNING
  - nomad.release: devel
  - nomad.service: unknown nomad service
  - nomad.upload_id: 8XBfvqjXSf60VuoJ03_QEw
  - nomad.version: 0.10.4

The warning “Unable to parse structure info, no CONTCAR detected” is from the lobster parser: it fails to read the structure (there is no CONTCAR), and the later normalizers of course complain. So some parsing is being done on the newly detected entries. But if I check the GUI and search the upload, the number of entries is the same as before, and there is nothing with code_name=LOBSTER in the upload after the re-processing. (I also expect the parser to pick up a few more VASP calculations, as when this was originally uploaded the bug VASP calculation not detected · Issue #6 · nomad-coe/nomad-parser-vasp · GitHub was still in effect.)

I don’t know what those log lines are. They do not look like Python, and bash or sh errors would also look different? I am not sure.

That the entries do not appear is NOMAD’s fault again. The new entries have no user metadata set. This includes published, with_embargo, co_authors, shared_with, datasets, comments, references. Maybe you don’t see the entries because the system still thinks they are not published. Could you verify by logging in (provided this is your upload), or log in as the admin user and manually set owner=admin in the search GUI URL?

This raises some interesting questions. Setting published to true by default makes sense if the upload is published; if an upload is published, all its entries are supposed to be published. I could fix that. But the other attributes can be edited on a per-entry basis, so leaving them empty could make sense.

with_embargo is more complex. When you publish an upload, you can set an embargo period on the upload, but you can lift the embargo on a per-entry basis. Should I simply set the embargo to false if there is no embargo period on the upload and to true otherwise?

Another problem is that the file might end up in the wrong .zip file (there is a .zip for the public files and a .zip for the embargoed ones). If the lobster file is in a directory with an already detected entry, it is probably in the public .zip; if the lobster file is in its own directory, it is probably in the restricted .zip file.
Depending on this, these .zip files would need to be repacked after each reprocess. There is another CLI command for this (nomad admin uploads re-pack).
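To make the two .zip files concrete (paths as in the file scheme quoted at the top of the thread, with upload_id as a placeholder; this is a summary of the thread, not official documentation):

    .volumes/fs/public/upload_id/raw-public.plain.zip      files from directories with a detected mainfile (public)
    .volumes/fs/public/upload_id/raw-restricted.plain.zip  embargoed files and files from directories where no mainfile was detected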

I have no idea about the logs either. I think it actually really looks like a shell (bash?). As far as I can see, the only place where a shell could potentially be called (through os.system) is in _preprocess_files for the POTCARs. If you think this is possible, I can place some debug prints there to check.

And you were correct, the entries were parsed but are unpublished.

Regarding the rest, I’m not sure if this was a question for me or for your NOMAD colleagues :slight_smile:

I definitely agree that if the upload is published, it is right to publish the newly discovered stuff as well.

Regarding the other user-edited metadata, we can’t do much. From what I can see in our oasis, users tend not to assign individual metadata to entries, but rather select the whole upload and add the references/authors/etc. to all entries. In such cases it might make sense to add the same metadata to the new entry as well (if all the other entries have it), but maybe it is different for NOMAD (and this might be too complicated).

A simple solution could be to email users that new entries were discovered (like the “processing completed” email) and let them deal with it…

Regarding the embargo, I think the safe solution would be to assume the same embargo for the newly detected entries as for the whole upload (but to be honest, I’m not sure I understand 100% how this works in the first place).

And yes, it is more than likely that the new files are in the wrong .zip file. Right now, all directories with undetected mainfiles go into raw-restricted.plain.zip, so if a new mainfile is detected in a directory where no mainfile was detected previously, it will definitely be in the wrong file. Though I don’t fully understand the implications here…

BTW, the strange logs could be caused by the POTCARs:

sh: 2: cannot open .volumes/fs/staging/8X/8XBfvqjXSf60VuoJ03_QEw/raw/PdAlYNi/large: No such file
sh: 2: Syntax error: "(" unexpected

Maybe they come from paths with spaces and brackets in the names?

unzip -l raw-restricted.plain.zip | grep POTCAR
...
   400528  2017-01-05 08:02   PdAlYNi/n9 (US)/n92/POTCAR
   400528  2017-10-04 13:17   PdAlYNi/large cell/static_relaxation/static_relaxation0.90/POTCAR
...

Corresponding code in nomad/processing/data.py:1222

    os.system(
        '''
            awk < %s >> %s '
            BEGIN { dump=1 }
            /End of Dataset/ { dump=1 }
            dump==1 { print }
            /END of PSCTR/ { dump=0 }'
        ''' % (
            self.staging_upload_files.raw_file_object(path).os_path,
            self.staging_upload_files.raw_file_object(stripped_path).os_path))

Also, the newly discovered entries have POTCAR.stripped.stripped in the raw data list; however, when one tries to open it, there is an error: “You are trying to access information that does not exist. Please try again and let us know, if this error keeps happening.” This suggests that the stripping might not be completely ready for the new-entry discovery and reprocessing we are doing.

Thanks for the POTCAR investigation. I’ll try to find some lib that can escape the paths properly.

For now, I will simply add the publish and embargo metadata depending on the upload settings. I don’t think we can avoid running the re-pack.

Do you need something to fix the metadata on the entries that were discovered yesterday?

In the future, we will very likely discontinue the per entry embargo mechanism. It is very rarely used and creates a lot of complicated solutions and problems.

For the entries discovered yesterday, I need some way to publish all of the newly discovered ones.

I was considering doing a chown on the upload to change ownership to myself temporarily, publishing from the GUI, and then chowning it back. But I’m not sure this will work: as the upload is already published, I’m not sure the GUI is ready for the case where some entries are published and some are not.

I added a fix for the POTCAR path. I could reproduce this with opening parentheses in the path and simply quoted the path in the shell script.
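Roughly, the idea is to quote both interpolated paths, e.g. with shlex.quote; a minimal sketch (not the exact code from the branch):

    import os
    import shlex

    def strip_potcar(orig_path, stripped_path):
        # Sketch only. Quoting the paths with shlex.quote() before interpolating
        # them into the shell command keeps names like 'n9 (US)' or 'large cell'
        # from being interpreted by /bin/sh.
        os.system(
            '''
                awk < %s >> %s '
                BEGIN { dump=1 }
                /End of Dataset/ { dump=1 }
                dump==1 { print }
                /END of PSCTR/ { dump=0 }'
            ''' % (shlex.quote(orig_path), shlex.quote(stripped_path)))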

I added a fix for the metadata on newly matched entries. It now sets published based on the upload. It only sets the embargo to true if there is another entry in the upload with an embargo.
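In other words, the defaulting rule is roughly the following (illustrative pseudo-Python with hypothetical names, not the actual implementation):

    def metadata_for_new_entry(upload, existing_entries):
        # Sketch only: published follows the upload, the embargo is inherited
        # if any already existing entry in the upload carries one.
        return dict(
            published=upload.published,
            with_embargo=any(e.with_embargo for e in existing_entries))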

Both fixes are currently in branch reprocess_match.

I still have to add some CLI functionality to edit metadata. Should not take too long. Once this is ready, you can use it to fix the metadata on the entries created yesterday.

I now also added a simple edit command to fix the metadata. Everything is now merged into v0.10.4. With the new code, entries should get published automatically (see my last comment). For the uploads that you already reprocessed with the old build, you can fix the metadata with:

    nomad admin uploads edit --publish no-embargo -- <upload_ids>

If the upload is already published, you also have to re-pack the published files. This applies to both versions; the re-process won’t do it on its own:

    nomad admin uploads re-pack -- <upload_ids>

The edit and re-pack commands work like the re-process command: if you give no <upload_ids>, they are applied to all uploads. There are no checks on the edit command; it is meant to fix inconsistencies like the one we created, so be careful to only apply it to uploads that are already published.

Thank you @mscheidgen, it now works perfectly. I’m really grateful for all the help. We also have a research data management system for experimental data, for which we paid quite a lot, and the support there is slow compared to the support I receive here for free. So huge thanks again to you and the rest of the team @ladinesa @laurih.

Thanks for this feedback, it is very much appreciated. I also want to give it back to you, because we need active users and contributors like you to improve NOMAD. In many regards, NOMAD is just a framework, and it needs custom parsers, normalisers, workflows, and visualisations to be useful.

By the way, we are proposing FAIRmat to expand NOMAD towards experimental and synthesis data. We have a lot of groups for various experimental methods on board. It is currently in its infancy, but we hope it will give a considerable boost to our development efforts. The idea is to turn the oasis into a more flexible framework with lots of customisation hooks that allow it to be integrated into all kinds of existing (or not yet existing) local data infrastructures.

Thanks for the FAIRmat link. Unfortunately, our current solution is already fine-tuned for efficient RDM for deposition experiments. We don’t have the RDM for the analysis tools working yet, and it would not make sense to use a different system for deposition and characterization, as they need to link to each other. But maybe we can make our system able to export to NOMAD (or reuse some of NOMAD’s experimental parsers). I’ll keep an eye on this.

I don’t see this working as well as it does for DFT, though. All DFT codes use plain-text output as far as I can see, while experimental machines and the analysis software usually use some custom binary format.

For example, I saw some work on XPS spectra in NOMAD. We have a KRATOS XPS machine which has its own file format. It contains all the important data and experimental settings (plus possible fits and quantification results), but the format is proprietary. You can export to the standardized VAMAS format, but at that point you already lose most of your metadata. Then you need some analysis software that can load the format (we use CasaXPS), and if you are lucky you can save the final output and quantification in some standardized format as well. But the question is what the ultimate plan would be here: push the instrument manufacturers to provide better formats (that’s not going to work for older machines), have the users add the metadata manually (they will not be happy about that), or reverse engineer the proprietary formats (I can’t even imagine how difficult that would be)?