Wrong parsing in oasis (possibly some bad settings somewhere?)

I did not want to open a new bug, as I’m not sure at this point whether it is a bug or just a bad configuration somewhere on our part.

I have some parsing failures in our oasis that I can’t reproduce in central nomad. One example was the issue mentioned in Another parsing failure: "TypeError: unsupported operand type(s) for +: 'int' and 'str'" · Issue #8 · nomad-coe/nomad-parser-vasp · GitHub, and now I have one more that I can’t reproduce in central nomad. See the attached OUTCAR. OUTCAR.zip (39.2 KB)

Here it complains that there is an “Inconsistent number of ions and species.” error, and both atom_labels and atom_species in section_system are empty.

What I have here is just 0.10.2 with minor changes (the latest vasp parser and a few patches that change the webpage wording to be more relevant for our oasis, but no functionality changes). I then do “docker build .” and deploy using the standard docker-compose.yaml from Operating an OASIS — NOMAD Repository and Archive documentation; the steps are sketched below. Any ideas what could be going wrong and how I should track this difference down?
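For completeness, the build/deploy steps are essentially (the image tag is just an example, not what the docs prescribe):

    docker build -t nomad-oasis:custom .    # build the image from my patched checkout
    docker-compose up -d                    # standard docker-compose.yaml from the OASIS docs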

How recent was the upload to NOMAD? Or better, what processing version is shown on the central nomad (it’s part of the metadata shown on an entry’s overview page)?

It is super unlikely that this is caused by a different configuration. I can only imagine that this is some regression in the parser, probably due to our new vasp/outcar parser (introduced when switching from NOMAD version 0.9.x to 0.10.x).

Central nomad 0.10.2/ce3954e5 works OK.

The strange thing is: if I copy the specific OUTCAR into my running oasis worker container with docker cp, connect with docker exec -ti nomad_oasis_worker /bin/bash, and run nomad parse --show-archive --parser parsers/vasp OUTCAR directly, it is parsed correctly. So it is only when I upload the same zipped file through the GUI that atom_labels and atom_species in section_system are not parsed.
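For reference, this is roughly what I did (the /tmp path is just where I happened to put the file):

    docker cp OUTCAR nomad_oasis_worker:/tmp/OUTCAR
    docker exec -ti nomad_oasis_worker /bin/bash
    cd /tmp
    nomad parse --show-archive --parser parsers/vasp OUTCAR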

I’m running a self-built image (standard docker build .) of nomad-FAIR c90507e56f2f807556a61c3076acd79ea8607356 with the latest vasp parser cbdce6c1287f186924e91c7d6ba2ff14e23f81ae (and a few custom patches which only change the webpage), deployed through docker-compose.

So it must be something with the docker/oasis settings, since it does not affect the parser when called from the command line directly, or possibly a problem with some component that runs before the parser itself (which I skip when copying the OUTCAR directly)?

Ok, hard to reproduce, everyone’s favorite kind of bug. I think you missed something in the second-to-last paragraph. Does it work with your own build or not? What build (image hash) are you using on the oasis where you get the error? I would try to reproduce this with the same image.

I can only imagine that zip introduces some encoding problem (but it should be binary), or that the image somehow has two versions of the parser installed (one in site-packages, another in the sources under /app/...), but I don’t know how this could happen. Have you changed anything in your container? Like running pip on nomad-lab or something?
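One way to check the two-versions theory would be something like this inside the worker container (I’m assuming the parser’s import name here; it might differ in your image):

    docker exec -ti nomad_oasis_worker /bin/bash
    pip show nomad-lab                                            # version and install location of the package
    python -c "import vaspparser; print(vaspparser.__file__)"     # which copy of the parser actually gets imported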

Now to add more mystery. If I restart the oasis with docker-compose down; docker-compose up -d, everything works, so I thought this could really be something stupid on my part and decided to reprocess the upload with the problematic cases. So I ran nomad admin uploads re-process -- -NC3zeQuRTOlGO6wkrV7qg from inside the worker container, but when I checked, the entries were still parsed wrong. AND when I now, after the manual reprocessing, upload the same zip file which was parsed OK before, it fails to parse. So it seems the whole upload is somehow corrupting my oasis state. AND it is reproducible with the gitlab-registry.mpcdf.mpg.de/nomad-lab/nomad-fair:stable images as well, so after all there is probably nothing wrong with my docker build.

I’ll try to get the original upload (and permission to share it). BTW, can I get the original byte-for-byte uploaded file for a specific upload_id from the database? This is not possible from the GUI, as was discussed previously and tracked in Download whole upload button (#514) · Issues · nomad-lab / nomad-FAIR · GitLab, but can I do it somehow with the nomad admin tool?

When NOMAD receives a .zip or .tar file, it gets decompressed and we do not keep the original uploaded file. When published, the data gets compressed again, but in a standardised NOMAD way (two .zip files, one for public and one for embargoed data). When you reprocess, these NOMAD .zip files get extracted again to run the processing. NOMAD has two directories (mounted at /app/.volumes/fs/public and /app/.volumes/fs/staging) with compressed/archived published uploads and extracted unpublished (staging) uploads. Maybe you can use this info to do some byte-by-byte comparisons of files. We have not yet gotten around to implementing the whole-upload download button.
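For example, something along these lines should let you compare the OUTCAR inside the NOMAD zip with the file you originally uploaded (all paths are placeholders):

    cd /app/.volumes/fs/public/<upload_id>
    unzip -p raw-public.plain.zip path/inside/upload/OUTCAR > /tmp/OUTCAR.from_nomad   # extract one file to stdout
    sha256sum /tmp/OUTCAR.from_nomad /path/to/originally/uploaded/OUTCAR
    cmp /tmp/OUTCAR.from_nomad /path/to/originally/uploaded/OUTCAR                     # byte-by-byte comparison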

OK, so I found the upload_id folder in /app/.volumes/fs/public/ and there are five files:

-NC3zeQuRTOlGO6wkrV7qg/archive-public.msg.msg
-NC3zeQuRTOlGO6wkrV7qg/archive-restricted.msg.msg
-NC3zeQuRTOlGO6wkrV7qg/raw-public.plain.zip
-NC3zeQuRTOlGO6wkrV7qg/raw-restricted.plain.zip
-NC3zeQuRTOlGO6wkrV7qg/user_metadata.pickle

So the biggest one is raw-public.plain.zip at 10 GB. And now I have no idea what I’m looking for. There doesn’t seem to be anything special; if I download and unpack it, I just get 10 GB of VASP calculations (is the zip file supposed to have no compression, btw?). How should I debug further?

My assumption was that the OUTCAR in the raw-public.plain.zip is somehow different from the one that you used to upload, and that therefore one causes the error and the other does not.

Right now it’s not clear if the event which actually switches the oasis from the working state to the “bad state” is related to the failing OUTCAR itself, or if the failing OUTCAR is just a later symptom.

If I upload the bad OUTCAR after an oasis restart, even multiple times, it will work OK each time, so just uploading it is not enough to trigger the “bad state”. I have to start the reprocessing of the whole upload which contained the OUTCAR (but also hundreds of others), and after that the same OUTCAR upload will be broken every time.
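As a condensed reproduction recipe (using my upload_id, same re-process command as before):

    docker-compose down && docker-compose up -d                    # fresh state
    # upload the zipped OUTCAR through the GUI -> parses fine, even repeatedly
    docker exec -ti nomad_oasis_worker \
        nomad admin uploads re-process -- -NC3zeQuRTOlGO6wkrV7qg   # reprocess the big upload
    # upload the same zipped OUTCAR through the GUI again -> atom_labels/atom_species now missing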

I’ll try to check if I can narrow down which part of the whole upload is causing the “bad state”.

What’s also plausible is that the VASP parser (where parts might be reused between runs) gets into a bad state. I feel stupid that this plausible cause didn’t come to mind earlier. I can run a few tests which try a sequence of OUTCARs on the same parser instance.
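Something like this minimal sketch is what I have in mind; the import path and the parse(mainfile, archive, logger) interface are written from memory here and may differ from the actual code:

    from nomad import utils
    from nomad.datamodel import EntryArchive
    from vaspparser import VASPParser   # assumed import path for the vasp parser package

    logger = utils.get_logger(__name__)
    parser = VASPParser()               # one parser instance, deliberately reused across runs

    for mainfile in ['OUTCAR_first', 'OUTCAR_failing']:   # placeholder file names
        archive = EntryArchive()
        parser.parse(mainfile, archive, logger)
        # inspect archive.section_run / section_system here, e.g. whether atom_labels came out empty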

Could you try disabling the parser re-use and see if the problem goes away? You can add this top-level key to your nomad.yaml

process_reuse_parser: false
process_reuse_parser: false

did the trick. Now I assume re-use is enabled by default because disabling it has some performance implications?

Let me know if you can reproduce it now that the cause is localized a bit more (I guess you still need some special OUTCAR/vasprun.xml to corrupt the parser state). The OUTCAR attached here and the one in Another parsing failure: "TypeError: unsupported operand type(s) for +: 'int' and 'str'" · Issue #8 · nomad-coe/nomad-parser-vasp · GitHub are probably just symptoms of the bad parser state, not the cause.

If you can’t reproduce it, I’ll try harder to get permission to share the original big upload that reproducibly triggers this every time…

If I run your OUTCAR and another OUTCAR on the same parser instance, I get new errors in the logs. It is a different error, but it might be a different symptom of the same disease. I asked Alvin to look into this: OUTCAR parser produces more errors when reused · Issue #9 · nomad-coe/nomad-parser-vasp · GitHub