Upgrade of old Oasis

Are there any caveats to be expected when upgrading from the old 0.10.5 version to the latest main (or develop)? Should I just build a new docker image, and is it supposed to work with the old database, config and all?

No, this won't work out of the box. You have to migrate the data and reprocess everything. In principle, you should be able to follow this documentation: Operating an OASIS - Documentation. Please read it before you go on here!

Make sure to back up your old mongodb (and ideally also the files). These are the current docs on backups: Operating an OASIS - Documentation. You need to change the database name to the one that was used in the old version (the old name is most likely nomad_fairdi).
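For example, a backup of the old database could be taken along these lines (a minimal sketch; the mongo container name is an assumption based on a typical Oasis docker-compose setup, and the database name has to match your old config):

# dump the old mongo database (assumed here to be nomad_fairdi) into an archive file on the host
docker exec nomad_oasis_mongo mongodump --db nomad_fairdi --archive > nomad_fairdi.mongodump.archive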

You have to make sure that you are using the same volumes/mounting the same directories for the new nomad. You basically run the new nomad on the old raw files and old archive files, but with a newly migrated mongo database and a new, updated elasticsearch index. The new nomad uses a different suffix on the archive files, and both archive versions can exist at the same time. During reprocessing, it basically creates a "copy" of the archive. The raw files are fully shared between versions. I wouldn't recommend it, but in principle you can run both versions at the same time.
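As a quick sanity check that the new deployment really points at the old data, something like the following could help (a sketch assuming a docker-compose setup; paths depend on your installation, and older installs use docker-compose instead of docker compose):

# show which host directories the new services mount, then peek at the shared file store
docker compose config | grep -A 3 'volumes:'
ls .volumes/fs/public | head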

I followed the docs and so far everything seems to work OK, except I’m now facing an issue in the last step.

I imported the old db and confirmed I see all the uploads with
nomad admin uploads ls

Then I started reprocessing a single upload as a test and ended up with an error:

ERROR    nomad.processing     2023-03-24T09:41:32 process failed with exception
  - exception: Traceback (most recent call last):
      File "/usr/local/lib/python3.7/site-packages/nomad/processing/base.py", line 860, in proc_task
        rv = unwrapped_func(proc, *args, **kwargs)
      File "/usr/local/lib/python3.7/site-packages/nomad/processing/data.py", line 1555, in process_upload
        path_filter, only_updated_files)
      File "/usr/local/lib/python3.7/site-packages/nomad/processing/data.py", line 1589, in _process_upload_local
        updated_files = self.update_files(file_operations, only_updated_files)
      File "/usr/local/lib/python3.7/site-packages/nomad/processing/data.py", line 1717, in update_files
        self.upload_files.to_staging_upload_files(create=True)
      File "/usr/local/lib/python3.7/site-packages/nomad/files.py", line 1331, in to_staging_upload_files
        raw_zip_file = self.raw_zip_file_object()
      File "/usr/local/lib/python3.7/site-packages/nomad/files.py", line 1272, in raw_zip_file_object
        self.access  # Invoke to initialize
      File "/usr/local/lib/python3.7/site-packages/nomad/files.py", line 1253, in access
        raise KeyError('Inconsistency: both public and restricted files found')
    KeyError: 'Inconsistency: both public and restricted files found'
  - exception_hash: jAIsvzrJWE77ZE-ImPEr6jdLwhUy
  - nomad.commit: 
  - nomad.deployment: oasis
  - nomad.processing.error: 'Inconsistency: both public and restricted files found'
  - nomad.processing.main_author: 66743202-925f-4fe6-80f0-27f0eb4e17dc
  - nomad.processing.main_author_name: Ondračka Pavel
  - nomad.processing.proc: Upload
  - nomad.processing.process: process_upload
  - nomad.processing.process_status: RUNNING
  - nomad.processing.process_worker_id: Y3y0S5k6SNCvv_ulYJzclA
  - nomad.processing.upload_name: aTiN.tar.gz
  - nomad.service: unknown nomad service
  - nomad.upload_id: 4oO8qNHUSe-GFgX1oNB8UQ
  - nomad.version: 1.1.8.dev0+ge8e774f58.d20230228

BTW, the docs claim the command is nomad admin uploads reprocess, but it should be just process.
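For reference, the working invocation for a single upload would then presumably look like this (using the upload id from the log above):

nomad admin uploads process 4oO8qNHUSe-GFgX1oNB8UQ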

Let me try to explain the error. One of the v0.x → v1.x changes is that we got rid of per-entry embargoes. Now only a full upload can have an embargo or not. In the file system (.volumes/fs/public/**) the files are stored as (raw-public.plain.zip, raw-restricted.plain.zip) and (archive-public.msg.msg, archive-restricted.msg.msg). Here restricted refers to files with embargo and public to files without. In v0.x all four files were present (even if empty). Now in v1.x there is a check that only one of each pair may exist.

I am not sure if the migrate mongo step has removed the other files. I assume you never used embargoes on anything? Can you check if the restricted files are still there? You could simply remove all restricted files (check that they are empty first).
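To get an overview first, something like the following could list the restricted files and their sizes (assuming the .volumes/fs/public layout described above):

find .volumes/fs/public \( -name 'raw-restricted.plain.zip' -o -name 'archive-restricted.msg.msg' \) -exec ls -lh {} +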

And sorry about the reprocess; we renamed it to process at some point.

OK, so there are always 5 files, like here:

-rw-r--r-- 1 nomad nomad  35M Nov 30  2021 archive-public.msg.msg
-rw-r--r-- 1 nomad nomad   32 Nov 30  2021 archive-restricted.msg.msg
-rw-r--r-- 1 nomad nomad 3.7G Nov 30  2021 raw-public.plain.zip
-rw-r--r-- 1 nomad nomad  34M Nov 30  2021 raw-restricted.plain.zip
-rw-r--r-- 1 nomad nomad   73 Nov 30  2021 user_metadata.pickle

In this Oasis per-entry embargoes were never used, however the raw-restricted.plain.zip files still contain valuable data. There are obviously some POTCARs, which I don't care about, but there are also files from folders where no mainfile was detected (either because no parser was available at that time, or because it is just some supplementary data that is still valuable). So I can't delete them.

Yes, you are right. We should merge the files instead. But the archive-restricted.msg.msg files are all empty, right? (32 bytes is empty for this file format.)

I will dig a bit and try to find the right scripts for the merge.

Yeah, every archive-restricted.msg.msg is just 32 bytes of the same binary data.

Unfortunately, there are no scripts to do the merge. I was remembering wrong: I thought we did something like this, but we did not. We just moved the unwanted files to a backup. Our logic was: it was not public before, so why make it public now? I guess for your Oasis the situation is different.

There is obviously tooling like zipmerge that could be used, but it still requires someone/something to go through all the directories. I don't know how many uploads you have? I am afraid there is not much utility for such a script beyond your Oasis.
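For illustration, a rough sketch of such a merge loop, assuming the directory layout from the listing above and zipmerge from libzip; the glob depth may differ in your installation, so test this on a backup copy first:

# merge each non-empty raw-restricted zip into its public counterpart (sketch, run on a backup first)
for dir in .volumes/fs/public/*/*; do
    restricted="$dir/raw-restricted.plain.zip"
    public="$dir/raw-public.plain.zip"
    if [ -s "$restricted" ] && [ -f "$public" ]; then
        zipmerge "$public" "$restricted" && echo "merged: $dir"
    fi
done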

Sorry that this isn't sorted out better, but you are basically the first (and only one) of the rare cases for this migration. Let us know when you need help with a merge script.

OK, don't worry. But just to be clear: I merge the contents of raw-restricted.plain.zip into raw-public.plain.zip, then I delete all archive-restricted.msg.msg files and the original raw-restricted.plain.zip, and do the reprocessing.

Am I assuming correctly that at that point there is no turning back and the possibility to run the old and new Oasis versions on top of the same raw data would be lost? (Up to now the old version is still working fine with the old database and the raw files; would that still be the case afterwards?)

Yes, you merge raw-restricted.plain.zip into raw-public.plain.zip.

If you want to be safe, you can ignore the archive-* files and leave them as they are. The new version adds a version "suffix" and works with file names like archive-*-v1.msg.msg; it will ignore your existing archive files. So just keep all the archive files if you want. This will keep the old version working, even after the processing, because the results of the processing are stored in the new *-v1.msg.msg files. The version "suffix" is also used for the non-published archives. The whole point of the "suffix" was to make both versions work in parallel.

I am not 100% sure how much the old version will miss the restricted raw files, but it should basically keep working. If nomad needs a file, it looks into the public zip and, if the user has the rights, also into the restricted zip. It is possible that you get some errors when you look at your own uploads, I am not sure exactly.
But you will basically just remove .zip files that are empty (after the merge) and could easily be re-created.

OK, further we go. After fixing the files for one upload and running the reprocessing again, the parsing fails with:

ERROR    nomad.processing     2023-03-25T13:16:51 process failed
  - exception: Traceback (most recent call last):
      File "/usr/local/lib/python3.7/site-packages/nomad/processing/data.py", line 1142, in parsing
        parser = parser.__class__()
    TypeError: __init__() missing 1 required positional argument: 'parser_class_name'
  - exception_hash: 4KSIzXlz0olDx7pukivEmraw4dW4
  - nomad.commit: 
  - nomad.deployment: oasis
  - nomad.entry_id: buGWQG4lNJuvsznlcojC_N6fYu9J
  - nomad.mainfile: aTiN/Ti1.1N1.0-500/VASP0Krelax/vasprun.xml
  - nomad.processing.error: __init__() missing 1 required positional argument: 'parser_class_name'
  - nomad.processing.errors: could not re-create parser instance
  - nomad.processing.logger: nomad.processing
  - nomad.processing.parser: parsers/vasp
  - nomad.processing.proc: Entry
  - nomad.processing.process: process_entry
  - nomad.processing.process_status: RUNNING
  - nomad.processing.process_worker_id: Af1HKnZ9S4ezOOhPPmzJzA
  - nomad.processing.step: parsers/vasp
  - nomad.service: unknown nomad service
  - nomad.upload_id: 4oO8qNHUSe-GFgX1oNB8UQ
  - nomad.version: 1.1.8.dev0+ge8e774f58.d20230228

This seems to be independent of the specific parser (so far tested with VASP and Wien2k).

Have you explicitly set process.reuse_parser: False in your nomad.yaml? The code that tries to create a new parser instance because of the disabled reuse looks bad. It's probably not right, but we did not catch this before because the setting is rarely used. I guess at some point we had some trouble with re-using the same parser over and over again and introduced this setting.

You could try to set process.reuse_parser: True (or simply remove it from the config) and try again. Otherwise, you need to wait until we investigate/fix this.
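As a quick check whether the old setting was carried over (the key name is the one discussed here; run this next to your nomad.yaml):

# if this finds reuse_parser: false, remove the line or set it to true, then restart the worker
grep -n 'reuse_parser' nomad.yaml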

The fix is already in the pipeline: Draft: Resolve "Processing without reuse parsers does not work" (!1193) · Merge requests · nomad-lab / nomad-FAIR · GitLab

Indeed, I copied the process_reuse_parser: false from the old nomad.yaml. There were some random failures in the past in this specific Oasis, see Wrong parsing in oasis (possibly some bad settings somewhere?) for the thread from back then. But maybe this got fixed in the meantime, so I'll try reprocessing with the default and see if the workaround is no longer needed.

I've been running the mass reprocessing with mixed success so far, so maybe just a few comments on the complications I ran into; hopefully they might be useful for someone else doing the upgrade as well:

  • The v1.1 Oasis seems to be somewhat more resource heavy than the 0.10 one. I've run into several limits, specifically the 1800 s celery timeout (that was easy to bump up in the config) and the open-files limit (it took me some time to realize I need to fix the limits for the docker service, not just in the host VM OS; see the sketch after this list). All of this happens only for large uploads (usually tens of GBs and thousands of entries).
  • As a result of the zipmerge (done previously with the default settings), the raw archives are now compressed, which is also not so good for speed (I believe the raw zip files are not compressed by default, right?). Maybe I need repacking to clean this up?
  • The memory consumption also seems to be higher. IIRC previously it was the parsers using most of the memory (e.g. the VASP parser running on a 1 GB vasprun.xml needing over 10 GB of memory); now it seems like some step after the parsing is taking a lot of memory (I've seen up to 30 GB for some large uploads), but I was not able to figure out what that is.
  • Due to the above problems, I ran out of disk space once, because all the failed processings leave files in the staging directory, but that was my bad for not keeping a closer eye on it. :slight_smile:
  • The only issue I was not able to work around is some occasional elastic connection problems, but it seems to affect only specific uploads (most processing works fine), so I have no idea what is going on here. Might it be some limits issue as well, like too-big requests for the large uploads or something? Any ideas here would be appreciated. The logs look like this:
WARNING  elasticsearch        2023-03-30T00:35:16 GET http://elastic:9200/ [status:N/A request:0.002s]
  - lineno: 294
  - nomad.commit: 
  - nomad.deployment: oasis
  - nomad.service: unknown nomad service
  - nomad.version: 1.1.8.dev0+ge8e774f58.d20230228
  - process: 26
  - stack_trace: Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 175, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 95, in create_connection
    raise err
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/elasticsearch/connection/http_urllib3.py", line 256, in perform_request
    method, url, body, retries=Retry(False), headers=request_headers, **kw
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 788, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 525, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.7/site-packages/urllib3/packages/six.py", line 770, in reraise
    raise value
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 710, in urlopen
    chunked=chunked,
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 398, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 239, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/usr/local/lib/python3.7/http/client.py", line 1281, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/lib/python3.7/http/client.py", line 1327, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.7/http/client.py", line 1276, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.7/http/client.py", line 1036, in _send_output
    self.send(msg)
  File "/usr/local/lib/python3.7/http/client.py", line 976, in send
    self.connect()
  File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 205, in connect
    conn = self._new_conn()
  File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 187, in _new_conn
    self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f841c147ed0>: Failed to establish a new connection: [Errno 111] Connection refused

  - thread_name: MainThread
ERROR    celery.utils.dispatc 2023-03-30T00:35:16 Signal handler <function setup at 0x7f8449a97ef0> raised: ConnectionError('N/A', '<urllib3.connection.HTTPConnection object at 0x7f841c147ed0>: Failed to establish a new connection: [Errno 111] Connection refused', NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f841c147ed0>: Failed to establish a new connection: [Errno 111] Connection refused'))
  - lineno: 293
  - nomad.commit: 
  - nomad.deployment: oasis
  - nomad.service: unknown nomad service
  - nomad.version: 1.1.8.dev0+ge8e774f58.d20230228
  - process: 26
  - stack_trace: Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 175, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 95, in create_connection
    raise err
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 85, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/elasticsearch/connection/http_urllib3.py", line 256, in perform_request
    method, url, body, retries=Retry(False), headers=request_headers, **kw
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 788, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 525, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.7/site-packages/urllib3/packages/six.py", line 770, in reraise
    raise value
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 710, in urlopen
    chunked=chunked,
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 398, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 239, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/usr/local/lib/python3.7/http/client.py", line 1281, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/lib/python3.7/http/client.py", line 1327, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.7/http/client.py", line 1276, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.7/http/client.py", line 1036, in _send_output
    self.send(msg)
  File "/usr/local/lib/python3.7/http/client.py", line 976, in send
    self.connect()
  File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 205, in connect
    conn = self._new_conn()
  File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 187, in _new_conn
    self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f841c147ed0>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/celery/utils/dispatch/signal.py", line 288, in send
    response = receiver(signal=self, sender=sender, **named)
  File "/usr/local/lib/python3.7/site-packages/nomad/processing/base.py", line 61, in setup
    infrastructure.setup()
  File "/usr/local/lib/python3.7/site-packages/nomad/infrastructure.py", line 72, in setup
    setup_elastic()
  File "/usr/local/lib/python3.7/site-packages/nomad/infrastructure.py", line 102, in setup_elastic
    create_v1_indices()
  File "/usr/local/lib/python3.7/site-packages/nomad/metainfo/elasticsearch_extension.py", line 910, in create_indices
    entry_index.create_index(upsert=True)  # TODO update the existing v0 index
  File "/usr/local/lib/python3.7/site-packages/nomad/metainfo/elasticsearch_extension.py", line 480, in create_index
    if not self.elastic_client.indices.exists(index=self.index_name):
  File "/usr/local/lib/python3.7/site-packages/elasticsearch/client/utils.py", line 347, in _wrapped
    return func(*args, params=params, headers=headers, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/elasticsearch/client/indices.py", line 372, in exists
    "HEAD", _make_path(index), params=params, headers=headers
  File "/usr/local/lib/python3.7/site-packages/elasticsearch/transport.py", line 417, in perform_request
    self._do_verify_elasticsearch(headers=headers, timeout=timeout)
  File "/usr/local/lib/python3.7/site-packages/elasticsearch/transport.py", line 606, in _do_verify_elasticsearch
    raise error
  File "/usr/local/lib/python3.7/site-packages/elasticsearch/transport.py", line 570, in _do_verify_elasticsearch
    "GET", "/", headers=headers, timeout=timeout
  File "/usr/local/lib/python3.7/site-packages/elasticsearch/connection/http_urllib3.py", line 280, in perform_request
    raise ConnectionError("N/A", str(e), e)
elasticsearch.exceptions.ConnectionError: ConnectionError(<urllib3.connection.HTTPConnection object at 0x7f841c147ed0>: Failed to establish a new connection: [Errno 111] Connection refused) caused by: NewConnectionError(<urllib3.connection.HTTPConnection object at 0x7f841c147ed0>: Failed to establish a new connection: [Errno 111] Connection refused)

  - thread_name: MainThread
  • On the parser side all is looking fine so far, just a few minor regressions; I sent one PR already and will open issues for the remaining ones if I can't figure them out on my own.
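Regarding the open-files limit from the first bullet, a minimal sketch of raising it per container, assuming a docker-run/docker-compose style deployment (the --ulimit flag corresponds to the ulimits key in a compose file; the values are just examples):

# compare the default limit inside a container with an explicitly raised one
docker run --rm alpine sh -c 'ulimit -n'
docker run --rm --ulimit nofile=65536:65536 alpine sh -c 'ulimit -n'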

Thank you very much for the feedback.

  • The parser matching got more complicated and we have almost twice as many parsers. There are two things in the making. First, we are trying to parallelize the matching. Second, we are working on enabling/disabling parsers as part of the config. Both things are still in PR and will probably be released towards the end of April. You can disable the rematching on reprocessing: NOMAD will not look for new entries and new parsers and will just apply what has been established in earlier processings. In your nomad.yaml, the key is rematch_published: False (see the sketch after this list).

  • The stuff "after the parsing" could be some additional indexing. It might be worthwhile to disable the materials indexing; in your nomad.yaml, the key is process.index_materials: False. We haven't really advertised the materials search much and the indexing isn't working very well (see the ES comments below). If you really need/want the materials search, you could create the materials index later in a separate step (e.g. nomad admin uploads index, see the sketch after this list).

  • In May we will do a full processing of the theory data. I guess this is where we will also run into problems and probably produce more fixes.

  • Once the raw files are compressed, I don't think there is a harmful performance impact. The compression itself was the "hard" part.

  • If your upload processing fails, the reprocessing just stops. There is the nomad admin clean command for the leftovers. Handle it with care.

  • I think the ES problems will go away when you disable the materials index. Materials indexing invokes very heavy ES operations, especially with many entries in an upload. It probably just overburdens the ES server and causes outages due to recovery operations.
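To tie the configuration hints from this list together, a quick way to check for the keys and to run the separate indexing step mentioned above (key names as given in this thread; run next to your nomad.yaml):

# check whether the two keys are already set in your nomad.yaml
grep -n -E 'rematch_published|index_materials' nomad.yaml
# the materials index can still be created later in a separate step:
nomad admin uploads index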

Here are some instructions for the clean command.

nomad admin clean --skip-entries --skip-es --staging-too --dry

This will look for upload folders that are not in mongo and for staging uploads that have a published counterpart. Without --dry you should get the option to delete the staging ones. Do the --dry run first and check.
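Once the dry run shows only what you expect, the same command without --dry will then offer to actually delete the staging leftovers:

nomad admin clean --skip-entries --skip-es --staging-too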