Extremely slow startup of the app container

SBehnle · January 24, 2024, 1:38pm

Dear NOMAD developers,
i am currently trying to deploy a NOMAD oasis for our institute and am facing issues with the app container. The container takes up to 15 minutes to finally come up for no apparent reason:

[root@u-030-s007 nomad]# docker compose logs app
nomad_oasis_app | [2024-01-24 12:07:44 +0100] [7] [INFO] Starting gunicorn 21.2.0
nomad_oasis_app | [2024-01-24 12:07:44 +0100] [7] [INFO] Listening at: http://0.0.0.0:8000 (7)
nomad_oasis_app | [2024-01-24 12:07:44 +0100] [7] [INFO] Using worker: uvicorn.workers.UvicornWorker
nomad_oasis_app | [2024-01-24 12:07:44 +0100] [20] [INFO] Booting worker with pid: 20
nomad_oasis_app | [2024-01-24 12:07:45 +0100] [21] [INFO] Booting worker with pid: 21
nomad_oasis_app | [2024-01-24 12:07:45 +0100] [22] [INFO] Booting worker with pid: 22
nomad_oasis_app | [2024-01-24 12:07:45 +0100] [23] [INFO] Booting worker with pid: 23
nomad_oasis_app | [2024-01-24 12:21:03 +0100] [23] [INFO] Started server process [23]
nomad_oasis_app | [2024-01-24 12:21:03 +0100] [23] [INFO] Waiting for application startup.
nomad_oasis_app | [2024-01-24 12:21:03 +0100] [21] [INFO] Started server process [21]
nomad_oasis_app | [2024-01-24 12:21:03 +0100] [21] [INFO] Waiting for application startup.
nomad_oasis_app | [2024-01-24 12:21:03 +0100] [22] [INFO] Started server process [22]
nomad_oasis_app | [2024-01-24 12:21:03 +0100] [22] [INFO] Waiting for application startup.
nomad_oasis_app | [2024-01-24 12:21:03 +0100] [20] [INFO] Started server process [20]
nomad_oasis_app | [2024-01-24 12:21:03 +0100] [20] [INFO] Waiting for application startup.
nomad_oasis_app | [2024-01-24 12:22:13 +0100] [21] [INFO] Application startup complete.
nomad_oasis_app | [2024-01-24 12:22:14 +0100] [20] [INFO] Application startup complete.
nomad_oasis_app | [2024-01-24 12:22:16 +0100] [22] [INFO] Application startup complete.
nomad_oasis_app | [2024-01-24 12:22:18 +0100] [23] [INFO] Application startup complete.

During the first minute of startup, docker top shows quite some load but that rapidly drops and the container just seems to idle (no CPU load, no disk I/O, no network I/O, enough RAM available).

While trying to track down the issue, i attached strace to the python process and observed that it is doing file system syscalls:

…
pselect6(4, [3], , , {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
newfstatat(6, “”, {st_mode=S_IFREG|600, st_size=0, …}, AT_EMPTY_PATH) = 0
newfstatat(7, “”, {st_mode=S_IFREG|600, st_size=0, …}, AT_EMPTY_PATH) = 0
newfstatat(8, “”, {st_mode=S_IFREG|600, st_size=0, …}, AT_EMPTY_PATH) = 0
newfstatat(9, “”, {st_mode=S_IFREG|600, st_size=0, …}, AT_EMPTY_PATH) = 0
…
The moment the application recovers and finally starts up, these syscalls change to
…
pselect6(4, [3], , , {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
newfstatat(6, “”, {st_mode=S_IFREG|001, st_size=0, …}, AT_EMPTY_PATH) = 0
newfstatat(7, “”, {st_mode=S_IFREG|001, st_size=0, …}, AT_EMPTY_PATH) = 0
newfstatat(8, “”, {st_mode=S_IFREG|001, st_size=0, …}, AT_EMPTY_PATH) = 0
newfstatat(9, “”, {st_mode=S_IFREG|001, st_size=0, …}, AT_EMPTY_PATH) = 0
…
accompanied with an increase in CPU usage. After that, the container successfully starts and NOMAD becomes usable.
During the whole 14 minute startup delay, the machine is effectively idling (no CPU load, no disk or network I/O).

The machine NOMAD is running in is a VM with four cores and 16 GB RAM hosted on rather recent hardware.
The NOMAD volumes are located on a large NFS mount. Might this cause the problem?

The underlying docker engine is docker 25.0 on Rocky Linux 9.3:

[root@u-030-s007 fs]# docker info
Client: Docker Engine - Community
Version: 25.0.0
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.12.1
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.24.1
Path: /usr/libexec/docker/cli-plugins/docker-compose

Server:
Containers: 7
Running: 7
Paused: 0
Stopped: 0
Images: 12
Server Version: 25.0.0
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
Swarm: inactive
Runtimes: runc io.containerd.runc.v2
Default Runtime: runc
Init Binary: docker-init
containerd version: a1496014c916f9e62104b33d1bb5bd03b0858e59
runc version: v1.1.11-0-g4bccb38
init version: de40ad0
Security Options:
seccomp
Profile: builtin
cgroupns
Kernel Version: 5.14.0-362.13.1.el9_3.x86_64
Operating System: Rocky Linux 9.3 (Blue Onyx)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 15.56GiB
Name: u-030-s007
ID: 17341a6c-20f5-4a76-a0fd-8cf7ecddaf09
Docker Root Dir: /dockerdata/volumes
Debug Mode: false
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false

SELinux is enabled, but there are no SELinux alerts whatsoever.

Simply disabling the nexus parser as proposed in Issue with running app container (NOMAD oasis) is not an option.

Is there any way to further narrow down and eliminate the issue? What am i doing wrong?

Btw., some container feature the environment variable NOMAD_FS_EXTERNAL_WORKING_DIRECTORY, typically set to $PWD, for which i was unable to find any documentation.

Best
Stefan

These are my (redacted) docker-compose.yaml and nomad.yaml (first post, unable to create attachments):

version: "3"

services:
  #  no internal keycloak!
  # keycloak user management
  #keycloak:
  #  restart: unless-stopped
  # image: jboss/keycloak:16.1.1
  # container_name: nomad_oasis_keycloak
  # environment:
  #   - PROXY_ADDRESS_FORWARDING=true
  #   - KEYCLOAK_USER=admin
  #   - KEYCLOAK_PASSWORD=password
  #   - KEYCLOAK_FRONTEND_URL=http://localhost/keycloak/auth
  #   - KEYCLOAK_IMPORT="/tmp/nomad-realm.json"
  #   command:
#      - "-Dkeycloak.import=/tmp/nomad-realm.json -Dkeycloak.migration.strategy=IGNORE_EXISTING -Dkeycloak.profile.feature.upload_scripts=enabled"
#   - "-Dkeycloak.import=/tmp/nomad-realm.json -Dkeycloak.migration.strategy=IGNORE_EXISTING"
# volumes:
      #     - keycloak:/opt/jboss/keycloak/standalone/data
      #- ./configs/nomad-realm.json:/tmp/nomad-realm.json
      # healthcheck:
      #test:
      # - "CMD"
      # - "curl"
      # - "--fail"
      # - "--silent"
      # - "http://127.0.0.1:9990/health/live"
      #interval: 10s
      #timeout: 10s
      #retries: 30
      #start_period: 30s

  # broker for celery
  rabbitmq:
    restart: unless-stopped
    image: rabbitmq:3.11.5 # replacable by "latest"??
    container_name: nomad_oasis_rabbitmq
    environment:
      - TZ=Europe/Berlin
      - RABBITMQ_ERLANG_COOKIE=<snip> # SECURITY!!! CHANGE!!!
      - RABBITMQ_DEFAULT_USER=rabbitmq
      - RABBITMQ_DEFAULT_PASS=<snip>              # SECURITY!!! CHANGE?
      - RABBITMQ_DEFAULT_VHOST=/
    volumes:
      - rabbitmq:/var/lib/rabbitmq
    healthcheck:
      test: ["CMD", "rabbitmq-diagnostics", "--silent", "--quiet", "ping"]
      interval: 10s
      timeout: 10s
      retries: 30
      start_period: 10s
        #networks:
        #- nomad_oasis_network

  # the search engine
  elastic:
    restart: unless-stopped
    image: docker.elastic.co/elasticsearch/elasticsearch:7.17.1 # replacable by "latest"??
    container_name: nomad_oasis_elastic
    environment:
      - TZ=Europe/Berlin
      - ES_JAVA_OPTS=-Xms512m -Xmx512m                          # change these if elastic demands more memory or crashes 
      - discovery.type=single-node
    volumes:
      - elastic:/usr/share/elasticsearch/data
    healthcheck:
      test:
        - "CMD"
        - "curl"
        - "--fail"
        - "--silent"
        - "http://elastic:9200/_cat/health"
      interval: 10s
      timeout: 10s
      retries: 30
      start_period: 60s
        #networks:
        #- nomad_oasis_network

  # the user data db
  mongo:
    restart: unless-stopped
    image: mongo:5.0.6 # replacable by "latest"??
    container_name: nomad_oasis_mongo
    environment:
      - TZ=Europe/Berlin
      - MONGO_DATA_DIR=/data/db
      - MONGO_LOG_DIR=/dev/null
    volumes:
      - mongo:/data/db             # this is a docker volume; can this be pointed to a bwSFS mount?
      - /bwsfs/NOMAD_LISAPLUS_PRODUCTION/nomad_volumes/mongo:/backup   # points to bind mount! How often does the backup run? Is it sufficient to back up to bwSFS? How to restore after complete disaster?
    command: mongod --logpath=/dev/null # --quiet
    healthcheck:
      test:
        - "CMD"
        - "mongo"
        - "mongo:27017/test"
        - "--quiet"
        - "--eval"
        - "'db.runCommand({ping:1}).ok'"
      interval: 10s
      timeout: 10s
      retries: 30
      start_period: 10s
        #networks:
        #- nomad_oasis_network
  # nomad worker (processing)
  worker:
    restart: unless-stopped
    image: gitlab-registry.mpcdf.mpg.de/nomad-lab/nomad-fair:latest #v1.2.1 # maybe fix to tested version in production!
    container_name: nomad_oasis_worker
    environment:
      TZ: Europe/Berlin
      NOMAD_SERVICE: nomad_oasis_worker
      NOMAD_RABBITMQ_HOST: rabbitmq
      NOMAD_ELASTIC_HOST: elastic
      NOMAD_MONGO_HOST: mongo
      # NOMAD_LOGSTASH_HOST: logtransfer                             # copied from recent template
      OMP_NUM_THREADS: 2                                           # adjust depending on VM size
    depends_on:
      rabbitmq:
        condition: service_healthy
      elastic:
        condition: service_healthy
      mongo:
        condition: service_healthy
    volumes:
      - ./configs/nomad.yaml:/app/nomad.yaml
      - /bwsfs/NOMAD_LISAPLUS_PRODUCTION/nomad_volumes/fs:/app/.volumes/fs                            # what is stored here???
    command: python -m celery -A nomad.processing worker -l info -Q celery
    extra_hosts:
      - keycloak.my-uni.de:1.2.3.4
        #networks:
        #- nomad_oasis_network


  # nomad app (api + proxy)
  app:
    restart: unless-stopped
    image: gitlab-registry.mpcdf.mpg.de/nomad-lab/nomad-fair:latest #v1.2.1 # same image as worker, fix to the same version
    container_name: nomad_oasis_app
    environment:
      TZ: Europe/Berlin
      NOMAD_SERVICE: nomad_oasis_app
      NOMAD_SERVICES_API_PORT: 80                                   # Do we need API access? Does it have to be port 80 or can we move to 443? How does API access work in detail? 
        #      NOMAD_FS_EXTERNAL_WORKING_DIRECTORY: "$PWD"
      NOMAD_FS_EXTERNAL_WORKING_DIRECTORY: "/bwsfs/NOMAD_LISAPLUS_PRODUCTION/"
      NOMAD_RABBITMQ_HOST: rabbitmq
      NOMAD_ELASTIC_HOST: elastic
      NOMAD_MONGO_HOST: mongo
      # NOMAD_LOGSTASH_HOST: logtransfer           # copied from recent template
      NOMAD_NORTH_HUB_HOST: north       # remove comment?
      GUNICORN_CMD_ARGS: "--access-logfile '-' --error-logfile '-' --log-level 'debug' "
    depends_on:
      rabbitmq:
        condition: service_healthy
      elastic:
        condition: service_healthy
      mongo:
        condition: service_healthy
      north:
        condition: service_started
#      keycloak:
#        condition: service_started
    volumes:
      - ./configs/nomad.yaml:/app/nomad.yaml
      - /bwsfs/NOMAD_LISAPLUS_PRODUCTION/nomad_volumes/fs:/app/.volumes/fs
    command: ./run.sh
    healthcheck:
      test:
        - "CMD"
        - "curl"
        - "--fail"
        - "--silent"
        - "http://localhost:8000/-/health"
      interval: 60s # 10s
      timeout: 10s
      retries: 30
      start_period: 60s # 10 -> 60 per https://matsci.org/t/issue-with-running-app-container-nomad-oasis/51306
    extra_hosts:
      - keycloak.my-uni.de:1.2.3.4
        #networks:
        #- nomad_oasis_network

  # nomad remote tools hub (JupyterHUB, e.g. for AI Toolkit)
  north:
    restart: unless-stopped
    image: gitlab-registry.mpcdf.mpg.de/nomad-lab/nomad-fair:latest #v1.2.1 # Same image as worker and app, fix to production version; test upgrade!!!
    container_name: nomad_oasis_north
    environment:
      TZ: Europe/Berlin
      NOMAD_SERVICE: nomad_oasis_north
      NOMAD_NORTH_DOCKER_NETWORK: nomad_oasis_network
      NOMAD_NORTH_HUB_CONNECT_IP: north
      NOMAD_NORTH_HUB_IP: "0.0.0.0"
      NOMAD_NORTH_HUB_HOST: north
      NOMAD_SERVICES_API_HOST: app
      NOMAD_FS_EXTERNAL_WORKING_DIRECTORY: "/bwsfs/NOMAD_LISAPLUS_PRODUCTION/" #"$PWD"
      NOMAD_RABBITMQ_HOST: rabbitmq
      NOMAD_ELASTIC_HOST: elastic
      NOMAD_MONGO_HOST: mongo
        #    depends_on:
      #      keycloak:
      #  condition: service_started
      # app:
      #   condition: service_started
    volumes:
      - ./configs/nomad.yaml:/app/nomad.yaml
      - /bwsfs/NOMAD_LISAPLUS_PRODUCTION/nomad_volumes/fs:/app/.volumes/fs
      - /var/run/docker.sock:/var/run/docker.sock
    user: '1000:991'                                 # replace by the output of "getent passwd | grep nomad" and getent group | grep docker"; 
    command: python -m nomad.cli admin run hub
    healthcheck:
      test:
        - "CMD"
        - "curl"
        - "--fail"
        - "--silent"
        - "http://localhost:8081/nomad-oasis/north/hub/health"
      interval: 10s
      timeout: 10s
      retries: 30
      start_period: 10s
    extra_hosts:
      - keycloak.my-uni.de:1.2.3.4
        #networks:
        #- nomad_oasis_network


  # nomad logtransfer
  # == DOES EXACTLY WHAT????? ==
  # to enable the logtransfer service run "docker compose --profile with_logtransfer up"
  logtransfer:
    restart: unless-stopped
    image: gitlab-registry.mpcdf.mpg.de/nomad-lab/nomad-fair:latest #v1.2.1
    container_name: nomad_oasis_logtransfer
    environment:
      TZ: Europe/Berlin
      NOMAD_SERVICE: nomad_oasis_logtransfer
      NOMAD_ELASTIC_HOST: elastic
      NOMAD_MONGO_HOST: mongo
    depends_on:
      elastic:
        condition: service_healthy
      mongo:
        condition: service_healthy
    volumes:
      - ./configs/nomad.yaml:/app/nomad.yaml
      - /bwsfs/NOMAD_LISAPLUS_PRODUCTION/nomad_volumes/fs:/app/.volumes/fs
    command: python -m nomad.cli admin run logtransfer
    profiles: ["with_logtransfer"]
      #networks:
      #- nomad_oasis_network

  # nomad proxy (a reverse proxy for nomad)
  proxy:
    restart: unless-stopped
    image: nginx:latest #1.13.9-alpine                        # upgradable? latest-alpine?
    container_name: nomad_oasis_proxy
    command: nginx -g 'daemon off;'
    environment:
      - TZ=Europe/Berlin
    volumes:
      - ./configs/nginx.conf:/etc/nginx/conf.d/default.conf:ro
      - ./ssl/:/etc/ssl/:ro
    depends_on:
      #      keycloak:
      #  condition: service_healthy
      app:
        condition: service_healthy #service_started
      worker:
        condition: service_started # TODO: service_healthy
      north:
        condition: service_healthy
    ports:
      - "80:80"     # standard http, try to get rid of...
      - "443:443"   # standard https, try to handle all traffic via https
        #networks:
        #- nomad_oasis_network

volumes:
  mongo:
    name: "nomad_oasis_mongo"
  elastic:
    name: "nomad_oasis_elastic"
  rabbitmq:
    name: "nomad_oasis_rabbitmq"
      #  keycloak:
      #    name: "nomad_oasis_keycloak"

networks:
  default:
    name: nomad_oasis_network

      #networks:
      #  nomad_oasis_network:
      #    driver: bridge

services:
  #  api_host: 'localhost'
  api_host: 'nomad.my-uni.de'
  api_port: 443
  api_base_path: '/nomad-oasis'
  api_timeout: 6000
  https: True
  https_upload: True
  admin_user_id: <snip> # TODO replace  
    #  aitoolkit_enabled: True
  console_log_level: 10
  upload_limit: 100000

oasis:
  is_oasis: True
  uses_central_user_management: False



    # recreate key!!    
north:
  enabled: True
  jupyterhub_crypt_key: <snip>

keycloak:
  server_url: 'https://keycloak.my-uni.de/'
  public_server_url: 'https://keycloak.my-uni.de/'
  realm_name: <snip>
  username: <snip>
  password: <snip>
  client_id: <snip>
  client_secret: <snip>
    #  redirect_uri: 'https://nomad.my-uni.de/nomad-oasis/gui/*'
# client_id: 'nomad_public'

meta:
  deployment: 'oasis'
  deployment_url: 'https://nomad.my-uni.de/api'
  maintainer_email: <snip>

mongo:
    db_name: nomad_oasis_v1

elastic:
    entries_index: nomad_oasis_entries_v1
    materials_index: nomad_oasis_materials_v1

# see https://matsci.org/t/issue-with-running-app-container-nomad-oasis/51306
#plugins:
#  exclude:
#    - parsers/nexus

mscheidgen · January 24, 2024, 2:43pm

Thanks for the detailed report.

What you describe is not normal. There is some start-up time, but it should be much lower than 15min and it should be accompanied by CPU-usage. The heavy things (e.g. the mentioned nexus initialisation) should happen after INFO: Started server process [1], INFO: Waiting for application startup.. We have no prior experience with this particular problem.

I assume, this is only happening for the “app” service? You can try to not use gunicorn to see if this changes anything. Change the “services.app.command” to

services:
  app:
    command: ["python", "-m", "nomad.cli", "admin", "run", "app", "--with-gui", "--host", "0.0.0.0"]

This is not a change you want to use in production, but it might help use to pin-point the problem. Gunicorn runs multiple instances of the API and does some load-balancing between the instances. Thats why you see all the logs 4 times. Maybe it bookting the 4 workings causes the issue somehow. At least the 15 min are between Booting worker with pid and Started server process, where we would only expect a few seconds.

You can also try to run nomad without docker-compose just to see if this changes anything:

docker run --rm gitlab-registry.mpcdf.mpg.de/nomad-lab/nomad-fair:latest python -m nomad.cli admin run app --with-gui

This will eventually fail, because there are no database backends to connect to, but it should do most of the start up before. Especially the INFO: Started server process [1], INFO: Waiting for application startup. should show pretty quickly (was ~5 sec. on my shabby laptop).

If this does not help, tracing the imports on startup might show us something:

docker run --rm gitlab-registry.mpcdf.mpg.de/nomad-lab/nomad-fair:latest python -v -m nomad.cli admin run app --with-gui

This will produce a lot of output really fast, but maybe you can spot a single import consuming the 15min.

SBehnle · January 24, 2024, 4:11pm

Thanks for the quick response. Unfortunately, changing the command did not resolve the issue:
The application startup is quite fast now

[root@u-030-s007 nomad]# docker compose logs -t app
nomad_oasis_app | 2024-01-24T15:27:17.961820071Z INFO: Started server process [1]
nomad_oasis_app | 2024-01-24T15:27:17.961912842Z INFO: Waiting for application startup.
nomad_oasis_app | 2024-01-24T15:28:25.961886182Z INFO: Application startup complete.
nomad_oasis_app | 2024-01-24T15:28:25.964435599Z INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
nomad_oasis_app | 2024-01-24T15:28:59.935719080Z INFO: 127.0.0.1:41880 - “GET /-/health HTTP/1.1” 200 OK
nomad_oasis_app | 2024-01-24T15:29:59.984580987Z INFO: 127.0.0.1:55784 - “GET /-/health HTTP/1.1” 200 OK
nomad_oasis_app | 2024-01-24T15:31:00.029415111Z INFO: 127.0.0.1:51424 - “GET /-/health HTTP/1.1” 200 OK
nomad_oasis_app | 2024-01-24T15:32:00.074138556Z INFO: 127.0.0.1:60658 - “GET /-/health HTTP/1.1” 200 OK
nomad_oasis_app | 2024-01-24T15:33:00.127099124Z INFO: 127.0.0.1:53894 - “GET /-/health HTTP/1.1” 200 OK
but startup is still delayed:

[root@u-030-s007 nomad]# docker compose ps
NAME                   IMAGE                                                      COMMAND                  SERVICE    CREATED          STATUS                    PORTS
nomad_oasis_app        gitlab-registry.mpcdf.mpg.de/nomad-lab/nomad-fair:latest   "python -m nomad.cli…"   app        51 minutes ago   Up 50 minutes (healthy)   8000/tcp, 9000/tcp
nomad_oasis_elastic    docker.elastic.co/elasticsearch/elasticsearch:7.17.1       "/bin/tini -- /usr/l…"   elastic    51 minutes ago   Up 51 minutes (healthy)   9200/tcp, 9300/tcp
nomad_oasis_mongo      mongo:5.0.6                                                "docker-entrypoint.s…"   mongo      51 minutes ago   Up 51 minutes (healthy)   27017/tcp
nomad_oasis_north      gitlab-registry.mpcdf.mpg.de/nomad-lab/nomad-fair:latest   "python -m nomad.cli…"   north      51 minutes ago   Up 51 minutes (healthy)   8000/tcp, 9000/tcp
nomad_oasis_proxy      nginx:latest                                               "/docker-entrypoint.…"   proxy      51 minutes ago   Up 35 minutes             0.0.0.0:80->80/tcp, 0.0.0.0:443->443/tcp
nomad_oasis_rabbitmq   rabbitmq:3.11.5                                            "docker-entrypoint.s…"   rabbitmq   51 minutes ago   Up 51 minutes (healthy)   4369/tcp, 5671-5672/tcp, 15691-15692/tcp, 25672/tcp
nomad_oasis_worker     gitlab-registry.mpcdf.mpg.de/nomad-lab/nomad-fair:latest   "python -m celery -A…"   worker     51 minutes ago   Up 50 minutes             8000/tcp, 9000/tcp

The proxy starts immediately after the app is healty, and there have never been any issues with the proxy.
I fired it up at 2024-01-24T15:13:39, so the first output of app is now delayed by ~15 minutes
I will try the other proposals immediately.

Update: tried the last advise and it is now effectively stuck at

[root@u-030-s007 app]# docker run --rm gitlab-registry.mpcdf.mpg.de/nomad-lab/nomad-fair:latest 
python -v -m nomad.cli admin run app --with-gui | tee app.log
...(~5000 lines)
import 'optimade.server.exception_handlers' # <_frozen_importlib_external.SourceFileLoader object at 0x7f773bb3fdf0>
# /usr/local/lib/python3.9/site-packages/optimade/server/__pycache__/middleware.cpython-39.pyc matches /usr/local/lib/python3.9/site-packages/optimade/server/middleware.py
# code object from '/usr/local/lib/python3.9/site-packages/optimade/server/__pycache__/middleware.cpython-39.pyc'
import 'optimade.server.middleware' # <_frozen_importlib_external.SourceFileLoader object at 0x7f773afcb100>
# /usr/local/lib/python3.9/site-packages/optimade/server/routers/__pycache__/info.cpython-39.pyc matches /usr/local/lib/python3.9/site-packages/optimade/server/routers/info.py
# code object from '/usr/local/lib/python3.9/site-packages/optimade/server/routers/__pycache__/info.cpython-39.pyc'
import 'optimade.server.routers.info' # <_frozen_importlib_external.SourceFileLoader object at 0x7f773a66cee0>
# /usr/local/lib/python3.9/site-packages/optimade/server/routers/__pycache__/landing.cpython-39.pyc matches /usr/local/lib/python3.9/site-packages/optimade/server/routers/landing.py
# code object from '/usr/local/lib/python3.9/site-packages/optimade/server/routers/__pycache__/landing.cpython-39.pyc'
import 'optimade.server.routers.landing' # <_frozen_importlib_external.SourceFileLoader object at 0x7f773aacb3a0>
# /usr/local/lib/python3.9/site-packages/optimade/server/routers/__pycache__/versions.cpython-39.pyc matches /usr/local/lib/python3.9/site-packages/optimade/server/routers/versions.py
# code object from '/usr/local/lib/python3.9/site-packages/optimade/server/routers/__pycache__/versions.cpython-39.pyc'
import 'optimade.server.routers.versions' # <_frozen_importlib_external.SourceFileLoader object at 0x7f773aaf0580>

with no change for several minutes and no cpu load.

Edit 2:
After that, it was briefly frozen (~30 seconds) at

import 'toposort' # <_frozen_importlib_external.SourceFileLoader object at 0x7f77331d59d0>

then it continued to load nomad.metainfo.nexus, after which it crashed as predicted due to the missing database.

Edit3:
Behaviour is reproducible (at least on this machine), again stuck at import 'optimade.server.routers.versions'

Edit4:
This behaviour is also reprodicible on two other Rocky 9.3 servers, one running Docker 24.0.7 and the other one running Docker 25.0.0 like the offender.
It is however not reproducible on my laptop running OpenSUSE 15.5 and docker 24.0.7. In the latter case, the whole command finishes in maybe one minute.

mscheidgen · January 25, 2024, 12:24pm

We can try to pin-point this further. If it stops in the middle of the optimade imports, let’s disable the optimade API (its optional and special interest anyways). The current NOMAD does not have an config option to disable it. I created a branch with said config options to disable optional apis.

docker run -e NOMAD_SERVICES_OPTIMADE_ENABLED=false --rm gitlab-registry.mpcdf.mpg.de/nomad-lab/nomad-fair:1866-config-option-to-disable-optimade-api python -v -m nomad.cli admin run app --with-gui

Node the tag references my branch and the env var disabling one of the new config option.

Currently, I have no idea why it stops at this particular place. I had a brief look at the respective code, but these is nothing obviously suspicious in there (no network or no unusual fs calls, etc.). I tried this on a few servers we have here, but could not reproduce the particular behaviour.

SBehnle · January 25, 2024, 6:41pm

Thanks for the new image and sorry for the late reply (been busy in the server room the whole afternoon).
The image without optimade loads without issues within 1min22sec. The most time consuming step was loading toposort, but during that step, CPU usage was at 100%. In essence, it loads flawlessly.
I have no idea what the optimade package does during import, but could it be that it performs syscalls which are particularly slow/blocking in a virtualized environment?

SBehnle · January 30, 2024, 10:59am

It’s kind of embarrassing, but i solved the problem.
The root cause was a firewall rule which was intended to block access form outside the university IP range to NOMAD. Unfortunately, it also blocked traffic initiated by the virtual docker interface with 172.17.0.0/16 address via eth0 to the outer world . Without this offending firewall rule, startup is complete within one minute (with the release image).
This issue went unnoticed as the DOCKER-USER firewall chain did not produce logging of the dropped packages.
Sorry for the confusion.
To everyone else experiencing similar issues: check your firewall logs, check your firewall rules and ensure that all rules produce logging during deployment.

mscheidgen · January 30, 2024, 1:06pm

Glad it worked out. So I assume that OPTIMADE tried to download it’s list of providers or something from a remote repository during import.

ml-evs · January 30, 2024, 1:38pm

I’ve just stumbled across this (from “namesearching” this forum for OPTIMADE…) – as the developer of the optimade package, sorry that you ran into this edge case! It looks like we are making the classic error of running requests.get without a timeout in one particular place (the provider list download, as Markus suggested). I’ve raised an issue (Missing timeout for `requests.get` in `get_providers` · Issue #1954 · Materials-Consortia/optimade-python-tools · GitHub) and will fix and backport this to v1.x and v0.25.x of the optimade package (not sure which one you are using internally atm Markus). Apologies for being part of this hard to debug issue!

mscheidgen · January 31, 2024, 9:41am

Thanks for chiming in @ml-evs. I guess this gives us a good reason to update our OPT dependency ;-).

SBehnle · January 31, 2024, 10:08am

Thanks for the background information. After all, it was my firewall that caused the issue. I’m rather impressed that the app finally came up at all.

mscheidgen · January 31, 2024, 12:55pm

I guess this was a requests default timeout of ~15min.