Dear NOMAD developers,
I am currently trying to deploy a NOMAD Oasis for our institute and am facing issues with the app container. The container takes up to 15 minutes to finally come up, for no apparent reason:
[root@u-030-s007 nomad]# docker compose logs app
nomad_oasis_app | [2024-01-24 12:07:44 +0100] [7] [INFO] Starting gunicorn 21.2.0
nomad_oasis_app | [2024-01-24 12:07:44 +0100] [7] [INFO] Listening at: http://0.0.0.0:8000 (7)
nomad_oasis_app | [2024-01-24 12:07:44 +0100] [7] [INFO] Using worker: uvicorn.workers.UvicornWorker
nomad_oasis_app | [2024-01-24 12:07:44 +0100] [20] [INFO] Booting worker with pid: 20
nomad_oasis_app | [2024-01-24 12:07:45 +0100] [21] [INFO] Booting worker with pid: 21
nomad_oasis_app | [2024-01-24 12:07:45 +0100] [22] [INFO] Booting worker with pid: 22
nomad_oasis_app | [2024-01-24 12:07:45 +0100] [23] [INFO] Booting worker with pid: 23
nomad_oasis_app | [2024-01-24 12:21:03 +0100] [23] [INFO] Started server process [23]
nomad_oasis_app | [2024-01-24 12:21:03 +0100] [23] [INFO] Waiting for application startup.
nomad_oasis_app | [2024-01-24 12:21:03 +0100] [21] [INFO] Started server process [21]
nomad_oasis_app | [2024-01-24 12:21:03 +0100] [21] [INFO] Waiting for application startup.
nomad_oasis_app | [2024-01-24 12:21:03 +0100] [22] [INFO] Started server process [22]
nomad_oasis_app | [2024-01-24 12:21:03 +0100] [22] [INFO] Waiting for application startup.
nomad_oasis_app | [2024-01-24 12:21:03 +0100] [20] [INFO] Started server process [20]
nomad_oasis_app | [2024-01-24 12:21:03 +0100] [20] [INFO] Waiting for application startup.
nomad_oasis_app | [2024-01-24 12:22:13 +0100] [21] [INFO] Application startup complete.
nomad_oasis_app | [2024-01-24 12:22:14 +0100] [20] [INFO] Application startup complete.
nomad_oasis_app | [2024-01-24 12:22:16 +0100] [22] [INFO] Application startup complete.
nomad_oasis_app | [2024-01-24 12:22:18 +0100] [23] [INFO] Application startup complete.
During the first minute of startup, docker top shows considerable load, but that rapidly drops and the container then just seems to idle (no CPU load, no disk I/O, no network I/O, plenty of RAM available).
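For reference, these are roughly the commands I used to watch the container while it hangs (nothing NOMAD-specific, just standard Docker tooling):
docker stats nomad_oasis_app   # live CPU / memory / network / block-I/O counters
docker top nomad_oasis_app     # processes running inside the container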
While trying to track down the issue, I attached strace to the Python process and observed that it keeps issuing file system syscalls:
…
pselect6(4, [3], , , {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
newfstatat(6, "", {st_mode=S_IFREG|600, st_size=0, ...}, AT_EMPTY_PATH) = 0
newfstatat(7, "", {st_mode=S_IFREG|600, st_size=0, ...}, AT_EMPTY_PATH) = 0
newfstatat(8, "", {st_mode=S_IFREG|600, st_size=0, ...}, AT_EMPTY_PATH) = 0
newfstatat(9, "", {st_mode=S_IFREG|600, st_size=0, ...}, AT_EMPTY_PATH) = 0
…
The moment the application recovers and finally starts up, these syscalls change to:
…
pselect6(4, [3], , , {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
newfstatat(6, "", {st_mode=S_IFREG|001, st_size=0, ...}, AT_EMPTY_PATH) = 0
newfstatat(7, "", {st_mode=S_IFREG|001, st_size=0, ...}, AT_EMPTY_PATH) = 0
newfstatat(8, "", {st_mode=S_IFREG|001, st_size=0, ...}, AT_EMPTY_PATH) = 0
newfstatat(9, "", {st_mode=S_IFREG|001, st_size=0, ...}, AT_EMPTY_PATH) = 0
…
accompanied by an increase in CPU usage. After that, the container successfully starts and NOMAD becomes usable.
During the whole 14-minute startup delay, the machine is effectively idling (no CPU load, no disk or network I/O).
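In case it matters for diagnosis, strace was attached from the host roughly like this (the PID below is a placeholder; the actual worker PIDs were taken from the process list):
docker top nomad_oasis_app      # look up the gunicorn worker PIDs on the host
strace -f -tt -p <worker-pid>   # attach to one of them, follow forks, print timestamps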
The machine NOMAD is running on is a VM with four cores and 16 GB RAM, hosted on fairly recent hardware.
The NOMAD volumes are located on a large NFS mount. Might this cause the problem?
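In case it helps: this is a rough sketch of how I would check the NFS side, i.e. the mount options and a crude metadata-latency test on the share the NOMAD volumes live on (the path is from our setup and only illustrative):
nfsstat -m          # NFS mount options (vers, rsize, actimeo, ...)
mount | grep nfs    # same information via mount
time ls -lR /bwsfs/NOMAD_LISAPLUS_PRODUCTION/nomad_volumes/fs > /dev/null   # crude stat()/readdir timing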
The underlying docker engine is docker 25.0 on Rocky Linux 9.3:
[root@u-030-s007 fs]# docker info
Client: Docker Engine - Community
Version: 25.0.0
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.12.1
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.24.1
Path: /usr/libexec/docker/cli-plugins/docker-compose
Server:
Containers: 7
Running: 7
Paused: 0
Stopped: 0
Images: 12
Server Version: 25.0.0
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
Swarm: inactive
Runtimes: runc io.containerd.runc.v2
Default Runtime: runc
Init Binary: docker-init
containerd version: a1496014c916f9e62104b33d1bb5bd03b0858e59
runc version: v1.1.11-0-g4bccb38
init version: de40ad0
Security Options:
seccomp
Profile: builtin
cgroupns
Kernel Version: 5.14.0-362.13.1.el9_3.x86_64
Operating System: Rocky Linux 9.3 (Blue Onyx)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 15.56GiB
Name: u-030-s007
ID: 17341a6c-20f5-4a76-a0fd-8cf7ecddaf09
Docker Root Dir: /dockerdata/volumes
Debug Mode: false
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
SELinux is enabled, but there are no SELinux alerts whatsoever.
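For completeness, this is roughly how I looked for denials (audit and setroubleshoot tooling assumed to be installed on the host):
ausearch -m AVC,USER_AVC -ts recent    # recent SELinux denials from the audit log
sealert -a /var/log/audit/audit.log    # human-readable summary (setroubleshoot-server)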
Simply disabling the nexus parser, as proposed in the thread "Issue with running app container (NOMAD oasis)", is not an option for us.
Is there any way to further narrow down and eliminate the issue? What am I doing wrong?
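Would it make sense to pull a Python-level stack trace from one of the stuck gunicorn workers, e.g. with py-spy? Something along these lines, where py-spy is my assumption and not part of the NOMAD image (and it may need extra ptrace privileges depending on the Docker setup):
docker compose exec app pip install py-spy      # assumes pip is available in the image
docker compose exec app py-spy dump --pid 20    # 20 = one of the worker PIDs from the log above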
By the way, some containers feature the environment variable NOMAD_FS_EXTERNAL_WORKING_DIRECTORY, typically set to $PWD, for which I was unable to find any documentation.
Best
Stefan
These are my (redacted) docker-compose.yaml and nomad.yaml (first post, unable to create attachments):
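# docker-compose.yaml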
version: "3"
services:
# no internal keycloak!
# keycloak user management
#keycloak:
# restart: unless-stopped
# image: jboss/keycloak:16.1.1
# container_name: nomad_oasis_keycloak
# environment:
# - PROXY_ADDRESS_FORWARDING=true
# - KEYCLOAK_USER=admin
# - KEYCLOAK_PASSWORD=password
# - KEYCLOAK_FRONTEND_URL=http://localhost/keycloak/auth
# - KEYCLOAK_IMPORT="/tmp/nomad-realm.json"
# command:
# - "-Dkeycloak.import=/tmp/nomad-realm.json -Dkeycloak.migration.strategy=IGNORE_EXISTING -Dkeycloak.profile.feature.upload_scripts=enabled"
# - "-Dkeycloak.import=/tmp/nomad-realm.json -Dkeycloak.migration.strategy=IGNORE_EXISTING"
# volumes:
# - keycloak:/opt/jboss/keycloak/standalone/data
#- ./configs/nomad-realm.json:/tmp/nomad-realm.json
# healthcheck:
#test:
# - "CMD"
# - "curl"
# - "--fail"
# - "--silent"
# - "http://127.0.0.1:9990/health/live"
#interval: 10s
#timeout: 10s
#retries: 30
#start_period: 30s
# broker for celery
rabbitmq:
restart: unless-stopped
image: rabbitmq:3.11.5 # replacable by "latest"??
container_name: nomad_oasis_rabbitmq
environment:
- TZ=Europe/Berlin
- RABBITMQ_ERLANG_COOKIE=<snip> # SECURITY!!! CHANGE!!!
- RABBITMQ_DEFAULT_USER=rabbitmq
- RABBITMQ_DEFAULT_PASS=<snip> # SECURITY!!! CHANGE?
- RABBITMQ_DEFAULT_VHOST=/
volumes:
- rabbitmq:/var/lib/rabbitmq
healthcheck:
test: ["CMD", "rabbitmq-diagnostics", "--silent", "--quiet", "ping"]
interval: 10s
timeout: 10s
retries: 30
start_period: 10s
#networks:
#- nomad_oasis_network
# the search engine
elastic:
restart: unless-stopped
image: docker.elastic.co/elasticsearch/elasticsearch:7.17.1 # replacable by "latest"??
container_name: nomad_oasis_elastic
environment:
- TZ=Europe/Berlin
- ES_JAVA_OPTS=-Xms512m -Xmx512m # change these if elastic demands more memory or crashes
- discovery.type=single-node
volumes:
- elastic:/usr/share/elasticsearch/data
healthcheck:
test:
- "CMD"
- "curl"
- "--fail"
- "--silent"
- "http://elastic:9200/_cat/health"
interval: 10s
timeout: 10s
retries: 30
start_period: 60s
#networks:
#- nomad_oasis_network
# the user data db
mongo:
restart: unless-stopped
image: mongo:5.0.6 # replacable by "latest"??
container_name: nomad_oasis_mongo
environment:
- TZ=Europe/Berlin
- MONGO_DATA_DIR=/data/db
- MONGO_LOG_DIR=/dev/null
volumes:
- mongo:/data/db # this is a docker volume; can this be pointed to a bwSFS mount?
- /bwsfs/NOMAD_LISAPLUS_PRODUCTION/nomad_volumes/mongo:/backup # points to bind mount! How often does the backup run? Is it sufficient to back up to bwSFS? How to restore after complete disaster?
command: mongod --logpath=/dev/null # --quiet
healthcheck:
test:
- "CMD"
- "mongo"
- "mongo:27017/test"
- "--quiet"
- "--eval"
- "'db.runCommand({ping:1}).ok'"
interval: 10s
timeout: 10s
retries: 30
start_period: 10s
#networks:
#- nomad_oasis_network
# nomad worker (processing)
worker:
restart: unless-stopped
image: gitlab-registry.mpcdf.mpg.de/nomad-lab/nomad-fair:latest #v1.2.1 # maybe fix to tested version in production!
container_name: nomad_oasis_worker
environment:
TZ: Europe/Berlin
NOMAD_SERVICE: nomad_oasis_worker
NOMAD_RABBITMQ_HOST: rabbitmq
NOMAD_ELASTIC_HOST: elastic
NOMAD_MONGO_HOST: mongo
# NOMAD_LOGSTASH_HOST: logtransfer # copied from recent template
OMP_NUM_THREADS: 2 # adjust depending on VM size
depends_on:
rabbitmq:
condition: service_healthy
elastic:
condition: service_healthy
mongo:
condition: service_healthy
volumes:
- ./configs/nomad.yaml:/app/nomad.yaml
- /bwsfs/NOMAD_LISAPLUS_PRODUCTION/nomad_volumes/fs:/app/.volumes/fs # what is stored here???
command: python -m celery -A nomad.processing worker -l info -Q celery
extra_hosts:
- keycloak.my-uni.de:1.2.3.4
#networks:
#- nomad_oasis_network
# nomad app (api + proxy)
app:
restart: unless-stopped
image: gitlab-registry.mpcdf.mpg.de/nomad-lab/nomad-fair:latest #v1.2.1 # same image as worker, fix to the same version
container_name: nomad_oasis_app
environment:
TZ: Europe/Berlin
NOMAD_SERVICE: nomad_oasis_app
NOMAD_SERVICES_API_PORT: 80 # Do we need API access? Does it have to be port 80 or can we move to 443? How does API access work in detail?
# NOMAD_FS_EXTERNAL_WORKING_DIRECTORY: "$PWD"
NOMAD_FS_EXTERNAL_WORKING_DIRECTORY: "/bwsfs/NOMAD_LISAPLUS_PRODUCTION/"
NOMAD_RABBITMQ_HOST: rabbitmq
NOMAD_ELASTIC_HOST: elastic
NOMAD_MONGO_HOST: mongo
# NOMAD_LOGSTASH_HOST: logtransfer # copied from recent template
NOMAD_NORTH_HUB_HOST: north # remove comment?
GUNICORN_CMD_ARGS: "--access-logfile '-' --error-logfile '-' --log-level 'debug' "
depends_on:
rabbitmq:
condition: service_healthy
elastic:
condition: service_healthy
mongo:
condition: service_healthy
north:
condition: service_started
# keycloak:
# condition: service_started
volumes:
- ./configs/nomad.yaml:/app/nomad.yaml
- /bwsfs/NOMAD_LISAPLUS_PRODUCTION/nomad_volumes/fs:/app/.volumes/fs
command: ./run.sh
healthcheck:
test:
- "CMD"
- "curl"
- "--fail"
- "--silent"
- "http://localhost:8000/-/health"
interval: 60s # 10s
timeout: 10s
retries: 30
start_period: 60s # 10 -> 60 per https://matsci.org/t/issue-with-running-app-container-nomad-oasis/51306
extra_hosts:
- keycloak.my-uni.de:1.2.3.4
#networks:
#- nomad_oasis_network
# nomad remote tools hub (JupyterHUB, e.g. for AI Toolkit)
north:
restart: unless-stopped
image: gitlab-registry.mpcdf.mpg.de/nomad-lab/nomad-fair:latest #v1.2.1 # Same image as worker and app, fix to production version; test upgrade!!!
container_name: nomad_oasis_north
environment:
TZ: Europe/Berlin
NOMAD_SERVICE: nomad_oasis_north
NOMAD_NORTH_DOCKER_NETWORK: nomad_oasis_network
NOMAD_NORTH_HUB_CONNECT_IP: north
NOMAD_NORTH_HUB_IP: "0.0.0.0"
NOMAD_NORTH_HUB_HOST: north
NOMAD_SERVICES_API_HOST: app
NOMAD_FS_EXTERNAL_WORKING_DIRECTORY: "/bwsfs/NOMAD_LISAPLUS_PRODUCTION/" #"$PWD"
NOMAD_RABBITMQ_HOST: rabbitmq
NOMAD_ELASTIC_HOST: elastic
NOMAD_MONGO_HOST: mongo
# depends_on:
# keycloak:
# condition: service_started
# app:
# condition: service_started
volumes:
- ./configs/nomad.yaml:/app/nomad.yaml
- /bwsfs/NOMAD_LISAPLUS_PRODUCTION/nomad_volumes/fs:/app/.volumes/fs
- /var/run/docker.sock:/var/run/docker.sock
user: '1000:991' # replace by the output of "getent passwd | grep nomad" and getent group | grep docker";
command: python -m nomad.cli admin run hub
healthcheck:
test:
- "CMD"
- "curl"
- "--fail"
- "--silent"
- "http://localhost:8081/nomad-oasis/north/hub/health"
interval: 10s
timeout: 10s
retries: 30
start_period: 10s
extra_hosts:
- keycloak.my-uni.de:1.2.3.4
#networks:
#- nomad_oasis_network
# nomad logtransfer
# == DOES EXACTLY WHAT????? ==
# to enable the logtransfer service run "docker compose --profile with_logtransfer up"
logtransfer:
restart: unless-stopped
image: gitlab-registry.mpcdf.mpg.de/nomad-lab/nomad-fair:latest #v1.2.1
container_name: nomad_oasis_logtransfer
environment:
TZ: Europe/Berlin
NOMAD_SERVICE: nomad_oasis_logtransfer
NOMAD_ELASTIC_HOST: elastic
NOMAD_MONGO_HOST: mongo
depends_on:
elastic:
condition: service_healthy
mongo:
condition: service_healthy
volumes:
- ./configs/nomad.yaml:/app/nomad.yaml
- /bwsfs/NOMAD_LISAPLUS_PRODUCTION/nomad_volumes/fs:/app/.volumes/fs
command: python -m nomad.cli admin run logtransfer
profiles: ["with_logtransfer"]
#networks:
#- nomad_oasis_network
# nomad proxy (a reverse proxy for nomad)
proxy:
restart: unless-stopped
image: nginx:latest #1.13.9-alpine # upgradable? latest-alpine?
container_name: nomad_oasis_proxy
command: nginx -g 'daemon off;'
environment:
- TZ=Europe/Berlin
volumes:
- ./configs/nginx.conf:/etc/nginx/conf.d/default.conf:ro
- ./ssl/:/etc/ssl/:ro
depends_on:
# keycloak:
# condition: service_healthy
app:
condition: service_healthy #service_started
worker:
condition: service_started # TODO: service_healthy
north:
condition: service_healthy
ports:
- "80:80" # standard http, try to get rid of...
- "443:443" # standard https, try to handle all traffic via https
#networks:
#- nomad_oasis_network
volumes:
mongo:
name: "nomad_oasis_mongo"
elastic:
name: "nomad_oasis_elastic"
rabbitmq:
name: "nomad_oasis_rabbitmq"
# keycloak:
# name: "nomad_oasis_keycloak"
networks:
default:
name: nomad_oasis_network
#networks:
# nomad_oasis_network:
# driver: bridge
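
# nomad.yaml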
services:
# api_host: 'localhost'
api_host: 'nomad.my-uni.de'
api_port: 443
api_base_path: '/nomad-oasis'
api_timeout: 6000
https: True
https_upload: True
admin_user_id: <snip> # TODO replace
# aitoolkit_enabled: True
console_log_level: 10
upload_limit: 100000
oasis:
is_oasis: True
uses_central_user_management: False
# recreate key!!
north:
enabled: True
jupyterhub_crypt_key: <snip>
keycloak:
server_url: 'https://keycloak.my-uni.de/'
public_server_url: 'https://keycloak.my-uni.de/'
realm_name: <snip>
username: <snip>
password: <snip>
client_id: <snip>
client_secret: <snip>
# redirect_uri: 'https://nomad.my-uni.de/nomad-oasis/gui/*'
# client_id: 'nomad_public'
meta:
deployment: 'oasis'
deployment_url: 'https://nomad.my-uni.de/api'
maintainer_email: <snip>
mongo:
db_name: nomad_oasis_v1
elastic:
entries_index: nomad_oasis_entries_v1
materials_index: nomad_oasis_materials_v1
# see https://matsci.org/t/issue-with-running-app-container-nomad-oasis/51306
#plugins:
# exclude:
# - parsers/nexus