Unable to start nomad_oasis_app: worker sent SIGKILL, perhaps out of memory

I get the error "Worker was sent SIGKILL! Perhaps out of memory?" when starting the app container.
I tried upgrading the memory to 16 GB, but I still get the same error. My machine has 4 CPU cores.
I don't think memory is the problem, because on a different VM instance I was able to run NOMAD with only 8 GB.

Can anyone tell whether this is really an insufficient-RAM issue or something else? What can I do to debug this? Any help on this matter is greatly appreciated.

Some relevant docs and system status are shown below:

[ec-sherinsu@u-geovis01 nomad-oasis]$ docker logs nomad_oasis_app
[2025-01-14 12:55:42 +0000] [7] [INFO] Starting gunicorn 21.2.0
[2025-01-14 12:55:42 +0000] [7] [INFO] Listening at: http://0.0.0.0:8000 (7)
[2025-01-14 12:55:42 +0000] [7] [INFO] Using worker: uvicorn.workers.UvicornWorker
[2025-01-14 12:55:42 +0000] [23] [INFO] Booting worker with pid: 23
[2025-01-14 12:55:42 +0000] [24] [INFO] Booting worker with pid: 24
[2025-01-14 12:55:42 +0000] [25] [INFO] Booting worker with pid: 25
[2025-01-14 12:55:43 +0000] [26] [INFO] Booting worker with pid: 26
[2025-01-14 13:05:43 +0000] [7] [CRITICAL] WORKER TIMEOUT (pid:23)
[2025-01-14 13:05:43 +0000] [7] [CRITICAL] WORKER TIMEOUT (pid:24)
[2025-01-14 13:05:43 +0000] [7] [CRITICAL] WORKER TIMEOUT (pid:25)
[2025-01-14 13:05:43 +0000] [7] [CRITICAL] WORKER TIMEOUT (pid:26)
[2025-01-14 13:05:44 +0000] [7] [ERROR] Worker (pid:26) was sent SIGKILL! Perhaps out of memory?
[2025-01-14 13:05:44 +0000] [7] [ERROR] Worker (pid:24) was sent SIGKILL! Perhaps out of memory?
[2025-01-14 13:05:44 +0000] [222] [INFO] Booting worker with pid: 222
[2025-01-14 13:05:44 +0000] [7] [ERROR] Worker (pid:23) was sent SIGKILL! Perhaps out of memory?
[2025-01-14 13:05:44 +0000] [7] [ERROR] Worker (pid:25) was sent SIGKILL! Perhaps out of memory?
[2025-01-14 13:05:44 +0000] [223] [INFO] Booting worker with pid: 223
[2025-01-14 13:05:44 +0000] [224] [INFO] Booting worker with pid: 224
[2025-01-14 13:05:44 +0000] [225] [INFO] Booting worker with pid: 225
[2025-01-14 13:19:09 +0000] [7] [INFO] Starting gunicorn 21.2.0
[2025-01-14 13:19:09 +0000] [7] [INFO] Listening at: http://0.0.0.0:8000 (7)
[2025-01-14 13:19:09 +0000] [7] [INFO] Using worker: uvicorn.workers.UvicornWorker
[2025-01-14 13:19:09 +0000] [18] [INFO] Booting worker with pid: 18
[ec-sherinsu@u-geovis01 nomad-oasis]$ pwd
/home/ec-sherinsu/nomad-oasis
[ec-sherinsu@u-geovis01 nomad-oasis]$ df -h
Filesystem                 Size  Used Avail Use% Mounted on
devtmpfs                   3.8G     0  3.8G   0% /dev
tmpfs                      3.8G     0  3.8G   0% /dev/shm
tmpfs                      3.8G  1.6M  3.8G   1% /run
tmpfs                      3.8G     0  3.8G   0% /sys/fs/cgroup
/dev/mapper/internvg-root  8.0G  446M  7.6G   6% /
/dev/mapper/internvg-usr    12G  8.7G  3.4G  73% /usr
/dev/mapper/internvg-tmp   4.0G   61M  4.0G   2% /tmp
/dev/mapper/internvg-opt   4.0G   72M  4.0G   2% /opt
/dev/sda1                  507M  391M  117M  78% /boot
/dev/mapper/internvg-var   8.0G  2.8G  5.3G  35% /var
tmpfs                      1.6G     0  1.6G   0% /run/user/2100510
[ec-sherinsu@u-geovis01 nomad-oasis]$ df -h /home
Filesystem                 Size  Used Avail Use% Mounted on
/dev/mapper/internvg-root  8.0G  446M  7.6G   6% /
[ec-sherinsu@u-geovis01 nomad-oasis]$ 

Console output from a different attempt

[ec-sherinsu@u-geovis01 nomad-oasis]$ sudo docker compose up -d
WARN[0000] The "PWD" variable is not set. Defaulting to a blank string. 
WARN[0000] The "PWD" variable is not set. Defaulting to a blank string. 
[+] Running 6/6
 ✔ Container nomad_oasis_elastic   Healthy                                                                                                                                 0.5s 
 ✔ Container nomad_oasis_rabbitmq  Healthy                                                                                                                                 0.5s 
 ✔ Container nomad_oasis_mongo     Healthy                                                                                                                                 0.5s 
 ✔ Container nomad_oasis_north     Healthy                                                                                                                                 1.0s 
 ✔ Container nomad_oasis_worker    Running                                                                                                                                 0.0s 
 ✘ Container nomad_oasis_app       Error                                                                                                                                   1.0s 
dependency failed to start: container nomad_oasis_app is unhealthy
[ec-sherinsu@u-geovis01 nomad-oasis]$ docker logs nomad_oasis_app
[2025-01-14 13:48:16 +0000] [7] [INFO] Starting gunicorn 21.2.0
[2025-01-14 13:48:16 +0000] [7] [INFO] Listening at: http://0.0.0.0:8000 (7)
[2025-01-14 13:48:16 +0000] [7] [INFO] Using worker: uvicorn.workers.UvicornWorker
[2025-01-14 13:48:16 +0000] [24] [INFO] Booting worker with pid: 24
[2025-01-14 13:48:17 +0000] [25] [INFO] Booting worker with pid: 25
[2025-01-14 13:48:17 +0000] [26] [INFO] Booting worker with pid: 26
[2025-01-14 13:48:17 +0000] [27] [INFO] Booting worker with pid: 27
[2025-01-14 13:58:17 +0000] [7] [CRITICAL] WORKER TIMEOUT (pid:24)
[2025-01-14 13:58:17 +0000] [7] [CRITICAL] WORKER TIMEOUT (pid:25)
[2025-01-14 13:58:17 +0000] [7] [CRITICAL] WORKER TIMEOUT (pid:26)
[2025-01-14 13:58:17 +0000] [7] [CRITICAL] WORKER TIMEOUT (pid:27)
[2025-01-14 13:58:18 +0000] [7] [ERROR] Worker (pid:26) was sent SIGKILL! Perhaps out of memory?
[2025-01-14 13:58:18 +0000] [7] [ERROR] Worker (pid:25) was sent SIGKILL! Perhaps out of memory?
[2025-01-14 13:58:18 +0000] [221] [INFO] Booting worker with pid: 221
[2025-01-14 13:58:18 +0000] [7] [ERROR] Worker (pid:24) was sent SIGKILL! Perhaps out of memory?
[2025-01-14 13:58:18 +0000] [7] [ERROR] Worker (pid:27) was sent SIGKILL! Perhaps out of memory?
[2025-01-14 13:58:18 +0000] [222] [INFO] Booting worker with pid: 222
[2025-01-14 13:58:18 +0000] [223] [INFO] Booting worker with pid: 223
[2025-01-14 13:58:18 +0000] [224] [INFO] Booting worker with pid: 224
[ec-sherinsu@u-geovis01 nomad-oasis]$ 

Hello!

Can you share which OS was being used locally, and also on the VM?

And which image was used in the docker-compose.yml file to run the nomad worker?

Thanks for commenting on this!

I am using RHEL.

[ec-sherinsu@u-geovis01 nomad-oasis]$ cat /etc/os-release
NAME="Red Hat Enterprise Linux"
VERSION="8.10 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.10"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.10 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8"
BUG_REPORT_URL="https://issues.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.10
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.10"

and I am using the latest image in my docker-compose.yml file:

image: gitlab-registry.mpcdf.mpg.de/nomad-lab/nomad-fair:latest

A few more points:

  • I am able to run the Docker-based NOMAD in my local Ubuntu-based VMs, following the same installation/usage steps as outlined on the nomad-oasis website.
  • I have not modified the yaml or config files much; I only changed the port mapping from 80:80 to 1234:80. The same change worked for me in the Ubuntu VMs.

Thanks for sharing the info!

Since the images are identical, it's hard to say why you're running into memory issues on one machine but not the other.
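One way to narrow it down is to check whether the kernel's OOM killer actually fired, as opposed to the workers simply hanging until gunicorn's timeout expired (your log shows exactly 10 minutes between worker boot and WORKER TIMEOUT, which is consistent with a timeout rather than memory pressure). A rough sketch, assuming you have sudo access on the host and the Docker daemon is running:

```shell
# Did Docker itself record an OOM kill for the app container?
sudo docker inspect nomad_oasis_app --format '{{.State.OOMKilled}}'

# Any kernel OOM-killer messages on the host?
sudo dmesg -T | grep -iE 'out of memory|oom-kill|killed process'

# Watch per-container memory usage while the app starts
sudo docker stats --no-stream
```

If `OOMKilled` is `false` and dmesg shows nothing, the SIGKILL most likely came from gunicorn itself killing workers that failed to become responsive within the timeout; gunicorn's "Perhaps out of memory?" message is only a guess, and the real cause could be something else blocking worker startup (e.g. connectivity to the other services).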

One possible solution would be for you to have your own custom oasis image. This can be done very easily by following this template: GitHub - FAIRmat-NFDI/nomad-distro-template: An example repository for creating a nomad distribution with custom plugins.

After that, you can reduce the number of plugins installed by default by removing them from the pyproject.toml file. This can help reduce the memory requirements, especially since some of the plugins installed by default in the latest image are fairly large.
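As a rough illustration of what that edit looks like: in a distribution created from the template, the plugins appear as entries in the dependency list of pyproject.toml, and trimming the list means deleting lines. The package names below are purely hypothetical placeholders, not the template's actual default list:

```toml
[project]
name = "my-nomad-distro"
dependencies = [
    "nomad-lab[parsing, infrastructure]",
    # Plugins you keep stay listed; delete the lines for ones
    # you don't need (names below are illustrative only):
    # "some-default-plugin-a",
    # "some-default-plugin-b",
]
```

After removing entries, rebuild the image so the slimmer dependency set takes effect.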