First a bit of context:
I run dynamic workflows that grow over time: dedicated fireworks build new sub-workflows and append them, via a task that returns FWAction(additions=new_wf).
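To illustrate the growth pattern, here is a minimal FireWorks-free sketch (hypothetical names; in the real code the builder task returns FWAction(additions=new_wf) and FireWorks appends the new fireworks to the workflow):

```python
# Minimal, FireWorks-free sketch of the dynamic-growth pattern.
# Hypothetical names; the real builder task returns FWAction(additions=new_wf).

def build_next_batch(generation, batch_size=3):
    """Stand-in for the task that builds a new sub-workflow."""
    if generation >= 4:          # real code decides from results, not a counter
        return None              # nothing more to append -> workflow stops growing
    return [f"fw-g{generation}-{i}" for i in range(batch_size)]

def run_dynamic_workflow():
    """Driver loop: each completed builder appends the next batch."""
    all_fireworks = []
    generation = 0
    while True:
        batch = build_next_batch(generation)
        if batch is None:
            break
        all_fireworks.extend(batch)   # in FireWorks: FWAction(additions=new_wf)
        generation += 1
    return all_fireworks

fws = run_dynamic_workflow()
print(len(fws))  # -> 12 fireworks appended over 4 generations
```

With tens of thousands of fireworks per initial workflow, this appending happens continuously while the workflow runs.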
In this way I produce a large number of fireworks (several tens of thousands) for each initial workflow.
I schedule these fireworks on a large number of cores (several hundred) on an HPC cluster.
These tasks execute quite quickly.
At the beginning (with a freshly emptied MongoDB), everything runs fine. But after a while, as the number of fireworks gets higher, the system seems unable to properly detect the completion of the fireworks: many tasks stay in the RUNNING state and therefore prevent the next WAITING tasks from moving to the READY state, freezing workflow execution.
I never experienced this in my previous runs, perhaps because I was using far fewer tasks that took longer to complete, or because I was generating fewer fireworks, or because I was deleting completed workflows before the number of fireworks in the DB became high…
I have noticed that as the number of fireworks in MongoDB grows, the server-side computation needed for each operation grows very quickly…
I tried to overcome the problem with a process running:
until lpad detect_lostruns --refresh ; do echo "retry"; done
(I used until because it often failed with messages like 2018-01-08 15:37:27,876 INFO fw_id 38807 locked. Can't refresh! or pymongo.errors.CursorNotFound: Cursor not found, cursor id: 22434284420).
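As a workaround for these intermittent lock/cursor failures, a bounded retry with exponential backoff might be cleaner than a bare until loop. A pure-Python sketch (hypothetical helper; you would wrap lpad.detect_lostruns(...) or a subprocess call to lpad detect_lostruns --refresh in it):

```python
import time

def retry_with_backoff(fn, attempts=5, base_delay=1.0):
    """Call fn(); on failure, wait base_delay * 2**n and retry.

    Sketch of a wrapper for flaky LaunchPad operations such as
    detect_lostruns, which can fail with 'fw_id ... locked' or
    pymongo CursorNotFound errors."""
    for n in range(attempts):
        try:
            return fn()
        except Exception as exc:
            if n == attempts - 1:
                raise
            delay = base_delay * 2 ** n
            print(f"attempt {n + 1} failed ({exc!r}); retrying in {delay:.2f}s")
            time.sleep(delay)

# Demo with a stand-in for the flaky call: fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("fw_id 38807 locked. Can't refresh!")
    return "ok"

print(retry_with_backoff(flaky, base_delay=0.01))  # -> ok after 2 retries
```

The backoff gives the lock holder time to release between attempts, instead of hammering the server in a tight loop.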
The same problem occurs when I try to delete completed workflows in an attempt to lower the MongoDB workload: it is very (!) slow.
So I wonder about the scalability of FireWorks, and whether there are guidelines I should follow to achieve good performance and scalability (we plan to increase the load in the near future: more fireworks, more compute nodes).
Furthermore, is there a way for FireWorks to detect lost runs automatically? These lost runs stay lost for several days, while the jobs should be marked as FIZZLED since they haven't pinged the server every half hour:
PING_TIME_SECS = 1800  # while running a job, how often to ping back the server that we're still alive
RUN_EXPIRATION_SECS = PING_TIME_SECS * 2  # mark job as FIZZLED if not pinged in this time
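(If the defaults are too slow for your needs, my understanding is that these values can be overridden in FW_config.yaml; the numbers below are only examples, and setting them too low risks fizzling healthy jobs between pings:)

```yaml
# FW_config.yaml -- override ping/expiration defaults (example values, not a recommendation)
PING_TIME_SECS: 600          # running jobs ping the server every 10 min
RUN_EXPIRATION_SECS: 1200    # treat a run as lost after 2 missed pings
```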
Do you have any advice to overcome this problem?
I use FireWorks v1.4.1 with Python 3, and MongoDB v3.4.10.
PS: for the server that hosts MongoDB, we use a VM with 4 GB RAM and 4 cores (increased from a previous configuration of 2 cores and 2 GB RAM, but that didn't solve the problem, just delayed it a bit).