How to speed up WF deletion when the DB holds a large number of tasks?

Hi,

I have a problem deleting completed workflows when the number of tasks is large:

$ lpad report -i months

Stats on fws

Stats for time-period 2020-01

ARCHIVED : 0
FIZZLED : 6
DEFUSED : 377
PAUSED : 0
WAITING : 5065
READY : 1752
RESERVED : 0
RUNNING : 0
COMPLETED : 115272

total : 122472
C/(C+F) : 100.0

I cannot remove a workflow; this command appears to be stuck:

lpad delete_wflows -i 53248

I have tried all the admin commands (tuneup, unlock, maintain) and have stopped the runs, but deleting a completed WF still takes a very long time.

Is there a way to speed it up, or another way to do it?

Do you think Fireworks scales well enough for my needs? Or am I doing something wrong?

(I have 40 compute nodes totalling ~1000 cores, and I use Fireworks to schedule all my jobs onto the nodes. This leads to a large number of tasks in the DB before I can delete completed workflows.)

Best regards,
David Michéa

By the way, I have timed the deletion of 4 COMPLETED WFs:

2020-01-31 14:38:25,452 INFO Finished deleting 4 WFs

real 142m31.631s
user 0m7.479s
sys 0m1.155s

So it is not completely stuck, but incredibly slow.

Hi @David_Michea,

I’ll let someone else address the slowness issue, but why do you want to delete workflows? We routinely keep databases with hundreds of thousands of finished workflows so that we can go back and check calculation history. If necessary, workflows can also be archived.

Best,

Matt

Hi Matt,

I want to remove them because they are dramatically slowing down the MongoDB server!

I run the MongoDB server on a virtual machine, and as the number of WFs in the DB increases, the resources (RAM + CPU) needed by the VM increase as well (I have no idea whether the growth is linear or superlinear).

With more than 120000 FWs in the database (for 92 WFs), the 4 CPUs I have assigned to this VM are almost 100% busy…
This is also due to the fact that I built a kind of scheduler on top of Fireworks to efficiently schedule FWs of different categories (by number of cores needed) on each compute node, and to automatically provision or release nodes when needed.
These scheduler instances periodically query the LaunchPad, which increases its load.

I also use a lot of dynamic workflows: a dedicated task at the end of the workflow is responsible for appending a new workflow to it, leading to huge workflows (not so many WFs, but each WF has a huge number of FWs). I don’t know whether this could be responsible for this behaviour.

Best regards,
David

Ok, that sounds completely reasonable: large databases do require a lot of RAM, and dynamic workflows with a large number of fireworks are definitely among the most difficult cases. If you’re running your own Mongo server, you can also check whether compression (WiredTiger) is enabled, since that can have a large effect on performance.
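
One way to check the compression setting is to read the collection's WiredTiger creation string, which contains a block_compressor= entry. This is only a minimal sketch, assuming pymongo is installed; the URI, the database name "fireworks" and the helper names are assumptions for illustration, so adjust them to match your my_launchpad.yaml:

```python
import re


def block_compressor(creation_string):
    """Return the block_compressor value from a WiredTiger creation string.

    A missing key or an empty value is reported as "none".
    """
    match = re.search(r"(?:^|,)block_compressor=([^,]*)", creation_string)
    if match is None or match.group(1) == "":
        return "none"
    return match.group(1)


def collection_compressor(uri, db_name, coll_name):
    """Run collStats on one collection and extract its block compressor."""
    # Imported here so the parser above can be used without pymongo.
    from pymongo import MongoClient

    client = MongoClient(uri)
    stats = client[db_name].command("collStats", coll_name)
    return block_compressor(stats["wiredTiger"]["creationString"])


# Hypothetical usage (connection details are placeholders):
# print(collection_compressor("mongodb://localhost:27017", "fireworks", "fireworks"))
```

A result such as snappy or zlib would mean block compression is on for that collection, while none would mean it is off.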

Just a hunch, but I imagine deleting a workflow is slow because of the large number of fireworks in each of your workflows. I imagine we haven’t hit this hot spot before because we don’t typically have that many in a given workflow. It seems like there are some clear optimizations that could be made here to speed it up, so I’ll have a chat with the developers. Can I ask, are you using the delete_launch_dirs option?
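
To see how large the individual workflows actually are, something like the following could help. It is a rough sketch, not a FireWorks API: it assumes each document in the workflows collection carries a "name" and a "nodes" list of fw_ids, and the helper name workflow_sizes is made up for illustration:

```python
def workflow_sizes(wf_docs, top=5):
    """Return up to `top` (name, firework_count) pairs, largest first."""
    sizes = [(doc.get("name", "?"), len(doc.get("nodes", []))) for doc in wf_docs]
    return sorted(sizes, key=lambda pair: pair[1], reverse=True)[:top]


# Hypothetical usage against a live database (URI and db name are placeholders):
# from pymongo import MongoClient
# coll = MongoClient("mongodb://localhost:27017")["fireworks"]["workflows"]
# print(workflow_sizes(coll.find({}, {"name": 1, "nodes": 1})))
```

If a handful of workflows account for most of the fireworks, that would be consistent with the hunch above.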