Most of what I've heard from people on the FireWorks side is that problems start to appear when:
you have a lot of workflows that are each very big, e.g., each workflow has something like 10,000 Fireworks in it.
all jobs are expected to finish at roughly the same time, e.g., if you have 10,000 Fireworks (out of a queue of a million) running in parallel and they all finish at the same moment, you have 10,000 jobs trying to write to the database at once. Normally the jobs we run have a distribution of runtimes (and even the queue start times are staggered by the queueing system), so I haven't seen this happen, but I imagine it could be a problem. It can also be a problem if many jobs start running at exactly the same time, since each of them is trying to pull a job from the database (see the sketch below).
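One mitigation for that last case is just to add a small random delay before each job's first pull from the LaunchPad, so that thousands of jobs starting at the same moment don't all hit MongoDB at once. A rough sketch (this is not something built into FireWorks; the function name and jitter value are made up for illustration):

    import random
    import time

    from fireworks import FWorker, LaunchPad
    from fireworks.core.rocket_launcher import launch_rocket

    def staggered_launch(max_jitter_s=30):
        """Sleep a random amount before pulling a Firework from the database."""
        time.sleep(random.uniform(0, max_jitter_s))  # spread out the initial DB reads
        launchpad = LaunchPad.auto_load()            # reads my_launchpad.yaml as usual
        launch_rocket(launchpad, FWorker())          # pull and run a single Firework

    if __name__ == "__main__":
        staggered_launch()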
Otherwise, you would need to try it, but I don't know of any problems from the FireWorks side other than what Alex mentioned. That said, depending on the architecture of your computing cluster, there can be problems on the cluster side. Many clusters are not really set up to run several thousand small jobs at once. For example, some have a "MOM" node that serves as an intermediary between the head node and the compute nodes. All Python portions of the Firework are actually executed on the MOM node, and only the "mpi" commands are passed over to the compute nodes. If your cluster has this kind of architecture and there is a ratio of, say, 50 compute nodes per MOM node, you might have the Python processes for 50 Fireworks sharing the same MOM node, which can stress or crash it.
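To make the division of labor concrete, here is a minimal sketch (the executable name and core count are placeholders): on a MOM-node cluster the Python side of the Firetask runs wherever the launcher runs, i.e. on the MOM node, while only the mpirun it spawns reaches the compute nodes.

    from fireworks import Firework, ScriptTask

    # On a MOM-node cluster, the ScriptTask's Python bookkeeping executes on the
    # MOM node; only the mpirun command it spawns lands on the compute nodes.
    fw = Firework(ScriptTask.from_str("mpirun -n 32 my_parallel_code"))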
On Wednesday, February 28, 2018 at 2:09:18 PM UTC-8, Alex Dunn wrote:
I haven't run into a situation on the scale of yours before, but I have encountered issues with mongo and the number of available connections to the db. Typically these manifest as "too many open files" errors! These can be solved with a handwave (by raising the ulimit) if you're running the db on a unix system: https://docs.mongodb.com/manual/reference/ulimit/
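If you want to inspect the limit from Python before going that route, the standard-library resource module reports (and, up to the hard limit, can raise) the open-file limit of the current process. This is only a sketch and only affects the process that runs it; mongod's own limits still need to be raised as described in the linked docs.

    import resource

    # Inspect the current process's open-file limits (Unix only).
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"open files: soft={soft}, hard={hard}")

    # A non-root process may raise its soft limit up to (but not beyond) the hard limit.
    resource.setrlimit(resource.RLIMIT_NOFILE, (min(65536, hard), hard))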
As for a more scalable solution, you might want to look into sharding with mongodb: https://docs.mongodb.com/manual/sharding/
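For what it's worth, a rough sketch of what that might look like through pymongo, assuming a sharded cluster (mongos, config servers, shards) is already deployed and that your FireWorks database and collection are both named "fireworks" (adjust to your setup); the host name and shard key here are illustrative, not a recommendation:

    from pymongo import MongoClient

    # Connect to the mongos router, not to an individual mongod.
    client = MongoClient("mongodb://my-mongos-host:27017")

    # Allow the FireWorks database to be sharded, then shard the main collection.
    client.admin.command("enableSharding", "fireworks")
    client.admin.command(
        "shardCollection",
        "fireworks.fireworks",
        key={"fw_id": "hashed"},  # hashed key spreads inserts across shards;
                                  # a non-empty collection needs a matching hashed index first
    )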
On Thursday, February 8, 2018 at 1:32:51 AM UTC-8, Guido Petretto wrote:
I would like to use FireWorks for a project where we will need to run many short workflows, each with a few steps and with each step lasting from a few minutes to a couple of hours on a single core. In principle there should always be several thousand jobs running at the same time. Up to now I have used FireWorks with much smaller numbers, and I expect that this may stress FireWorks' capacity in several ways (e.g., the need to constantly submit a large number of jobs that are also finishing at a high rate, a fast increase of the DB size, and so on).
I would like to know whether any users are already aware of problems or bottlenecks that we should take into account with these kinds of requirements.