Scalability of rlaunch multi

Hi,

I am trying to pack lots of small jobs into one large Slurm job with rlaunch multi. Are there any benchmarks on the scalability of rlaunch multi? For example, would it be possible to run rlaunch multi 1000? Is there anything I need to consider if I want to scale up rlaunch multi?

Thanks for your help!

I can only speak from experience, but the safe practical limit I ran into (IIRC) was ~50 running on a compute node. I remember getting it to work a few times with a few hundred or a thousand, but those runs hit errors at a significant rate. I am assuming the commands FireWorks is running are MPI commands (i.e., rlaunch multi runs on one node and the MPI processes run across the active node set)?

Thank you for your reply! Yes, the commands are MPI commands. In my case, rlaunch multi runs on multiple nodes, and each sub-job occupies one node. E.g. rlaunch multi 50 would run on 50 nodes. I am doing this because the queuing system prioritizes larger jobs.

Do you know what might be causing the errors? Is it specific to using rlaunch multi on a single compute node?

If I remember correctly, it had to do with (basically) running out of memory on the node the fworker processes are running on. If you have 50+ rlaunch processes on a single (let’s say) 24-core node managing 50 different MPI calculations, FireWorks may not work as intended.
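As a back-of-the-envelope check, something like the sketch below could help decide how many launcher processes to put on one node. This is only an illustration: `max_launchers`, its parameters, and the numbers in the usage line are hypothetical placeholders, not measurements and not part of FireWorks. You would need to measure the per-process memory on your own system. The default cap of 50 is just the rough limit mentioned above.

```python
def max_launchers(node_mem_gb, per_launcher_gb, reserve_gb=4.0, cap=50):
    """Rough upper bound on launcher processes for a single node.

    node_mem_gb     -- total memory of the node running rlaunch multi
    per_launcher_gb -- memory one launcher process needs (measure this yourself)
    reserve_gb      -- memory left free for the OS and everything else (assumption)
    cap             -- hard ceiling regardless of memory (~50 from this thread)
    """
    usable_gb = node_mem_gb - reserve_gb
    if usable_gb <= 0 or per_launcher_gb <= 0:
        return 0
    # How many launchers fit in the usable memory, rounded down.
    by_memory = int(usable_gb // per_launcher_gb)
    return min(by_memory, cap)


# Placeholder numbers only: a 64 GB node and 0.5 GB per launcher process.
print(max_launchers(node_mem_gb=64, per_launcher_gb=0.5))
```

If the memory-based number comes out below the cap, memory is the tighter constraint and it is probably worth lowering the rlaunch multi count accordingly.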

There may be other FireWorks aficionados out there who have different experiences or tips though!

Yes, I think 50-100 is usually a good range.

FWS uses multiprocessing to do job packing, and if there are more than 100 processes running on a node, that can cause slowdowns and memory issues. But the specifics will depend on the configuration of the node that you are running rlaunch multi on.
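If you need more sub-jobs than that in total, one option is to split them across several rlaunch multi invocations, each within the 50-100 range. Here is a minimal sketch of the arithmetic. The helper `plan_groups` is hypothetical and not part of FireWorks. It only works out group sizes and prints the corresponding `rlaunch multi N` commands. How you start each group (separate Slurm jobs, or several launcher nodes inside one allocation) depends on your setup.

```python
import math


def plan_groups(total_jobs, max_per_group=50):
    """Split total_jobs into evenly sized groups of at most max_per_group."""
    if max_per_group <= 0:
        raise ValueError("max_per_group must be positive")
    if total_jobs <= 0:
        return []
    # Smallest number of groups that keeps every group within the limit.
    n_groups = math.ceil(total_jobs / max_per_group)
    # Spread the jobs evenly; the first `extra` groups get one more job.
    base, extra = divmod(total_jobs, n_groups)
    return [base + 1] * extra + [base] * (n_groups - extra)


# Example: 120 sub-jobs with at most 50 per rlaunch multi invocation.
for size in plan_groups(120, max_per_group=50):
    print(f"rlaunch multi {size}")
```

Balancing the groups (three groups of 40 rather than 50, 50 and 20) keeps the load on each launcher node about the same. The trade-off is that several smaller groups may not get the same queue priority as one very large job.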