Scalability of rlaunch multi

txie · June 2, 2020, 10:13pm

Hi,

I am trying to pack lots of small jobs into a large slurm job with rlaunch multi. I am curious if there are any benchmarks on the scalability of rlaunch multi. For example, would it be possible to do rlaunch multi 1000? Is there anything that I need to consider if I want to scale up rlaunch multi?

Thanks for your help!

ardunn · June 3, 2020, 1:42am

I can only speak from experience, but the safe practical limit I ran into (IIRC) was ~50 running on a compute node. I remember getting it to work a few times with a few hundred or a thousand, but those runs hit errors at a significant rate. I am assuming the commands FireWorks is running are MPI commands (i.e., the rlaunch multi runs on one node, the MPI processes run across the active node set)?

txie · June 4, 2020, 3:08am

Thank you for your reply! Yes, the commands are MPI commands. In my case, rlaunch multi runs on multiple nodes, and each sub-job occupies one node. E.g. rlaunch multi 50 would run on 50 nodes. I am doing this because the queuing system prioritizes larger jobs.

Do you know what might be causing the errors? Is it specific to using rlaunch multi on a single compute node?

ardunn · June 4, 2020, 11:54pm

If I remember correctly, it had to do with (basically) running out of memory on the node the fworker processes are running on. If you have 50+ rlaunch processes on a single (let’s say) 24-core node managing 50 different MPI calculations, FireWorks may not work as intended.

There may be other FireWorks aficionados out there who have different experiences or tips though!

Anubhav_Jain · June 5, 2020, 2:00pm

Yes, I think 50-100 is usually a good range.

FWS uses multiprocessing to do job packing, and if there are more than 100 processes running on a node, that can cause slowdown and memory issues. But the specifics will depend on the configuration of the node that you are running rlaunch multi on.
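
For anyone who wants to go well past that range, one option that follows from the per-node limit discussed above is to start several smaller rlaunch multi invocations on different nodes instead of one very large one. The helper below is only a sketch of the arithmetic, not something FireWorks provides: `plan_launchers` and `max_per_node` are hypothetical names, the default of 50 is taken from the numbers quoted in this thread, and how each invocation gets placed on its own node depends on your scheduler and is not shown.

```python
import math


def plan_launchers(total_jobs, max_per_node=50):
    """Split total_jobs into groups of at most max_per_node parallel sub-jobs.

    Hypothetical helper: each returned number is the argument for one
    `rlaunch multi N` invocation, assumed to be started on a separate node.
    """
    if total_jobs < 1 or max_per_node < 1:
        raise ValueError("total_jobs and max_per_node must be positive")
    # Smallest number of groups that keeps every group within the cap.
    n_groups = math.ceil(total_jobs / max_per_node)
    # Spread the jobs evenly; the first `extra` groups get one more job.
    base, extra = divmod(total_jobs, n_groups)
    return [base + 1] * extra + [base] * (n_groups - extra)


if __name__ == "__main__":
    for i, n in enumerate(plan_launchers(1000, max_per_node=50)):
        print(f"launcher node {i}: rlaunch multi {n}")
```

With 1000 sub-jobs and a cap of 50 this gives 20 invocations of rlaunch multi 50. Whether 50, 100, or some other cap is right depends on the memory and core count of the node running the launcher processes, so treat the default as a starting point to test rather than a guarantee.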