Batch Optimization in Rocketsled Does Not Run in Parallel

@ardunn

I am running batch optimization with DFT calculations. I have a 24-core CPU, so the objective was to run several DFT calculations in parallel using batch optimization. Rocketsled does provide batch suggestions, but those suggestions are not run in parallel; they run sequentially. So if batch_size is, say, 4, I do get 4 suggestions, but those 4 suggestions run one after another. My workflow is set up as follows:

    optimization_task = Firework([OptTask(**db_info)], name="optimization_task")
    workflow = Workflow([fw1, fw2, optimization_task], {fw1: [fw2], fw2: [optimization_task]})
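
For context, here is a minimal sketch of how this topology might sit inside a rocketsled wf_creator function (the fw1/fw2 task contents are placeholders, and db_info is assumed to hold the OptTask configuration):

    from fireworks import Firework, ScriptTask, Workflow
    from rocketsled import OptTask

    def wf_creator(x):
        # Placeholder tasks standing in for the real DFT run and parsing steps
        fw1 = Firework([ScriptTask.from_str("echo 'run DFT'")], name="fw1")
        fw2 = Firework([ScriptTask.from_str("echo 'parse output'")], name="fw2")
        # db_info is assumed to hold the OptTask configuration (launchpad, opt_label, etc.)
        optimization_task = Firework([OptTask(**db_info)], name="optimization_task")
        return Workflow([fw1, fw2, optimization_task],
                        {fw1: [fw2], fw2: [optimization_task]})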

So even though the topology of each workflow is sequential (fw1 -> fw2 -> optimization_task), for a batch of N, N workflows are added; however, all N workflows are executed sequentially!

Is there a way to parallelize this so I can run several DFT calculations on the cluster?

Note: I looked into the enforce_sequential option, but I am fine with the optimizer waiting for all batches to complete before providing suggestions that are not duplicates. Since the DFT calculations take more or less the same time, it is worth waiting.

Hey @Awahab

How are you actually running the workflows on the cluster?

Rocketsled does not have any capability to actually run the workflows in parallel for you (that’s Fireworks), it can only submit workflows which can be run in parallel if you choose to run them that way.

If you want to run several workflows/fws in parallel on the same node, you’ll probably want to use rlaunch multi from fireworks to launch them in separate processes (see the fireworks docs). If you want to run several workflows/fws in parallel across multiple nodes, you just need to pull and run jobs on each node (see this part of the fireworks docs).

If you have already considered this and it is still not working, what fireworks commands are you currently using to run your workflows?

Hi @ardunn,

thanks for the swift correspondence! I am basically launching Fireworks in Python, as per this previous thread. My configuration is the same as in that thread, except the evaluator is now a DFT calculation. I am running the following:

    from fireworks.core.rocket_launcher import rapidfire

    rapidfire(launchpad, nlaunches=100)

Revisiting our previous thread, I saw your following comment:

First, we should clear up exactly what you are timing. It seems from your code the timing is comparing 30 launches of the sequential (non-batch, or batch=1) workflow to 30 launches of the batch=15 workflow.
I’ll assume the batch_size_b=115 is a typo and you meant batch_size_b=15. Your objective function is very fast and has a basically negligible time of evaluation. So what we are really comparing in your example is the timing internally for FireWorks and Rocketsled to process two different workflows.

If the above is correct and what you intended, then the timings are pretty explainable. There are several reasons why single experiments run sequentially take longer than batches.

  1. Sequential experiments run an optimization after every workflow. Batches run an optimization once every batch_size workflows. So if you are running 30 in total, the sequential case will run 30 optimizations whereas the batch=15 case will run only 2. In this case, the optimization time is not trivial compared to the objective function (rosenbrock), so the optimization itself is the expensive step: in one case you’re running 30 optimizations and in the other you’re really only running 2. This is probably the main reason for the discrepancies in timings.
  2. Submitting workflows to the launchpad and executing them in bulk (as the larger batch size does) is likely more efficient than submitting and processing them sequentially, though I wouldn’t expect this to have a large effect, likely only a few milliseconds’ difference in timing.

Now I am using actual DFT calculations, so evaluation timings are not negligible: it takes up to 2 hours for a single DFT calculation to run. I thought that the parallel computation part in Fireworks was abstracted from the user. Since the batch optimization code in task.py creates workflows of batch size N, I thought those workflows would be run in parallel by Fireworks, while Rocketsled’s optimization would wait for the N batch workflows to complete before finding the top-N suggestions to run in the next batch.

If that’s not the case, is there something that could be done?

So if you are only running the fireworks on one node, that rapidfire command you have above will necessarily run them sequentially, not in parallel.

To summarize parallelism in Fireworks/rocketsled:

  • Rocketsled does not itself manage parallelism apart from the submission and management of workflows; it does not actually run workflows itself. That is managed by fireworks, and you need to call the correct commands to have them run.
  • Fireworks does abstract parallelism from the user, but you still have to call the correct commands to actually pull the correct workflows from the launchpad and run them in parallel. There are several ways to use parallelism in fireworks:
  1. One node, running multiple fireworks in parallel. This is if you want multiple workflows to run at the same time on the same computer. The fireworks are managed as multiple processes. You run this with rlaunch multi (command line) or launch_multiprocess (Python).
  2. Multiple nodes running one firework: use this if you have big calculations where each one needs to be parallelized using MPI or OpenMP. Fireworks has documentation for doing this, and you’ll need to configure your calculations to use MPI etc. In this case, once configured, you can run rlaunch on a single node and it will run that single calculation across multiple nodes.
  3. Multiple nodes each running their own firework: use this if a single node can handle a single calculation but you want to run a bunch of them at the same time. For this you will run rlaunch on each of the nodes independently and they will pull FWs from the launchpad and run them independently.
  4. Multiple instances of calculations where each calculation requires multiple nodes. For example, if you are going to run 5 “big” fireworks, and each needs to be parallelized over 10 nodes, you run rlaunch on the 5 head nodes and Fireworks+MPI/OMP will run these 5 big fireworks across all 50 required nodes. This is a combination of (2) and (3).

There are others as well, like running multiple instances of multiple fireworks on the same node (kind of like (1) and (3) combined). These can all be done programmatically with submission scripts + cronjobs, so that as soon as you submit something to the launchpad (whether by rocketsled or manually) your cluster will pull the fireworks and run them automatically.

Rocketsled doesn’t concern itself with the actual running of the parallel workflows (apart from managing which ones are submitted to the launchpad and how that is done), so it can handle all of the above scenarios.

Your specific case

It sounds like what you are trying to do is scenario (1). If you want to launch these kinds of fireworks in parallel from the command line, use rlaunch multi; if you want to do it in Python, use launch_multiprocess from fireworks.features.multi_launcher. If you want to launch them in a loop, put the launch_multiprocess call inside a loop.
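
For example, a minimal sketch of the Python route (the nlaunches and num_jobs values here are just illustrative):

    from fireworks import FWorker
    from fireworks.features.multi_launcher import launch_multiprocess

    # Run 3 fireworks at a time as separate processes on this one node
    launch_multiprocess(launchpad, FWorker(), loglvl="INFO",
                        nlaunches=10, num_jobs=3, sleep_time=0)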

That being said, if you are running multiple DFT calculations on a single node I would bet you are going to run into issues (memory, compute contention, etc.). I’d personally recommend running N calculations on N nodes, where each node runs its own single calculation (scenario (3) above). To do this, you add workflows to the launchpad (either by rocketsled or manually), then on each node you run them with rapidfire or launch_rocket.
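
Something like this minimal per-node launcher, run independently on each compute node (assuming your launchpad credentials are in the usual my_launchpad.yaml):

    from fireworks import FWorker, LaunchPad
    from fireworks.core.rocket_launcher import rapidfire

    # Each node runs this same script and pulls whatever READY
    # fireworks are on the launchpad, running them one at a time.
    launchpad = LaunchPad.auto_load()
    rapidfire(launchpad, FWorker())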

Tagging @computron here in case he feels this needs more clarification.

@ardunn

Alright, I shall try this. I believe there might be some confusion about the word “node”. By node I mean an individual process/thread of the cluster. That said, I am not planning to launch multiple DFT calculations on a single node. My objective is to launch n nodes for n batches; that would make sense for running multiple DFT calculations on a cluster. Usually on a cluster we use SLURM to schedule n jobs on n nodes. But since this responsibility is taken over by Fireworks, I am interested in having the n workflows produced by task.py launch n fireworks in parallel; the optimizer then waits for them to complete before combining the data, retraining the GP, and producing the n batches again.

In my case it’s 1 node per firework, so (3) applies, as you suggested. I shall look into your comments again and get back to you.

Hi @Awahab

Here’s a decent overview of some HPC terms including what I mean by “node”: What are standard terms used in HPC? — VSC documentation

Basically a node = 1 compute server.

I’ll be able to help you debug more in depth later next week, so maybe hold tight until then if you are still having problems!!


Hey @Awahab, were you able to figure out this problem?

Hi @ardunn,

Thanks for getting back to me. I just arrived back from our break, followed by the ACS conference. I shall look into it this week and get back.

Thanks again!


Hi @ardunn ,

I have been going through the documentation of our HPC; it’s a CRAY XC50. I also re-read your comments and tried to find examples of launch_multiprocess.

I found this link useful as well. All I have to do is replace rapidfire with launch_multiprocess. In the meantime, I ran this on my local PC with the following arguments:

    from fireworks import FWorker
    from fireworks.features.multi_launcher import launch_multiprocess

    launch_multiprocess(launchpad, FWorker(), nlaunches=10, sleep_time=0,
                        loglvl="CRITICAL", num_jobs=3)

I noticed that this time 3 VASP computations were launched :smile: This is what I was looking for: if I want to do batch Bayesian optimization with a batch size of 3, I expect 3 actual VASP computations to run in parallel. Now, my local CPU is a Xeon E5-2690 v4 with 28 threads, in technical terms a single compute node which can run 28 processes. So is launch_multiprocess automatically distributing the 3 parallel jobs across however many processes are required to do the 3 jobs (run 3 fireworks in parallel)? If the 3 parallel jobs require, say, 15 processes, does it handle that automatically?

In the case of the CRAY XC50 with 464 nodes, launch_multiprocess won’t be able to utilize multiple nodes, only multiple processes on a single node, right? If so, then option 3, multiple nodes each running their own firework, is the relevant one for my case.

I have marked your solution as resolved since the initial step of running parallel computations on my local PC is complete. I just need to do profiling to be sure and to know how much time this parallelization is saving me.

I noticed that this time 3 VASP computations were launched :smile:

Glad we are making progress!

so is launch_multiprocess automatically distributing the 3 parallel jobs across however many processes are required to do the 3 jobs (run 3 fireworks in parallel)?

Yes. If internally each of those 3 jobs does some parallelization (e.g., nested Python multiprocessing, OpenMP, or some other thread-based parallelism), each of the three jobs will use that parallelism. This is not something fireworks handles directly; it rather says “ok, now it’s time to run this task in this process, so we’ll have the task do whatever it needs to do”.

So if, for example, you have a VASP calculation running 5 OpenMP threads to parallelize across bands, and you’re running 3 of them in parallel, then you get 3x5=15 processes. But again, managing these internal threads is not something fireworks does directly; your calculations should be configured so your scheme for parallelism actually makes sense. On a single node you wouldn’t want to run 10 parallel fireworks each having 20 internal threads, because then you will have 200 threads and your node will probably just lock up.

I would say you almost certainly do not want to use launch_multiprocess on a single node for DFT calculations. I would also guess that running more than one DFT calculation at a time on a single node will net you almost no benefit from parallelization. Why? Because if your VASP config is properly parallelized, a single calculation will already be using all the cores, and adding more calculations will just make the others wait on the CPU or run out of memory. This will not be the case for (for example) 1000 nodes each running one firework - there you would see huge benefits from parallelization.

In the case of the CRAY XC50 with 464 nodes, launch_multiprocess won’t be able to utilize multiple nodes, only multiple processes on a single node, right?

Yes, that is exactly correct. You’ll probably want option (3) or (4), but for simplicity let’s just try (3) first?

For this, you just need to run either rapidfire or launch_rocket in the shell on all the compute nodes simultaneously. Normally this is done by:

(1) Having the workflows ready to go in your FWs launchpad (if you are using rocketsled you’ll probably only need to add the workflows once - the first loop - then they will be added automatically)
(2) Either

  • (a) Submitting a bunch of compute node requests at once, where each will run a launch_rocket or rapidfire in a batch script (see the sketch after this list). Then each of these will pull and run firework(s).
  • (b) Having a cron job or script submit new compute node requests on a regular basis. Then, at some regular interval, each requested compute node runs a batch script running launch_rocket or rapidfire, and your DFT jobs run at that interval. For example, if you are running a batch of 100 DFT calculations in parallel across 100 nodes, and each DFT takes 2 hours, you might submit 100 new compute node requests every 2 hours. Picking the exact scheme which is right for your problem is up to you; this is just a naive suggestion.
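
As a rough illustration of (a), each compute node’s batch script could run a tiny Python launcher like this (a sketch; LaunchPad.auto_load() assumes your usual my_launchpad.yaml is available):

    from fireworks import FWorker, LaunchPad
    from fireworks.core.rocket_launcher import launch_rocket

    # Executed inside each compute node's batch script: pull exactly one
    # READY firework from the launchpad and run it on this node.
    launchpad = LaunchPad.auto_load()
    launch_rocket(launchpad, FWorker())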

@ardunn

Looks like I celebrated too soon. I left the computations running yesterday, but somehow they stalled; I never got any results from the VASP computations. I thought it could be some sort of deadlock, but I did manage to run 3 VASP computations separately by opening 3 terminals.

Running the batch VASP computations using launch_multiprocess did not yield any results, and the terminal output just mentioned "task_started"; there were no errors and no waiting/sleeping cycles either.

So I decided to run the batch_optimization example that we troubleshooted last time, replacing rapidfire with launch_multiprocess, and it did work. I compared 15 total evaluations as 15 sequential iterations using rapidfire vs. 5 iterations with a batch of 3 (num_jobs=3) for launch_multiprocess, and the results were the following:

Y total evaluations in sequential runs = Y iterations
Y total evaluations in batch runs = (Y/n) iterations x n batches

Total time for 15 iterations = 1.646 seconds
Total time for 5 iterations x 3 batches = 121.309 seconds

So I thought there might be IPC overhead making the batch take more time, and hence I tried scaling up to see if that is indeed the case. Following are the results:

for 150 total evaluations

Total time for 150 iterations = 23.260 seconds
Total time for 50 iterations x 3 batches = 129.2 seconds

and then for 600 total evaluations

Total time for 600 iterations = 117.23 seconds
Total time for 200 iterations x 3 batches = 169.275 seconds

and then for 1800 total evaluations

Total time for 1800 iterations = 632.439 seconds
Total time for 600 iterations x 3 batches = 746.63 seconds

I believe the scale has to be significantly larger to see the benefits of batch parallelization, so I will double the size and report the findings.

(P.S. I shall revisit your latest comments on running it on multiple nodes.)

@ardunn

I managed to run the VASP code using launch_multiprocess. The only way all of this works is if the original code submits 1 workflow, even when we want to run a batch. (So for a batch of 3 we should not add 3 workflows as we did in examples/batch.py; if we do that and also set num_jobs=3 in launch_multiprocess, then the total number of optimizations becomes iterations x 3 x 3. In other words, launch_multiprocess with num_jobs=x will run x multiples of the iterations.) So now I am running the benchmark on the VASP code, and we will have good insight into the computation times.

P.S. I was trying to find a free profiler to record the threads/processes but still couldn’t get my hands on one.


Ok, the results for VASP runs are in.

For 40 total evaluations, the sequential optimization was faster than the batch optimization!

Sequential rapidfire = 40 iterations = 37569.18 sec
Batch launch_multiprocess = 20 iterations x 2 batches = 38378.57 sec

And as you suggested, launch_multiprocess is certainly not recommended here. I shall focus on your comments on using multiple nodes and, if needed, submit a new question.

Hey @Awahab, yes, the original workflow function should only submit ONE workflow. Rocketsled internally handles the batching and organization.
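
For reference, a minimal sketch of that setup using rocketsled’s MissionControl (the wf_creator, dimensions, opt_label, and seed point here are all placeholders):

    from fireworks import LaunchPad
    from rocketsled import MissionControl

    launchpad = LaunchPad.auto_load()
    mc = MissionControl(launchpad=launchpad, opt_label="opt_default")

    # batch_size=3 tells rocketsled to wait until 3 evaluations complete,
    # then submit the next batch of 3 suggested workflows itself.
    mc.configure(wf_creator=wf_creator, dimensions=dimensions, batch_size=3)

    # Seed the loop by submitting only ONE workflow; rocketsled adds the rest.
    launchpad.add_wf(wf_creator([1.0, 2.0, 3.0]))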

Hey @Awahab, thanks for reporting back! Yes, I am almost certain that if your VASP code is properly parallelized, there will be almost no benefit (if not a significant downside, as in your experiment) to running batches in parallel on the same node.

However, if you are running individual VASP jobs each on a single node, you will see a speedup proportional to the number of nodes you are running on. For example, for a batch of 50, if you are running 50 individual calculations each on its own node in parallel (so one node = one calculation), you will see a ~50x speedup.

Good luck and please let me know if you have more questions!

The order of things I would try is this:

  1. Run 5 independent calculations on 5 nodes without rocketsled, just to make sure you have everything configured ok. Make sure you have the internal VASP parallelism set correctly (e.g., NCORE and OpenMP); this ensures that each individual node will break the single calculation up into many processes across all the cores of a single manycore node.
  2. Run a batch_size=5 rocketsled run of the same 5 independent calculations on 5 nodes to make sure your rocketsled is configured ok.
  3. If this is fast enough for you, set a reasonable batch size and run your actual experiment by submitting jobs in a scheduled fashion.
  4. If each of the calculations is still too slow for your liking, look into using OpenMPI for parallelizing individual calculations across nodes. This is more complex, but if EACH of your DFT calculations is very expensive (e.g., >24hr runtime) this can speed things up significantly.

Hi @ardunn ,

Surprisingly, this is exactly what our HPC technician told me. I had a detailed discussion with him, and he suggested doing this first, without Rocketsled. However, we are not allowed any sudo permissions on our HPC, so I can’t run MongoDB there; I was, however, able to install it using Anaconda.

The workaround suggested by the technician was to run Rocketsled on my local PC/server, where MongoDB can be installed; since Fireworks is just managing the workflow, it won’t take many resources if all the management is done on my local server or PC and only the VASP calculations are offloaded to individual nodes on the HPC. But since running Fireworks with a queue still needs MongoDB running on the HPC, I am trying to code a virtual job that executes only mpirun vasp_std on the cluster (without the Rocketsled or Fireworks libraries), copies the files (OUTCAR) back to the local server/computer, and proceeds from there :exploding_head:. Not sure if it’s a good way of doing things, given that I can’t run MongoDB on our HPC.
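
A rough sketch of what I mean by that virtual job (the paths are placeholders, and in practice the copy back to the local server would be an scp/rsync rather than a local copy):

    import shutil
    import subprocess

    def run_virtual_vasp_job(workdir, results_dir):
        # Run VASP directly on the cluster, with no Fireworks/Rocketsled involved
        subprocess.run(["mpirun", "vasp_std"], cwd=workdir, check=True)
        # Copy the output back so the local optimization loop can parse it
        shutil.copy(f"{workdir}/OUTCAR", results_dir)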

I shall do what you have suggested in the comments above, and should you have any further suggestions given the MongoDB situation, do let me know. Since this thread has already been resolved/answered by you, if I have further questions I shall start a new thread.
