Batch Optimization in Rocketsled for HPC

@ardunn

Hi there,

To extend our previous discussion on running BO in parallel, I have settled on the following solution:

  1. Run the whole batch workflow on the HPC login node, because the compute nodes cannot access MongoDB due to the firewall
  2. Submit only the VASP job in each batch to the compute nodes via the queue
  3. Wait for the computations to finish and collect the results (the results appear in the user directory on the login node)
  4. Perform regular BO and submit the next batch of jobs

I have now written a script that waits for all the required files to be created and for the VASP computations of the current batch to finish before proceeding to the next phase. This way I avoid having to deal with FireWorks in offline mode or restructure my BO workflow around the rlaunch queue pattern.
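
Roughly, the wait step looks like the sketch below (the directory names and the file being checked are placeholders for whatever my actual script monitors):

import os
import time

def wait_for_batch(run_dirs, marker_file="OUTCAR", poll_interval=60):
    """Block until every run directory of the current batch contains the
    expected output file (placeholder check; the real script also verifies
    that the VASP run finished cleanly)."""
    pending = set(run_dirs)
    while pending:
        pending = {d for d in pending
                   if not os.path.exists(os.path.join(d, marker_file))}
        if pending:
            time.sleep(poll_interval)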

So a valid solution for the batch problem would be to use launch_multiprocess on the login node, where the job of each individual process is ONLY to create the files needed for VASP execution (pretty light work). Once those are created, each batch job can use a ScriptTask to submit the VASP job to the queue, wait for the result, and then execute BO as usual. This way the cheap computations happen on the login node and the heavy computations on the compute nodes.
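
As a rough sketch of the idea (the write-inputs script, the sbatch command, and wf_creator_vasp are placeholders, not working code), each batch Firework could look something like this:

from fireworks import Firework, LaunchPad, Workflow
from fireworks.user_objects.firetasks.script_task import ScriptTask
from rocketsled.task import OptTask

db_info = {"launchpad": LaunchPad(name="rsled"), "opt_label": "opt_default"}

def wf_creator_vasp(x):
    # Light work on the login node: write POSCAR/INCAR/etc. for this point x.
    write_inputs = ScriptTask.from_str("python write_vasp_inputs.py")  # placeholder script
    # Heavy work goes to the compute nodes via the queue.
    submit_vasp = ScriptTask.from_str("sbatch run_vasp.sh")            # placeholder submission command
    # In the real workflow a wait-and-parse step would sit here, blocking until
    # the queued VASP job finishes and writing _y into the spec, before OptTask
    # runs the BO step.
    fw = Firework([write_inputs, submit_vasp, OptTask(**db_info)], spec={"_x": x})
    return Workflow([fw])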

However, the test code for launch_multiprocess is giving me the following error:

Traceback (most recent call last):
  File "/home/abdul/PycharmProjects/pythonProject/venv3/lib/python3.6/site-packages/fireworks/core/rocket.py", line 262, in run
    m_action = t.run_task(my_spec)
  File "/home/abdul/PycharmProjects/pythonProject/venv3/lib/python3.6/site-packages/rocketsled/task.py", line 221, in run_task
    self.pop_lock(manager_id)
  File "/home/abdul/PycharmProjects/pythonProject/venv3/lib/python3.6/site-packages/rocketsled/task.py", line 851, in pop_lock
    queue = self.c.find_one({"_id": manager_id})["queue"]
TypeError: 'NoneType' object is not subscriptable

The test code is based on the batch.py example we worked on previously. I added a sleep to mimic the computational load so I can check whether the launcher_2022.... folders are created simultaneously for a batch (with launch_multiprocess they are). However, there seems to be a problem with batch_size=3 & num_jobs=3: with batch_size=2 & num_jobs=2 it works, but with both set to 3 I get the error above.

The test code is as follows:

import time
import numpy as np
from fireworks.core.firework import FireTaskBase, Firework, FWAction, Workflow
from fireworks.core.launchpad import LaunchPad
from fireworks.core.rocket_launcher import rapidfire
from fireworks.utilities.fw_utilities import explicit_serialize

from rocketsled.control import MissionControl
from rocketsled.task import OptTask

import datetime

from fireworks import FWorker
from fireworks.features.multi_launcher import launch_multiprocess

# Setting up the FireWorks LaunchPad
launchpad = LaunchPad(name="rsled")
opt_label = "opt_default"
db_info = {"launchpad": launchpad, "opt_label": opt_label}
x_dim = [(-5.0, 5.0), (-5.0, 5.0)]

@explicit_serialize
class RosenbrockTask(FireTaskBase):
    _fw_name = "RosenbrockTask"

    def run_task(self, fw_spec):
        x = fw_spec["_x"]
        y = (1 - x[0]) ** 2 + 100 * (x[1] - x[0] ** 2) ** 2
        print("STARTING ROSENBROCK SLEEP")
        time.sleep(15)
        print("ENDING ROSENBROCK SLEEP")
        return FWAction(update_spec={"_y": y})


def wf_creator_rosenbrock(x):
    spec = {"_x": x}
    # RosenbrockTask writes the _y field to the spec via FWAction(update_spec=...).
    firework1 = Firework([RosenbrockTask(), OptTask(**db_info)], spec=spec)
    return Workflow([firework1])


if __name__ == "__main__":
    mc = MissionControl(**db_info)
    # Use today's date as the reset password (FireWorks safety check).
    todays_date = datetime.datetime.now().strftime("%Y-%m-%d")
    print("todays date", todays_date)

    iteration_size=5
    batch_size = 3

    launchpad.reset(password=todays_date, require_password=True)
    mc.reset(hard=True)
    mc.configure(
        wf_creator=wf_creator_rosenbrock,
        dimensions=x_dim,
        predictor="GaussianProcessRegressor",
        batch_size=batch_size,
        acq='ei',
        enforce_sequential=False,
    )

    for bs in range(batch_size):
        hh=np.random.uniform(-5, 5)
        jj=np.random.uniform(-5, 5)
        launchpad.add_wf(
            wf_creator_rosenbrock(
                [hh, jj]
            )
        )
    batch_initial = time.time()
    # rapidfire(launchpad, nlaunches=(batch_size * iteration_size), sleep_time=0)
    launch_multiprocess(
        launchpad,
        FWorker(),
        nlaunches=(iteration_size * batch_size),
        sleep_time=0,
        loglvl="CRITICAL",
        num_jobs=3,
    )

    batch_final = time.time()
    time_b=batch_final - batch_initial

    print("Time for Batch of {} vs {}".format(batch_size,time_b))

Hi @Awahab, good to see you are making progress with this!

That error basically means something is not right with the optimization collection. It’s probably one of 3 things:

  1. Some parameters of the DB were unknowingly changed, and now an empty collection is being accessed as the optimization collection. For example, the collection used to be called opt or something and then was accidentally renamed to opt_db. I suspect it might be this because I have done it a lot myself LOL
  2. The opt collection got corrupted somehow (less likely, but possible)
  3. It could be a bug I have not yet seen.

Let me look into your test code and see if it is the third possibility. In the meantime, could you just confirm the names of the collections on your live run are correct, and then possibly do a MissionControl.reset?
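
For reference, something like this (names assume the defaults from your test script) should show whether the collection rocketsled is pointing at actually exists and contains data:

from fireworks import LaunchPad
from rocketsled.control import MissionControl

launchpad = LaunchPad(name="rsled")
db = launchpad.db  # underlying pymongo database

print(db.list_collection_names())             # is "opt_default" actually in here?
print(db["opt_default"].count_documents({}))  # and does it hold any documents?

# If the collection is misnamed or corrupted, wipe and reconfigure it:
mc = MissionControl(launchpad=launchpad, opt_label="opt_default")
mc.reset(hard=True)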

@ardunn

Alright, I shall check the names, do a reset, and update you.

Also, just to add to this: I was able to make it functional by changing enforce_sequential=False to enforce_sequential=True.
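
In other words, the only change relative to the configure call in the test script above was the last flag:

mc.configure(
    wf_creator=wf_creator_rosenbrock,
    dimensions=x_dim,
    predictor="GaussianProcessRegressor",
    batch_size=batch_size,
    acq="ei",
    enforce_sequential=True,  # was False; True avoids the pop_lock error
)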

Even with that change, the behaviour is still somewhat undesirable:

  1. For, say, batch_size = 3 → the code does produce 3 jobs to run concurrently, and I can see 3 launcher_2022.... folders in the directory. These first 3 jobs start and finish more or less at the same time (so far so good). The problem arises after these 3 jobs: the pattern becomes erratic, i.e. after the 3 batch jobs complete concurrently, it produces just 1 job, and after that one completes, it produces 2 more jobs concurrently, etc. I will provide a screenshot tomorrow when I go back to the office to explain this point better.

  2. Secondly, for, say, 5 iterations and batch size 3, I expect a total of 15 function evaluations, which is what I get with rapidfire, but launch_multiprocess does 5 x 3 x 3 = 45 evaluations :frowning: