Hi there,
To extend our previous discussion on running BO in Parallel, I have resolved to the following solution:
- Run whole batch workflow in Login node in HPC because of firewall Compute Nodes cannot access MongoDB
- Submit only the VASP job in each batch to the Compute Nodes via queue
- Wait for computations to finish and obtain the results (the results appear in user directory in login node)
- Perform regular BO and submit next batch jobs
Now I have written a script that waits for all appropriate files to be created and VASP computations to finish for current batch, before proceeding to next phase. So in this way I have mitigated the trouble of dealing with fireworks in offline mode or modifying my BO workflow to the pattern of rlaunch queue
So a valid solution for batch problem would be to launch_multiprocess
in login node, where the job for each of the individual process would be ONLY to create necessary files for VASP execution (pretty light work) and once created, within each batch job created I can do a ScriptTask
to submit VASP job to the queue and each process would just wait for the result and then execute BO as usual. So in this way we can do cheap computations in login node and heavy computations in compute node.
However the test code for launch_multiprocess is giving me following error:
Traceback (most recent call last):
File "/home/abdul/PycharmProjects/pythonProject/venv3/lib/python3.6/site-packages/fireworks/core/rocket.py", line 262, in run
m_action = t.run_task(my_spec)
File "/home/abdul/PycharmProjects/pythonProject/venv3/lib/python3.6/site-packages/rocketsled/task.py", line 221, in run_task
self.pop_lock(manager_id)
File "/home/abdul/PycharmProjects/pythonProject/venv3/lib/python3.6/site-packages/rocketsled/task.py", line 851, in pop_lock
queue = self.c.find_one({"_id": manager_id})["queue"]
TypeError: 'NoneType' object is not subscriptable
The test code is based on our previously worked batch.py
example, I added waiting time to mimic load of computation so I can see if the launcher_2022....
folders are being created simultaneously for a batch (which for launch_multiprocess
they do). However, there seems to be a problem with batch=3
& num_jobs=3
, if I do batch = 2
&num_job=2
then it works otherwise if they both are same I get the error above.
The test code is as follows:
import time
import numpy as np
from fireworks.core.firework import FireTaskBase, Firework, FWAction, Workflow
from fireworks.core.launchpad import LaunchPad
from fireworks.core.rocket_launcher import rapidfire
from fireworks.utilities.fw_utilities import explicit_serialize
from rocketsled.control import MissionControl
from rocketsled.task import OptTask
import datetime
import matplotlib
matplotlib.use('WebAgg')
from fireworks.features.multi_launcher import launch_multiprocess
from fireworks import Firework, LaunchPad, FWorker
# Setting up the FireWorks LaunchPad
launchpad = LaunchPad(name="rsled")
opt_label = "opt_default"
db_info = {"launchpad": launchpad, "opt_label": opt_label}
x_dim = [(-5.0, 5.0), (-5.0, 5.0)]
@explicit_serialize
class RosenbrockTask(FireTaskBase):
_fw_name = "RosenbrockTask"
def run_task(self, fw_spec):
x = fw_spec["_x"]
y = (1 - x[0]) ** 2 + 100 * (x[1] - x[0] ** 2) ** 2
print("STARTING ROSENBROCK SLEEP")
time.sleep(15)
print("ENDING ROSENBROCK SLEEP")
return FWAction(update_spec={"_y": y})
def wf_creator_rosenbrock(x):
spec = {"_x": x}
# ObjectiveFuncTask writes _y field to the spec internally.
firework1 = Firework([RosenbrockTask(), OptTask(**db_info)], spec=spec)
return Workflow([firework1])
if __name__ == "__main__":
mc = MissionControl(**db_info)
todays_date=str(datetime.datetime.now().year)+'-'+str(datetime.datetime.now().month).zfill(2)+'-'+str(datetime.datetime.now().day).zfill(2)
print("todays date",todays_date)
launchpad.reset(password=todays_date, require_password=True)
iteration_size=5
batch_size = 3
launchpad.reset(password=todays_date, require_password=True)
mc.reset(hard=True)
mc.configure(
wf_creator=wf_creator_rosenbrock,
dimensions=x_dim,
predictor="GaussianProcessRegressor",
batch_size=batch_size,
acq='ei',
enforce_sequential=False,
)
for bs in range(batch_size):
hh=np.random.uniform(-5, 5)
jj=np.random.uniform(-5, 5)
launchpad.add_wf(
wf_creator_rosenbrock(
[hh, jj]
)
)
batch_initial = time.time()
# rapidfire(launchpad, nlaunches=(batch_size_b*iteration_size), sleep_time=0)
launch_multiprocess(launchpad,FWorker(),nlaunches=(iteration_size*batch_size), sleep_time=0, loglvl="CRITICAL",num_jobs=3)
batch_final = time.time()
time_b=batch_final - batch_initial
print("Time for Batch of {} vs {}".format(batch_size,time_b))