Failed to execute the example run

Anton_F · January 13, 2023, 5:35am

Dear community!
I’ve successfully configured and installed Atomate. However, when I try to run the test calculation described in "Installing atomate — atomate 1.0.3 documentation", in particular when I run qlaunch rapidfire -m 1, an error is printed to the FW.job-.error file:

Traceback (most recent call last):
  File "/home1/theory/fil/atomate/atomate_env/bin/rlaunch", line 8, in <module>
    sys.exit(rlaunch())
  File "/home1/theory/fil/atomate/atomate_env/lib/python3.9/site-packages/fireworks/scripts/rlaunch_run.py", line 160, in rlaunch
    rapidfire(
  File "/home1/theory/fil/atomate/atomate_env/lib/python3.9/site-packages/fireworks/core/rocket_launcher.py", line 106, in rapidfire
    while (skip_check or launchpad.run_exists(fworker)) and time_ok():
  File "/home1/theory/fil/atomate/atomate_env/lib/python3.9/site-packages/fireworks/core/launchpad.py", line 900, in run_exists
    return bool(self._get_a_fw_to_run(query=q, checkout=False))
  File "/home1/theory/fil/atomate/atomate_env/lib/python3.9/site-packages/fireworks/core/launchpad.py", line 1173, in _get_a_fw_to_run
    m_fw = self.fireworks.find_one(m_query, {"fw_id": 1, "spec": 1}, sort=sortby)
  File "/home1/theory/fil/atomate/atomate_env/lib/python3.9/site-packages/pymongo/collection.py", line 1459, in find_one
    for result in cursor.limit(-1):
  File "/home1/theory/fil/atomate/atomate_env/lib/python3.9/site-packages/pymongo/cursor.py", line 1248, in next
    if len(self.__data) or self._refresh():
  File "/home1/theory/fil/atomate/atomate_env/lib/python3.9/site-packages/pymongo/cursor.py", line 1139, in _refresh
    self.__session = self.__collection.database.client._ensure_session()
  File "/home1/theory/fil/atomate/atomate_env/lib/python3.9/site-packages/pymongo/mongo_client.py", line 1740, in _ensure_session
    return self.__start_session(True, causal_consistency=False)
  File "/home1/theory/fil/atomate/atomate_env/lib/python3.9/site-packages/pymongo/mongo_client.py", line 1685, in __start_session
    self._topology._check_implicit_session_support()
  File "/home1/theory/fil/atomate/atomate_env/lib/python3.9/site-packages/pymongo/topology.py", line 538, in _check_implicit_session_support
    self._check_session_support()
  File "/home1/theory/fil/atomate/atomate_env/lib/python3.9/site-packages/pymongo/topology.py", line 554, in _check_session_support
    self._select_servers_loop(
  File "/home1/theory/fil/atomate/atomate_env/lib/python3.9/site-packages/pymongo/topology.py", line 238, in _select_servers_loop
    raise ServerSelectionTimeoutError(
pymongo.errors.ServerSelectionTimeoutError: localhost:27017: [Errno 111] Connection refused, Timeout: 30s, Topology Description: <TopologyDescription id: 63be813ab4bac8a04a1736e2, topology_type: Unknown, servers: [<ServerDescription ('localhost', 27017) server_type: Unknown, rtt: None, error=AutoReconnect('localhost:27017: [Errno 111] Connection refused')>]>

The status of the flow remains “ready”.
As I understand it, there is a problem connecting to the Mongo database. However, I have configured everything for Mongo, and the server is indeed running locally on the default port 27017. Moreover, the test described on the above-mentioned web page (via x = VaspCalcDb.from_db_file("db.json"); x.reset()) also finishes successfully. Please advise me: what else can I check to find the cause of the problem?

fraricci · January 13, 2023, 5:54pm

Hi Anton,

could you post your launchpad.yaml ?

Anton_F · January 13, 2023, 6:23pm

Hi, Francesco,
Sure

host: 127.0.0.1
port: 27017
name: atomate
username: admin
password: pass
ssl_ca_file: null
logdir: null
strm_lvl: INFO
user_indices: []
wf_user_indices: []

I also tried changing 127.0.0.1 to localhost, but it doesn’t change anything.

Zhuoying · January 14, 2023, 12:41am

Hi Anton,
Usually, we put MongoDB in a supercomputer server, which you can easily check from a MongoDB GUI, such as Robo3T. Here are things you might want to try to solve your problem:

Check your localhost connection status: whether the db (“atomate”) is running.
Do export FW_CONFIG_FILE='/path_to_config/FW_config.yaml' before qlaunch
If you installed atomate in a conda env, conda activate <your atomate env> first.

Anton_F · January 14, 2023, 6:33am

Hi, Zhuoying,
Thank you for the tips. I’ve double-checked: all the settings look OK, but the error remains. I’ve now even checked that I can authenticate to the database from pymongo. My only doubt is this: is the code from …/atomate_env/lib/python3.9/site-packages/pymongo/topology.py, which raises the error, run on the head node of the cluster (where MongoDB is launched), or can it be called from a compute node? In the latter case, could the cause of the error be that the database hosted on the head node is not accessible from another node?

firaty · February 16, 2023, 3:38pm

I don’t know if you’ve solved the problem, but I think your intuition is correct. On the HPC system where we run our calculations, compute nodes cannot access MongoDB servers running on login nodes, so rlaunch cannot access the launchpad. Your test most likely succeeded because you ran it on the same machine where the MongoDB daemon was running. You can check whether this is the case by starting an interactive session (using salloc with slurm, for instance, and then ssh-ing into the node), starting the mongo daemon on the compute node itself, and then running with rlaunch instead of qlaunch.
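
A quick way to see whether a given machine can reach the MongoDB port at all is a plain TCP check, run once on the login node and once inside an interactive session on a compute node. The following is only a minimal sketch using the Python standard library; the helper name can_reach and the host and port values are assumptions for illustration and should be replaced with the values from your launchpad.yaml. Keep in mind that localhost or 127.0.0.1 always refers to the machine the script runs on, so on a compute node it does not point to the login node.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection raises OSError (e.g. connection refused, timeout)
        # when nothing is listening or the port is blocked.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Hypothetical values: use the host and port from your launchpad.yaml.
    host, port = "localhost", 27017
    print(f"{host}:{port} reachable: {can_reach(host, port)}")
```

If this prints True on the login node but False on a compute node, the problem is network access rather than the FireWorks configuration. A True result only shows that the port is open; it does not test authentication.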

Workarounds are either running FireWorks in offline mode, which comes with some drawbacks, or asking the IT staff of your cluster to allow access from compute nodes to either login nodes, or better yet, to a cloud-hosted MongoDB service such as MongoDB Atlas. We went with the latter which is the perfect solution, so that now we can submit jobs with qlaunch to multiple nodes, and each just accesses our MongoDB Atlas cluster.

Anton_F · February 18, 2023, 6:01pm

Dear Firaty!
Yes, I’ve checked that Mongo is inaccessible from a compute node. Thank you for your suggestions, in particular on running Atomate in offline mode. Unfortunately, when I tried the latter, I got the error "More processors requested than permitted", although my_fworker.yaml is correct. If you have faced such an issue, please advise.

Best regards,
Anton.

firaty · February 20, 2023, 10:40am

Hey @Anton_F,

That sounds like a slurm error, for which you might want to take a look at your my_qadapter.yaml file instead. You might be setting your tasks_per_node to something that is not compatible with the QOS/partition you’re submitting your jobs to.

Best,

fengzimin · February 21, 2023, 5:31am

hello anton,
not sure if we had the same problem, but we surely saw the same error messages.
i resolved it when i found out that i had restricted visits from localhost only and used --bind_ip_all option for mongod.
i started daemon on the login node and needed for the compute nodes to access it.
hth
fzm

Anton_F · March 14, 2023, 2:26pm

Dear Firaty,
Please have a look at my my_qadapter.yaml and my_fworker.yaml. To me everything looks correct, and the VASP command works perfectly if I submit it from the command line. I have replaced the full paths with … .

_fw_name: CommonAdapter
_fw_q_type: SLURM
rocket_launch: rlaunch -c .../atomate/config rapidfire
nodes: 2
walltime: 24:00:00
queue: null
account: null
job_name: null
pre_rocket: source .../atomate/atomate_env/bin/activate
post_rocket: null
logdir: .../atomate/logs

name: Main_Worker
category: ''
query: '{}'
env:
    db_file: .../atomate/config/db.json
    vasp_cmd: srun -n 8 --exclusive --error=err.log -o out .../vasp_std
    scratch_dir: null

If you notice anything incorrect, please let me know, as I still can’t get rid of the error "More processors requested than permitted".

Best regards,
Anton.

firaty · March 17, 2023, 10:48am

Hey @Anton_F,

When you say your VASP command works perfectly, do you mean when running VASP directly or with the rlaunch command? If it’s the latter, then I’d say your my_fworker.yaml file is configured correctly. It would also mean there may be something wrong with your my_qadapter.yaml file which only comes into play when you use qlaunch. When you run qlaunch, a directory with a block_ prefix will be generated with directories prefixed with launcher_ inside. You could go into one after trying to run qlaunch to check the file FW_submit.script to see if it contains everything slurm needs to run a job.
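
To make that check less tedious, the generated submit scripts can be scanned for their #SBATCH lines. This is only a rough sketch, assuming the block_*/launcher_*/FW_submit.script layout described above and that it is run from the directory where qlaunch was executed; the function name sbatch_directives is made up for illustration.

```python
from pathlib import Path


def sbatch_directives(script_text: str) -> list[str]:
    """Collect the arguments of all '#SBATCH' lines in a submit script."""
    directives = []
    for line in script_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#SBATCH"):
            # Keep only what follows the '#SBATCH' keyword.
            directives.append(stripped[len("#SBATCH"):].strip())
    return directives


if __name__ == "__main__":
    # Assumption: run from the directory in which qlaunch was executed.
    for script in sorted(Path(".").glob("block_*/launcher_*/FW_submit.script")):
        print(script)
        for directive in sbatch_directives(script.read_text()):
            print("   ", directive)
```

Comparing the printed node and task settings with the process count requested in vasp_cmd may show where more processors are requested than the allocation provides.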

Also, I don’t see a job template file in your my_qadapter.yaml, which might be an issue. Please take a look at this page on qadapter programming if you haven’t already.

Best,