Connection error between cluster and MongoDB (on local PC - WSL)

Florian_Gimbert · August 20, 2021, 1:34am

Hi,

I am trying to use MongoDB on my local PC (in WSL) while the jobs run on a supercomputer. I installed MongoDB in WSL and configured all the necessary FireWorks files on both the local PC and the supercomputer.

I open an ssh tunnel between the PC and the supercomputer with ssh -R 27017:127.0.0.1:27017 user@ipsupercomputer

When creating a workflow on local PC, I can access it without problem on supercomputer with lpad get_wflows

(base) [f-gimbert@atlas ATOMATE_TESTS]$ lpad get_wflows
{
    "state": "READY",
    "name": "Co–1",
    "created_on": "2021-08-20T01:00:06.479000",
    "states_list": "W-W-W-REA"
}

The problem occurs when I try to run the job with qlaunch rapidfire (I am not sure it is the best command).

(base) [f-gimbert@atlas ATOMATE_TESTS]$ qlaunch rapidfire
2021-08-20 10:01:39,204 INFO getting queue adapter
2021-08-20 10:01:39,209 INFO Created new dir /home/f-gimbert/ATOMATE_TESTS/block_2021-08-20-01-01-39-204692
2021-08-20 10:01:39,219 INFO The number of jobs currently in the queue is: 0
2021-08-20 10:01:39,219 INFO 0 jobs in queue. Maximum allowed by user: 0
2021-08-20 10:01:40,946 INFO Launching a rocket!
2021-08-20 10:01:40,957 INFO Created new dir /home/f-gimbert/ATOMATE_TESTS/block_2021-08-20-01-01-39-204692/launcher_2021-08-20-01-01-40-952356
2021-08-20 10:01:40,957 INFO moving to launch_dir /home/f-gimbert/ATOMATE_TESTS/block_2021-08-20-01-01-39-204692/launcher_2021-08-20-01-01-40-952356
2021-08-20 10:01:40,958 INFO submitting queue script
2021-08-20 10:01:40,971 INFO Job submission was successful and job_id is 51108
2021-08-20 10:01:40,971 INFO Sleeping for 5 seconds…zzz…
2021-08-20 10:01:45,984 INFO Launching a rocket!
2021-08-20 10:01:45,995 INFO Created new dir /home/f-gimbert/ATOMATE_TESTS/block_2021-08-20-01-01-39-204692/launcher_2021-08-20-01-01-45-991402
2021-08-20 10:01:45,996 INFO moving to launch_dir /home/f-gimbert/ATOMATE_TESTS/block_2021-08-20-01-01-39-204692/launcher_2021-08-20-01-01-45-991402
2021-08-20 10:01:45,997 INFO submitting queue script
2021-08-20 10:01:46,009 INFO Job submission was successful and job_id is 51109
2021-08-20 10:01:46,010 INFO Sleeping for 5 seconds…zzz…
(I killed the process here)

When I checked the error file for one launch, I found this:

(base) [f-gimbert@atlas launcher_2021-08-20-01-01-40-952356]$ more FeXbo_4.e51108
Traceback (most recent call last):
  File "/home/f-gimbert/miniconda3/bin/rlaunch", line 8, in <module>
    sys.exit(rlaunch())
  File "/home/f-gimbert/miniconda3/lib/python3.7/site-packages/fireworks/scripts/rlaunch_run.py", line 141, in rlaunch
    timeout=args.timeout, local_redirect=args.local_redirect)
  File "/home/f-gimbert/miniconda3/lib/python3.7/site-packages/fireworks/core/rocket_launcher.py", line 98, in rapidfire
    while (skip_check or launchpad.run_exists(fworker)) and time_ok():
  File "/home/f-gimbert/miniconda3/lib/python3.7/site-packages/fireworks/core/launchpad.py", line 781, in run_exists
    return bool(self._get_a_fw_to_run(query=q, checkout=False))
  File "/home/f-gimbert/miniconda3/lib/python3.7/site-packages/fireworks/core/launchpad.py", line 1074, in _get_a_fw_to_run
    sort=sortby)
  File "/home/f-gimbert/miniconda3/lib/python3.7/site-packages/pymongo/collection.py", line 1328, in find_one
    for result in cursor.limit(-1):
  File "/home/f-gimbert/miniconda3/lib/python3.7/site-packages/pymongo/cursor.py", line 1238, in next
    if len(self.__data) or self._refresh():
  File "/home/f-gimbert/miniconda3/lib/python3.7/site-packages/pymongo/cursor.py", line 1130, in _refresh
    self.__session = self.__collection.database.client._ensure_session()
  File "/home/f-gimbert/miniconda3/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1935, in _ensure_session
    return self.__start_session(True, causal_consistency=False)
  File "/home/f-gimbert/miniconda3/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1883, in __start_session
    server_session = self._get_server_session()
  File "/home/f-gimbert/miniconda3/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1921, in _get_server_session
    return self._topology.get_server_session()
  File "/home/f-gimbert/miniconda3/lib/python3.7/site-packages/pymongo/topology.py", line 520, in get_server_session
    session_timeout = self._check_session_support()
  File "/home/f-gimbert/miniconda3/lib/python3.7/site-packages/pymongo/topology.py", line 502, in _check_session_support
    None)
  File "/home/f-gimbert/miniconda3/lib/python3.7/site-packages/pymongo/topology.py", line 220, in _select_servers_loop
    (self._error_message(selector), timeout, self.description))
pymongo.errors.ServerSelectionTimeoutError: localhost:27017: [Errno 111] Connection refused, Timeout: 30s, Topology Description: <TopologyDescription id: 611efef6c203b3222b57a3ec, topology_type: Single, servers: [<ServerDescription ('localhost', 27017) server_type: Unknown, rtt: None, error=AutoReconnect('localhost:27017: [Errno 111] Connection refused')>]>

It looks like the job could not connect to MongoDB, even though I can still read workflows with lpad get_wflows:

(base) [f-gimbert@atlas ATOMATE_TESTS]$ lpad get_wflows
{
    "state": "READY",
    "name": "Co–1",
    "created_on": "2021-08-20T01:00:06.479000",
    "states_list": "W-W-W-REA"
}

I tried the -c option for qlaunch, but the result is the same. I also used a Python script to launch the job, without success, and qlaunch singleshot gives the same error.
I also tried modifying the admin user in MongoDB, but nothing changed.

I am lost, so any help is welcome !

Best regards
Florian

My db.json / my_launchpad.yaml files on local PC / supercomputer :

{"host": "localhost", "port": 27017, "database": "Fireworks", "collection": "tasks", "admin_user": admin, "admin_password": password, "readonly_user": user, "readonly_password": password, "aliases": {}}

my_launchpad.yaml

host: localhost
logdir: null
mongoclient_kwargs: {}
name: Fireworks
password: password
port: 27017
ssl_ca_file: null
strm_lvl: INFO
user_indices:
username: admin
wf_user_indices:

I am lost.

cosmo · August 22, 2021, 6:57pm

Perhaps the ssh tunnel is not sufficient for the supercomputer nodes to be able to access your MongoDB server. (BTW, Fireworks will need read/write access to MongoDB for the entire workflow run.)

Try setting up your LaunchPad on MongoDB Atlas servers and using my_launchpad.yaml files that point to it.

rkingsbury · August 23, 2021, 5:47am

@Florian_Gimbert is there a VPN involved at all in your setup? I know that WSL has some odd and non-obvious VPN issues where not all Windows network adapters talk to WSL. For example, with Cisco VPNs you have to use the client installed from the Windows Store; the one you download separately cannot talk to WSL.

This may be a tangent, but it could be worth doing some googling on ssh tunnels within WSL to see if there are any oddities with how visible WSL is to various network services.

Florian_Gimbert · September 6, 2021, 4:29am

Thank you very much for the replies. There is no VPN in my setup. I still don’t understand why lpad get_wflows works fine but qlaunch singleshot (or any other job launch command) does not, since both call the same launchpad function. I looked further into ssh and WSL but couldn’t find any clue.

For now, I am using MongoDB Atlas again while I wait to try installing MongoDB on a local server.

Anubhav_Jain · September 7, 2021, 6:05pm

Hi Florian,

Skimming the thread, it seems you are OK connecting to MongoDB from both your local computer as well as the “head node” of the supercomputer. The latter because both “get_wflows” and “qlaunch” seem to operate OK from your messages.

When your job actually runs from the scheduler, it wakes up on whatever node got assigned and runs the “rlaunch” command specified in the queue_adapter config file. Two possible problems here are:

1. Whatever node is actually running the job cannot connect to MongoDB. This may be possible if your tunnel does not go to that compute node. I believe this is basically what @cosmo said. No easy solutions here apart from talking to your supercomputing sysadmins to see if there is a way for your compute nodes to also be able to tunnel or connect to your MongoDB instance via reconfiguring any firewalls, etc.
2. Within your “my_qadapter.yaml” file, the “rlaunch” command in there is not set up to connect to the correct database and/or doesn’t have the right credentials. While I don’t think this is the case, you should double check that the rlaunch command (and not the qlaunch command) has the correct “-c” flag. See the modifications in step #5 of the “Launch Rockets through a queue” page in the FireWorks 1.9.7 documentation and please confirm those are set correctly.
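For the second possibility, one quick way to see what the queued job will actually run is to take the rlaunch command string from my_qadapter.yaml (typically the rocket_launch entry) and list the configuration flags it passes. The snippet below is only a rough sketch: the helper name rlaunch_config_flags and the example paths are made up for illustration, and it assumes the usual rlaunch options -c/--config_dir, -l/--launchpad_file and -w/--fworker_file given as separate "flag value" pairs.

```python
import shlex

# rlaunch options that say where the configuration comes from
# (assumed spelling; check `rlaunch --help` on your installation).
CONFIG_FLAGS = {
    "-c": "config_dir", "--config_dir": "config_dir",
    "-l": "launchpad_file", "--launchpad_file": "launchpad_file",
    "-w": "fworker_file", "--fworker_file": "fworker_file",
}


def rlaunch_config_flags(rocket_launch):
    """Return {option_name: value} for the config flags found in an rlaunch command string."""
    tokens = shlex.split(rocket_launch)
    found = {}
    # Stop one token early so tokens[i + 1] always exists.
    for i, tok in enumerate(tokens[:-1]):
        if tok in CONFIG_FLAGS:
            found[CONFIG_FLAGS[tok]] = tokens[i + 1]
    return found


# Hypothetical command strings, as they might appear in my_qadapter.yaml
print(rlaunch_config_flags("rlaunch -c /home/user/fw_config singleshot"))
print(rlaunch_config_flags("rlaunch singleshot"))
```

An empty result for the command in my_qadapter.yaml would mean rlaunch falls back to its default configuration lookup on the compute node, which may not be the launchpad file you expect.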

Good luck!

Anubhav_Jain · September 7, 2021, 6:11pm

I might also add that one way to debug such issues is to submit an “interactive” job on your supercomputer if their policy allows. An interactive job will allow you to directly type commands on a compute node (vs head / login node). If you have an interactive job session, you can type commands like “lpad get_wflows” and see if the compute node is able to connect. If not, you can start to debug things like config files and firewalls (sometimes the problem is on the compute node side, sometimes it is because your own database is not configured to properly accept incoming connections, etc.).
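If Python is available in such an interactive session, a plain TCP check can separate "the port is not reachable from this node" from FireWorks or credential problems. This is a minimal sketch using only the standard library; the function name can_connect is made up, and the host and port are the ones from the my_launchpad.yaml posted above.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused (Errno 111), timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Values taken from my_launchpad.yaml: host localhost, port 27017.
    print("MongoDB port reachable:", can_connect("localhost", 27017))
```

Running it on the login node and then on a compute node shows whether the tunnel endpoint is visible from both. With ssh -R the forwarded port is opened on the machine you logged into, so "localhost" on a compute node would normally refer to a different machine.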

Florian_Gimbert · October 25, 2021, 6:40am

Thank you very much for all your advice. I admit I gave up on the ssh tunnel configuration after several tests and went back to my MongoDB Atlas cloud account, which was working fine.

But I am encountering a new problem: after the cluster OS was updated to CentOS 7, my process using the remote MongoDB doesn’t work anymore. It looks very similar to the previous problem. I can connect to MongoDB from the head node with lpad get_wflows, but I can’t connect from a compute node.
When I try to start a job (with qlaunch singleshot), I get this error:

pymongo.errors.ConfigurationError: The DNS operation timed out after 20.20275568962097 seconds

I suppose that with CentOS 7 the compute nodes no longer have access to the network. Since it was working fine under the previous CentOS 6, the configuration files (db.json, my_launchpad.yaml) should be OK.

Have you already encountered this error? Or do you have any idea about its origin?

I checked for a firewall, but none is running. There is no real admin for the supercomputer, so it is quite difficult to find a solution!
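A DNS timeout like this can be checked independently of pymongo by asking the node to resolve a hostname. The sketch below is only an illustration under some assumptions: the function name resolves is made up, it performs an ordinary address lookup rather than the SRV lookup that a mongodb+srv:// connection string triggers, and HOSTNAME has to be replaced by the host from your own connection string.

```python
import socket


def resolves(hostname):
    """Return the sorted IP addresses hostname resolves to, or [] if the lookup fails."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Placeholder: replace with the hostname from your MongoDB connection string.
    HOSTNAME = "localhost"
    print(HOSTNAME, "->", resolves(HOSTNAME) or "no answer")
```

If the same hostname resolves on the head node but not on a compute node, the problem is in the node's network or resolver setup rather than in db.json or my_launchpad.yaml.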

Florian_Gimbert · October 26, 2021, 6:11am

I found a solution! The problem was the gateway IP defined on the compute node. I changed it with ip route; it is not the best solution (it has to be redone after each reboot), but at least it works.
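To see which gateway a compute node is using before and after such a change, the default route can be read from the output of ip route show. This is a small sketch, not part of the original fix: the function name default_gateway is made up, and the sample text uses documentation addresses rather than real ones.

```python
def default_gateway(ip_route_output):
    """Return the gateway address of the first 'default via ...' route, or None."""
    for line in ip_route_output.splitlines():
        fields = line.split()
        if fields[:1] == ["default"] and "via" in fields:
            idx = fields.index("via")
            if idx + 1 < len(fields):
                return fields[idx + 1]
    return None


# Made-up sample in the format printed by `ip route show`
SAMPLE = """\
default via 192.0.2.1 dev eth0
192.0.2.0/24 dev eth0 proto kernel scope link src 192.0.2.15
"""
print(default_gateway(SAMPLE))  # 192.0.2.1
```

On a real node the text could come from subprocess.run(["ip", "route", "show"], capture_output=True, text=True).stdout, and comparing the result between the head node and a compute node shows whether they route through the same gateway.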

As for the ssh tunnel, I will try again later if I have the courage.