This is very similar to post 44978 (link). I’m encountering an issue where I can access the LaunchPad/MongoDB via pymongo’s MongoClient and through lpad on the login nodes of the remote HPC where I run calculations, but atomate workflows submitted to a compute node fail with a network timeout while trying to connect to MongoDB.
When I log into my HPC on a login/gateway node, I can access MongoDB through pymongo’s MongoClient without any issues, and lpad get_wflows returns all my active workflows, as expected. However, when I submit an atomate workflow to the queue using qlaunch, all my calculations fail with network timeout errors when attempting a MongoDB connection.
I opened an interactive shell on a compute node of this HPC and verified that creating a MongoClient there produces the following network timeout error, which matches the error reported by the submitted calculations (xxx replaces the IP address of the MongoDB server):
pymongo.errors.ServerSelectionTimeoutError: xxx.xxx.xxx.xxx:27017: timed out, Timeout: 30s, Topology Description: <TopologyDescription id: 653bdf0e420b99dcfb77e5bc, topology_type: Unknown, servers: [<ServerDescription ('xxx.xxx.xxx.xxx', 27017) server_type: Unknown, rtt: None, error=NetworkTimeout('xxx.xxx.xxx.xxx:27017: timed out')>]>
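A minimal sketch of a check that separates a plain network block from a pymongo or FireWorks problem, assuming only the Python standard library. The helper name can_reach is hypothetical, and 27017 is the MongoDB port from the error above. If this returns True on a login node and False on a compute node, the block is most likely at the network level rather than in the MongoDB or FireWorks configuration.

```python
import socket


def can_reach(host: str, port: int = 27017, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    This does not speak the MongoDB protocol; it only tests whether the
    port is reachable from the machine the script runs on.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and name-resolution failures.
        return False
```

Running something like print(socket.gethostname(), can_reach("xxx.xxx.xxx.xxx")) once on a login node and once in an interactive shell on a compute node gives a direct comparison, with the placeholder replaced by the real server address.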
Similarly, if I activate my atomate environment and execute “lpad get_wflows”, I get the following:
ValueError: FireWorks was not able to connect to MongoDB at xxx.xxx.xxx.xxx:27017. Is the server running? The database file specified was /home1/09341/jamesgil/atomate/config/my_launchpad.yaml.
I can confirm that I’ve whitelisted access from all IP addresses to my MongoDB server, and that I activate my atomate environment as a pre-launch step in my_launchpad.yaml.
I suspect that this is caused by the university firewall. To enable connections from the HPC login nodes to the MongoDB server, which is hosted locally on the university network, I had to explicitly ask for the firewall to be opened for connections from the HPC login nodes to the server IP. I wonder if this error occurs because the compute nodes have different IP addresses (which I verified in my HPC user guide) and the firewall is blocking connections from the compute nodes to the university IP. On the other hand, I thought that all external communication was handled through the login node.
This leads me to an important question about FireWorks: when a job is launched via qlaunch, is the LaunchPad connection made from the IP address of the login node from which it was launched, or from the compute node that it runs on? How is FireWorks supposed to handle communication between the login/compute node and the MongoDB server, or does it not handle it at all?
I’m trying to isolate whether this is FireWorks behavior or HPC behavior, and whether I need to go through the trouble of opening further firewall connections from the compute node IP addresses to our local MongoDB server. Thanks in advance for your help and insight!
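For comparing which address each node type would present to the firewall, here is a minimal sketch, again assuming only the Python standard library. The helper name outbound_source_ip is hypothetical. It reports the local address the operating system would pick for traffic toward the server; if the cluster sits behind NAT or a proxy, the firewall may see a different address, so the result is a hint rather than proof.

```python
import socket


def outbound_source_ip(host: str, port: int = 27017) -> str:
    """Return the local IP the OS would use as the source for traffic to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        # Connecting a UDP socket sends no packets; it only selects a route,
        # which fixes the local address reported by getsockname().
        s.connect((host, port))
        return s.getsockname()[0]
```

Calling this with the MongoDB server address on a login node and on a compute node should show whether the two use different source addresses.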