MongoDB/launchpad accessible on login nodes but not compute nodes of HPC

This is very similar to post 44978 (link). I’m encountering a mysterious issue: I can access the LaunchPad/MongoDB via pymongo’s MongoClient and through lpad on the login nodes of the remote HPC where I run calculations, but Atomate workflows submitted to a compute node fail with a network timeout while trying to connect to MongoDB.

When I log into my HPC on a login/gateway node, I can access MongoDB through pymongo’s MongoClient without any issues, and lpad get_wflows returns all my active workflows, as expected. However, when I submit an Atomate workflow to the queue using qlaunch, all my calculations fail with network timeout errors when attempting a MongoDB connection.

I opened an interactive shell on a compute node of this HPC and verified that, when I try to connect via MongoClient, I get the following network timeout error, which matches the error reported by the submitted calculations (xxx replaces the IP address of the MongoDB server):

pymongo.errors.ServerSelectionTimeoutError: xxx.xxx.xxx.xxx:27017: timed out, Timeout: 30s, Topology Description: <TopologyDescription id: 653bdf0e420b99dcfb77e5bc, topology_type: Unknown, servers: [<ServerDescription ('xxx.xxx.xxx.xxx', 27017) server_type: Unknown, rtt: None, error=NetworkTimeout('xxx.xxx.xxx.xxx:27017: timed out')>]>
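The same symptom can be reproduced without pymongo at all. A minimal stdlib-only TCP probe (a sketch; the xxx host below is a placeholder for the MongoDB server’s IP) helps distinguish a firewall/routing block from a MongoDB-level problem:

```python
import socket

def can_reach(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # OSError covers timeouts, refused connections, and failed name
        # resolution -- all of which look like "unreachable" here.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Placeholder address: substitute the real MongoDB server IP.
# False on a compute node but True on a login node would point at the
# network path, not at pymongo/FireWorks:
# can_reach("xxx.xxx.xxx.xxx", 27017)
```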

Similarly, if I activate my atomate environment and execute “lpad get_wflows”, I get the following:

ValueError: FireWorks was not able to connect to MongoDB at xxx.xxx.xxx.xxx:27017. Is the server running? The database file specified was /home1/09341/jamesgil/atomate/config/my_launchpad.yaml.
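For reference, the file FireWorks is pointing at holds the connection settings it uses from whichever node it runs on. A minimal my_launchpad.yaml sketch looks like the following (values are placeholders; the database name and credentials shown are hypothetical):

```yaml
# my_launchpad.yaml -- sketch with placeholder values
host: xxx.xxx.xxx.xxx   # MongoDB server address the connecting node must reach
port: 27017
name: fireworks_db      # hypothetical database name
username: fw_user       # hypothetical credentials
password: secret
```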

I can confirm that I’ve whitelisted access from all IP addresses to my MongoDB server, and that I activate my atomate environment as a pre-launch step in my_launchpad.yaml.

I have a SUSPICION that this is the university firewall. To enable connections from the HPC login nodes to the MongoDB server, which is hosted locally on university premises, I had to explicitly ask for the firewall to be opened for connections from the login nodes to the server IP. I wonder whether this error is happening because the compute nodes have different IP addresses (which I verified in my HPC user guide) and the firewall is blocking connections from the compute nodes to the university network. However, I thought that all external communication was handled through the login node.

This leads to an important question about FireWorks: when a workflow is launched via qlaunch, is the LaunchPad connection made from the IP address of the login node it was launched from, or from the compute node it runs on? How is FireWorks supposed to handle communication between the login/compute nodes and the MongoDB server, or does it not handle it at all? I’m trying to isolate whether this is FireWorks functionality or HPC functionality, and whether I need to go through the trouble of opening additional firewall rules from the compute node IP addresses to our local MongoDB server. Thanks in advance for your help and insight!

You’ll likely need to talk with your university cluster administrators. On many machines, when a job runs, the initial steps of the job submission script execute on an intermediate node (e.g., a MOM node) that has network access. In that case, all the Python portions of the job, including database access, happen on a node with network access. Once an “mpirun” or similar command is initiated, however, those tasks run on compute nodes without network access, and control returns to the intermediate node when the MPI process completes. It is OK that those compute nodes lack network access, since the MPI process is just running VASP, not communicating with the database.

You would need to check whether your university cluster operates similarly. If it supports interactive jobs, this can sometimes be probed that way.
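A quick way to run that probe from an interactive job is sketched below; the srun syntax assumes Slurm and the xxx address is a placeholder, so adjust both for your site:

```shell
# First open an interactive shell on a compute node (Slurm assumed;
# use your site's equivalent, e.g. "qsub -I" under PBS):
#   srun --pty bash -l
#
# Then test raw TCP reachability to the MongoDB host (placeholder IP).
# "unreachable" here, combined with "reachable" on a login node, points
# at firewall/routing rather than at FireWorks itself.
timeout 10 bash -c 'echo > /dev/tcp/xxx.xxx.xxx.xxx/27017' 2>/dev/null \
  && echo reachable || echo unreachable
```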

Ah, I see. Thank you for the fast response and the explanation - I’ll reach out to the cluster administrators. So it sounds like this is not handled by FireWorks at all? In theory, all python processing prior to the MPI/VASP job should be handled on a network-connected node?

In theory, all python processing prior to the MPI/VASP job should be handled on a network-connected node?

Well, technically not all the Python processing, but yes: there is processing both before and after the MPI job that runs in Python and requires a connection to the server, and when we run, all the Python portions usually run on such nodes. Even setting FireWorks aside (FireWorks also has an “offline” mode), you need a network connection if you simply want a Python job (FireWorks or not) to parse the results and then store them in the database.