Qlaunch works fine on one node, but fails on two

mwo · December 23, 2020, 9:52am

Hi,
I have been running workflows on my cluster regularly for testing, and everything works as expected as long as I run on a single node (2 sockets, each with a 24-core CPU; the CPUs support hyper-threading, but I do not want to use it).

However, when I request a job on two nodes and double the number of MPI processes in my vasp_cmd, I run into problems. The job starts without errors, but all of the VASP MPI processes run on a single node while the second node sits idle. The same vasp_cmd in a “normal” job submitted by hand runs fine on both requested nodes. To be clear, VASP does start, but it stops making progress almost immediately and locks up the node it is running on. If I ssh into the second compute node, top shows that nothing is running there.
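
One way to confirm where the ranks actually land is to launch a trivial command such as `hostname` with the same launcher and count the output lines per host. A minimal sketch, assuming the launcher output has been captured as text with one hostname per line; the node names `n401` and `n402` and the function name are made up for illustration:

```python
from collections import Counter


def ranks_per_host(launcher_output: str) -> dict:
    """Count how many lines (one per MPI rank) each hostname produced."""
    hosts = [line.strip() for line in launcher_output.splitlines() if line.strip()]
    return dict(Counter(hosts))


# Hypothetical output of e.g. `mpirun -np 4 hostname` on a two-node allocation.
sample = "n401\nn401\nn402\nn402\n"
print(ranks_per_host(sample))
```

If every line carries the same hostname, the launcher is not seeing the second node of the allocation.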

I have the following package versions:
pymatgen 2020.10.9.1
fireworks 1.9.6
atomate 0.9.5

One possible complication is that, for now, I cannot reach a database on the web from the compute nodes because of security restrictions. I am working with the cluster administrators on a solution, but at the moment I run the database in a conda environment on the compute node itself, which is why the conda and mongod commands appear in my pre_rocket and post_rocket lines. However, I suspect that this is not the main issue, since the rocket launch succeeds and the only problem is that VASP runs on just one of the two nodes requested in the job submission.

As you can see below, I specify both 48 tasks per node and 1 task per core (no hyper-threading), so I do not see how all the tasks end up on one node.
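
The arithmetic behind that expectation, as a small sketch using the values from my configuration (the function name is only illustrative):

```python
def expected_mpi_ranks(nodes: int, tasks_per_node: int) -> int:
    """Total number of MPI ranks the allocation should provide."""
    return nodes * tasks_per_node


# Values taken from the nodes and tasks_per_node settings in this post.
total = expected_mpi_ranks(nodes=2, tasks_per_node=48)
print(total)  # 96, matching the -np value in vasp_cmd
```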

Here is my_qadapter.yaml:

_fw_name: CommonAdapter
_fw_q_type: SLURM
_fw_template_file: /home/fs71332/mwo4/FireWorks/config/slurm_job_template.txt
rocket_launch: rlaunch -c /home/fs71332/mwo4/FireWorks/config rapidfire --timeout 172800
nodes: 2
walltime: 72:00:00
queue: mem_0096
qos: mem_0096
account: p71332
tasks_per_core: 1
tasks_per_node: 48
job_name: TriboFlow
pre_rocket: module purge; module load intel/19.0.5 intel-mkl/2019.5 openmpi/3.1.4-intel-19.0.5.281-lzrjnd7; conda activate TriboFlow; sleep 60; mongod -f $DATA/mongo/mongod.conf; lpad get_wflows
post_rocket: mongod --shutdown --dbpath $DATA/mongo/data/db; conda deactivate; module purge
logdir: /home/fs71332/mwo4/FireWorks/logs

and my slightly modified SLURM template:

#!/bin/bash -l

#SBATCH --nodes=$${nodes}
#SBATCH --ntasks-per-node=$${tasks_per_node}
#SBATCH --time=$${walltime}
#SBATCH --partition=$${queue}
#SBATCH --qos=$${qos}
#SBATCH --ntasks-per-core=$${tasks_per_core}
#SBATCH --account=$${account}
#SBATCH --job-name=$${job_name}
#SBATCH --output=TriboFlow.out
#SBATCH --error=TriboFlow.error

$${pre_rocket}
cd $${launch_dir}
$${rocket_launch}
$${post_rocket}

# CommonAdapter (SLURM) completed writing Template
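
As far as I understand, the `$${...}` placeholders are filled from the keys in my_qadapter.yaml by a `string.Template`-style substitution that uses `$$` as its delimiter. The following is a stand-alone sketch of that idea, not FireWorks' own code; the class name is hypothetical:

```python
from string import Template


class DoubleDollarTemplate(Template):
    """Template whose placeholders look like $${name} (illustrative class)."""

    delimiter = "$$"


# Two header lines from the template above, with values from my_qadapter.yaml.
header = "#SBATCH --nodes=$${nodes}\n#SBATCH --ntasks-per-node=$${tasks_per_node}"
values = {"nodes": 2, "tasks_per_node": 48}

rendered = DoubleDollarTemplate(header).substitute(values)
print(rendered)
```

With these values the rendered header requests two nodes with 48 tasks each, which is what I expect the submitted job script to contain.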

and here is my_fworker.yaml:

name: vsc4
category: ''
query: '{}'
env:
    db_file: /home/fs71332/mwo4/FireWorks/config/db.json
    vasp_cmd: mpirun -np 96 vasp.6_vsc4_std
    scratch_dir: /gpfs/data/fs71332/mwo4/WORK
    vdw_kernel_dir: /home/fs71332/mwo4/vasp_lib/vdw_kernel
    incar_update:
        KPAR: 4
        NCORE: 4

Do I have to change the rocket_launch somehow? Have I forgotten to configure something else? I would be very grateful for any help and advice.
Happy holidays, Michael