I am struggling to run the magnetic ordering workflow on the Stampede2 cluster. I can generate workflows and upload them to the FireWorks database, but I have not been able to run VASP jobs with the magnetic ordering workflow in particular. Other workflows, such as a relaxation followed by a static calculation, or band structure calculations, work fine. For context, this workflow was generated from the mp-13 Fe structure. The workflow begins, but then fizzles after a timeout. However, I doubt that the time limit is the real problem, since the calculation should be quick for iron.
Here is my my_qadapter.yaml:
_fw_name: CommonAdapter
_fw_q_type: SLURM
rocket_launch: rlaunch -c /home1/09282/devonmse/atomate/config rapidfire
nodes: 1
ntasks_per_node: 64
walltime: 4:00:00
queue: normal
account: TG-MAT210016
job_name: null
mail_type: "START,END"
mail_user: [email protected]
pre_rocket: conda activate mag_order
post_rocket: null
logdir: /home1/09282/devonmse/atomate/logs
In each launcher directory the std_err.txt file repeats the following message:
c418-051.stampede2.tacc.utexas.edu.220217PSM2 can't open hfi unit: -1 (err=23)
[37] MPI startup(): tmi fabric is not available and fallback fabric is not enabled
Additionally, the bottom of each OUTCAR has the following warning:
I interpret this warning as VASP asking me to rerun the job, using the CONTCAR from the faulty job as the POSCAR for a new job. However, that seems awkward to implement for each firework.
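For reference, the manual restart I have in mind would look roughly like the sketch below. It uses only the Python standard library and is not part of atomate or FireWorks. The function name `prepare_restart` and the directory arguments are hypothetical names made up for this illustration, and it assumes the standard VASP input file names (INCAR, KPOINTS, POTCAR) sit in the old launcher directory.

```python
import shutil
from pathlib import Path


def prepare_restart(old_dir, new_dir):
    """Copy VASP inputs to new_dir, using the old CONTCAR as the new POSCAR.

    Hypothetical helper for illustration only; not an atomate/FireWorks API.
    """
    old_dir, new_dir = Path(old_dir), Path(new_dir)
    contcar = old_dir / "CONTCAR"

    # An empty or missing CONTCAR means there is no structure to restart from.
    if not contcar.is_file() or contcar.stat().st_size == 0:
        raise FileNotFoundError(f"No usable CONTCAR in {old_dir}")

    new_dir.mkdir(parents=True, exist_ok=True)

    # Carry over the remaining inputs unchanged, if they are present.
    for name in ("INCAR", "KPOINTS", "POTCAR"):
        src = old_dir / name
        if src.is_file():
            shutil.copy2(src, new_dir / name)

    # The last written structure becomes the starting structure.
    poscar = new_dir / "POSCAR"
    shutil.copy2(contcar, poscar)
    return poscar
```

Doing this by hand for every fizzled firework is exactly the part that seems awkward, which is why I am hoping there is a cleaner way within the workflow itself.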