VasprunXMLValidator error in the middle of a workflow

Hello,

I have been trying to get wf_elastic_constant() working on the supercomputing cluster I use (TACC Stampede2), but have faced a series of problems, the latest being a VasprunXMLValidator error in the middle of the workflow. I have successfully submitted and analyzed the results from a bandstructure workflow and a structure optimization workflow on this cluster, so I am certain that VASP and MongoDB work properly. However, I have run into greater issues with more complex workflows. Just to provide some context, the first problem I ran into with the wf_elastic_constant() workflow was an IMPI environment failure.

In the vasp.out.gz:

c401-001.stampede2.tacc.utexas.edu.61754hfp_gen1_context_open: hfi_userinit_internal: failed, trying again (1/3)
c401-001.stampede2.tacc.utexas.edu.61754hfi_userinit_internal: assign_context command failed: Device or resource busy
c401-001.stampede2.tacc.utexas.edu.61754hfp_gen1_context_open: hfi_userinit_internal: failed, trying again (2/3)
c401-001.stampede2.tacc.utexas.edu.61754hfi_userinit_internal: assign_context command failed: Device or resource busy
c401-001.stampede2.tacc.utexas.edu.61754hfp_gen1_context_open: hfi_userinit_internal: failed, trying again (3/3)
TACC:  MPI job exited with code: 254
TACC:  Shutdown complete. Exiting.

and in the std_err.gz:

c401-001.stampede2.tacc.utexas.edu.61737PSM2 can't open hfi unit: -1 (err=23)
c401-001.stampede2.tacc.utexas.edu.61757PSM2 can't open hfi unit: -1 (err=23)
c401-001.stampede2.tacc.utexas.edu.61770PSM2 can't open hfi unit: -1 (err=23)
[12] MPI startup(): tmi fabric is not available and fallback fabric is not enabled
[1] MPI startup(): tmi fabric is not available and fallback fabric is not enabled
[23] MPI startup(): tmi fabric is not available and fallback fabric is not enabled

I am working with a colleague who uses the same cluster as I do and has faced the exact same issue, but he is working with the magnetic orderings workflow (he has posted here). Both of us find that this error seems to occur during the beginning stages of a complex workflow, but I have found it to occur in two distinct places:

  • During the structure optimization, after custodian corrects the INCAR parameters twice (first changing ISMEAR = -5 to ISMEAR = 2 with SIGMA = 0.2 for metals through the IncorrectSmearingHandler, followed by a reduction of SIGMA to 0.14 by the LargeSigmaHandler). The structure optimization completes successfully in the first round with ISMEAR = -5, but the VASP calculation is resubmitted by custodian with ISMEAR = 2 and SIGMA = 0.2. In the middle of this second calculation, the optimization stops abruptly (presumably because the entropy term is too large, triggering the LargeSigmaHandler), and a third calculation begins with SIGMA = 0.14. During this third calculation, the above error was triggered.

  • During the start of the deformation steps - this was during a run where, prior to submitting the workflow, I altered the ISMEAR and SIGMA parameters preemptively to be better suited for metals, so that custodian would not need to operate on the structure optimization (one way to do this at workflow-construction time is sketched just below this list). This time, the structure optimization completed, but the above error was triggered during the deformation steps (of which there are 23).
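
One way to pre-set the smearing at workflow-construction time looks roughly like the sketch below. This is only an illustration: the mp-id is a placeholder for the vanadium structure I pulled from the Materials Project, and the exact SIGMA I used may have differed; add_modify_incar simply pushes an incar_update onto every VASP firework.

from pymatgen.ext.matproj import MPRester
from atomate.vasp.powerups import add_modify_incar
from atomate.vasp.workflows.presets.core import wf_elastic_constant
from fireworks import LaunchPad

# Sketch only: pre-set metal-appropriate smearing on every VASP firework so
# custodian does not need to correct ISMEAR/SIGMA during the optimization.
with MPRester() as mpr:                       # needs a configured MP API key
    structure = mpr.get_structure_by_material_id("mp-146")  # placeholder id

wf = wf_elastic_constant(structure)
wf = add_modify_incar(wf, modify_incar_params={"incar_update": {"ISMEAR": 2, "SIGMA": 0.2}})

lpad = LaunchPad.auto_load()                  # uses my_launchpad.yaml from the config dir
lpad.add_wf(wf)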

Based on this behavior, I believe the error is related to overloading a compute node, either through the number of tasks assigned to it or through its memory being exhausted.

Between messaging with the TACC staff and the atomate developers, we hypothesized that this error is caused either by more than 48 tasks being assigned to the same compute node, or by the VASP environment not propagating correctly to new compute shells, resulting in a corrupted IMPI environment. I tried to resolve this issue by setting my qadapter.yaml as follows:


_fw_name: CommonAdapter
_fw_q_type: SLURM
rocket_launch: rlaunch -c /home1/09341/jamesgil/atomate/config rapidfire --max_loops 3
nodes: 1
ntasks_per_node: 20
walltime: 48:00:00
queue: normal
account: TG-MAT210016
job_name: null
mail_type: "START,END"
mail_user: [email protected]
pre_rocket: conda activate atomate_env2; module load intel/17.0.4; module load impi/17.0.3; module load vasp/6.3.0
post_rocket: conda deactivate; module purge
logdir: /home1/09341/jamesgil/atomate/logs

The changes I made were as follows:

  • I set my ntasks_per_node to 20 to prevent overloading a single node
  • I added “module load” statements for all the modules I need to run VASP on a compute node to the pre_rocket line, ensuring that any new shell opened to run a job script loads these modules

Sorry for the extensive background, but I want to give some proper context before introducing my latest error:

After making these changes, I am now getting a VasprunXMLValidator error in my FW_job-#####.error file (this is running wf_elastic_constant() with pure metallic vanadium, taken from the Materials Project, as the structure input). My two global output files are as follows:

FW_job-######.out:

2023-06-12 10:42:36,272 INFO Hostname/IP lookup (this will take a few seconds)
2023-06-12 10:43:23,531 INFO Created new dir /scratch/09341/jamesgil/atomate_test/elasticRegPrintVASPcmd/launcher_2023-06-12-15-43-23-528355
2023-06-12 10:43:23,534 INFO Launching Rocket
2023-06-12 10:43:24,204 INFO RUNNING fw_id: 551 in directory: /scratch/09341/jamesgil/atomate_test/elasticRegPrintVASPcmd/launcher_2023-06-12-15-43-23-528355
2023-06-12 10:43:24,417 INFO Task started: FileWriteTask.
2023-06-12 10:43:24,425 INFO Task completed: FileWriteTask
2023-06-12 10:43:24,504 INFO Task started: {{atomate.vasp.firetasks.write_inputs.WriteVaspFromIOSet}}.
2023-06-12 10:43:24,837 INFO Task completed: {{atomate.vasp.firetasks.write_inputs.WriteVaspFromIOSet}}
2023-06-12 10:43:24,917 INFO Task started: {{atomate.vasp.firetasks.run_calc.RunVaspCustodian}}.
vasp command loaded into custodian:  ['ibrun', 'vasp_std']
2023-06-12 11:03:04,603 INFO Rocket finished
2023-06-12 11:03:05,065 INFO Sleeping for 60 secs
2023-06-12 11:04:05,126 INFO Checking for FWs to run...
2023-06-12 11:04:05,389 INFO Sleeping for 60 secs
2023-06-12 11:05:05,450 INFO Checking for FWs to run...
2023-06-12 11:05:05,745 INFO Sleeping for 60 secs
2023-06-12 11:06:05,806 INFO Checking for FWs to run...

NOTE: the line "vasp command loaded into custodian: " is a print statement I wrote in atomate/vasp/firetasks/run_calc.py to see what vasp command was actually passed into custodian, as per the recommendation here
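
For reference, the normalization that turns the vasp_cmd string into that list can be reproduced in isolation with something like the sketch below (paraphrasing what RunVaspCustodian does; this is not the exact atomate source):

import os
import shlex

# Standalone sketch (paraphrased, not the exact atomate code) of how the
# vasp_cmd string defined in my_fworker.yaml ends up as the list that
# custodian receives.
vasp_cmd = "ibrun vasp_std"        # what >>vasp_cmd<< resolves to on Stampede2
if isinstance(vasp_cmd, str):
    vasp_cmd = os.path.expandvars(vasp_cmd)
    vasp_cmd = shlex.split(vasp_cmd)
print("vasp command loaded into custodian: ", vasp_cmd)   # -> ['ibrun', 'vasp_std']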

FW_job-######.error:

Traceback (most recent call last):
  File "/home1/09341/jamesgil/mambaforge/envs/atomate_env2/lib/python3.9/site-packages/custodian/vasp/validators.py", line 36, in check
    Vasprun("vasprun.xml")
  File "/home1/09341/jamesgil/mambaforge/envs/atomate_env2/lib/python3.9/site-packages/pymatgen/io/vasp/outputs.py", line 368, in __init__
    self._parse(
  File "/home1/09341/jamesgil/mambaforge/envs/atomate_env2/lib/python3.9/site-packages/pymatgen/io/vasp/outputs.py", line 481, in _parse
    raise ex
  File "/home1/09341/jamesgil/mambaforge/envs/atomate_env2/lib/python3.9/site-packages/pymatgen/io/vasp/outputs.py", line 396, in _parse
    for event, elem in ET.iterparse(stream):
  File "/home1/09341/jamesgil/mambaforge/envs/atomate_env2/lib/python3.9/xml/etree/ElementTree.py", line 1253, in iterator
    yield from pullparser.read_events()
  File "/home1/09341/jamesgil/mambaforge/envs/atomate_env2/lib/python3.9/xml/etree/ElementTree.py", line 1324, in read_events
    raise event
  File "/home1/09341/jamesgil/mambaforge/envs/atomate_env2/lib/python3.9/xml/etree/ElementTree.py", line 1296, in feed
    self._parser.feed(data)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1037, column 0
ERROR:custodian.custodian:Validation failed: VasprunXMLValidator
Traceback (most recent call last):
  File "/home1/09341/jamesgil/mambaforge/envs/atomate_env2/lib/python3.9/site-packages/fireworks/core/rocket.py", line 261, in run
    m_action = t.run_task(my_spec)
  File "/home1/09341/jamesgil/mambaforge/envs/atomate_env2/lib/python3.9/site-packages/atomate/vasp/firetasks/run_calc.py", line 295, in run_task
    c.run()
  File "/home1/09341/jamesgil/mambaforge/envs/atomate_env2/lib/python3.9/site-packages/custodian/custodian.py", line 382, in run
    self._run_job(job_n, job)
  File "/home1/09341/jamesgil/mambaforge/envs/atomate_env2/lib/python3.9/site-packages/custodian/custodian.py", line 504, in _run_job
    raise ValidationError(s, True, v)
custodian.custodian.ValidationError: Validation failed: VasprunXMLValidator
INFO:rocket.launcher:Rocket finished
DEBUG:launchpad:Aggregation '[{'$match': {'$or': [{'spec._fworker': {'$exists': False}}, {'spec._fworker': None}, {'spec._fworker': 'Stampede2_skx-normal'}], 'spec._category': {'$exists': False}, 'state': {'$in': ['RUNNING', 'RESERVED']}}}, {'$project': {'fw_id': True, '_id': False}}]'.
INFO:rocket.launcher:Sleeping for 60 secs
INFO:rocket.launcher:Checking for FWs to run...
DEBUG:launchpad:Aggregation '[{'$match': {'$or': [{'spec._fworker': {'$exists': False}}, {'spec._fworker': None}, {'spec._fworker': 'Stampede2_skx-normal'}], 'spec._category': {'$exists': False}, 'state': {'$in': ['RUNNING', 'RESERVED']}}}, {'$project': {'fw_id': True, '_id': False}}]'.
INFO:rocket.launcher:Sleeping for 60 secs
INFO:rocket.launcher:Checking for FWs to run...
DEBUG:launchpad:Aggregation '[{'$match': {'$or': [{'spec._fworker': {'$exists': False}}, {'spec._fworker': None}, {'spec._fworker': 'Stampede2_skx-normal'}], 'spec._category': {'$exists': False}, 'state': {'$in': ['RUNNING', 'RESERVED']}}}, {'$project': {'fw_id': True, '_id': False}}]'.
INFO:rocket.launcher:Sleeping for 60 secs
INFO:rocket.launcher:Checking for FWs to run...

As you can see, the problem starts because of a malformed vasprun.xml, which then cannot be read or validated.
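
The parse failure can be reproduced by hand, since pymatgen simply streams the file through ElementTree; a minimal check on the offending vasprun.xml looks like this:

import xml.etree.ElementTree as ET

# Reproduce the validator's failure directly: pymatgen's Vasprun streams
# vasprun.xml through ET.iterparse, so this should fail at the same spot
# (line 1037, column 0 in my case).
try:
    for _event, _elem in ET.iterparse("vasprun.xml"):
        pass
    print("vasprun.xml parsed cleanly")
except ET.ParseError as exc:
    print("vasprun.xml is malformed:", exc)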

The following is the FW_submit.script

#!/bin/bash -l

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=20
#SBATCH --time=48:00:00
#SBATCH --partition=normal
#SBATCH --account=TG-MAT210016
#SBATCH --job-name=FW_job
#SBATCH --output=FW_job-%j.out
#SBATCH --error=FW_job-%j.error
#SBATCH --mail-type=START,END
#SBATCH [email protected]


conda activate atomate_env2; module load intel/17.0.4; module load impi/17.0.3; module load vasp/6.3.0
cd /scratch/09341/jamesgil/atomate_test/elasticRegPrintVASPcmd
rlaunch -c /home1/09341/jamesgil/atomate/config rapidfire --max_loops 3
conda deactivate; module purge

# CommonAdapter (SLURM) completed writing Template

I submit my workflows using:

qlaunch singleshot

If I enter the resulting launcher folder, its contents are as follows:

error.1.tar.gz contains a completed vasp run with ISMEAR = -5. error.2.tar.gz contains a partially complete vasp run with ISMEAR = 2 and SIGMA = 0.2. This calculation appears to stop abruptly partway through, presumably because LargeSigmaHandler steps in and adjusts the SIGMA value. Finally, the calculation left in the main folder is the one with ISMEAR = 2 and SIGMA = 0.14. That vasp run appears to complete successfully (based on the OUTCAR and vasp.out), but the structure optimization step still shows as a fizzled firework. There is nothing printed to the std_err.txt file here, nor any indication of the VasprunXMLValidator error in the vasp.out or OUTCAR.
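
If it helps, the custodian.json left in the launch directory can be dumped with a small sketch like the one below (assuming custodian's usual list-of-runs layout) to show which handlers fired on each attempt:

import json

# Minimal sketch, assuming custodian.json contains one entry per VASP run
# with a "corrections" list recording the handlers that fired.
with open("custodian.json") as fh:
    runs = json.load(fh)

for i, run in enumerate(runs, start=1):
    corrections = run.get("corrections", [])
    print(f"run {i}: {len(corrections)} correction(s)")
    for corr in corrections:
        print("  errors:", corr.get("errors"), "-> actions:", corr.get("actions"))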

I appreciate the patience of whoever reads this ungodly post - I’m sorry it’s so long. I want to give as much context and information as possible, and if I can supply any additional information that may help in troubleshooting, I’m happy to pass it along.

Does anyone have any ideas of how I should move forward?

As a followup, is there a way to specify in a workflow that VASP jobs must not overlap on the same node? I have been in contact with the help desk of the HPC I’m running atomate on, and they believe that the issue is related to MPI jobs overlapping on the same node. However, I’m not sure how to make sure the workflow does not launch overlapping VASP jobs…

Do you still have a problem if you use rlaunch singleshot in the qadapter.yaml? (With appropriate config, etc.)

Rlaunch rapidfire will submit a ton of VASP jobs on your nodes, which can cause them not to fit on the available resources. Rlaunch singleshot will just submit one VASP job, which should fit. If you must use rapidfire, you need to properly account for the number of jobs being submitted.

Hi all,

I think I know the reason for the errors and the overloading, since I am probably responsible for it.

This probably has to do with how custodian kills vasp. This used to be done by a killall command that terminated all running vasp processes. This was problematic when many jobs run on a single node (i.e. using rlaunch multi), which I tried to fix here.

However, this is an unreliable way to kill vasp on multiple nodes, since it “only” kills the process group of the mpirun (or srun, mpiexec, …) command and not the vasp processes themselves, and the scheduler will potentially change process groups and not release the resources. So custodian kills vasp and starts it again with modified inputs, but since some old vasp processes that did not die in time are still around, you get Device or resource busy errors.
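
To illustrate the difference between the two strategies, here is a rough sketch (not the actual custodian code):

import os
import signal
import subprocess

# Rough illustration only, not custodian's implementation.

def kill_by_process_group(launcher):
    # Kill only the process group of the mpirun/srun/ibrun launcher. Stray
    # vasp ranks that were reparented (or live on other nodes) can survive
    # and keep the hfi contexts busy for the next run.
    os.killpg(os.getpgid(launcher.pid), signal.SIGTERM)

def kill_vasp_directly():
    # Target the vasp executables themselves, which is the behaviour the new
    # custodian release returns to.
    subprocess.run(["killall", "vasp_std", "vasp_gam", "vasp_ncl"])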

This should now be fixed in the new custodian release, which again targets vasp processes directly (if one vasp process dies, it will send out a kill signal to all the others!). A detailed discussion can be found on the GitHub page, PR #254.

So I would advise updating custodian and then checking whether the error persists.
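
To double-check which release the workflow will actually import from the activated environment, something as simple as this works:

import custodian

# Print the custodian version that the atomate environment actually picks up.
print(custodian.__version__)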

Another possibility is that vasp actually terminates correctly, but there are some warnings in the output, e.g. “EDDRMM: call to ZHEGV failed”, which will also trip up custodian. I have seen some strange behavior with these kinds of things as well. Please let me know what your custodian output looks like.

Cheers, Michael

Thank you so much for this insight. This definitely sounds like it may be the issue. However, I experienced a compatibility issue with the most recent custodian update, which I posted about here. The MaxForceHandler is deprecated, but atomate continues to call on it, so I cannot get any atomate jobs to run using the May 12th version of custodian (unless there is an even more recent release that addresses this). Do you know anything about this, or whether further updates have been made in this regard?

I am on vacation and am unable to validate this at the moment, but I will get back with you as soon as I am back in mid-July! Again, thank you sincerely for your response.

I’ll try this - thank you Anubhav! At least if I can get just one VASP job on a compute node, then jobs shouldn’t overlap. However, will this allow proper communication between VASP calculations for a large atomate workflow like wf_elastic_constant? Also, I was under the impression that singleshot submits a single firework, which should in theory mean one VASP job, but I get failures within the same firework when custodian makes changes and restarts, so I have this concern even when using singleshot.

Unfortunately, I am on vacation until mid-July, so I won’t be able to try this until then, but I’ll get back with you as soon as I can!

On your last note, when using rapidfire, how do I properly account for the number of jobs being submitted? Would you mind elaborating on this a bit? Thank you so much.

Hi @jamesgil,
there is actually even more discussion going on at the moment. Some other bugs are related to this, and @firaty has done a deep dive to understand what is going on and how to fix it. See his PR for more info and the ongoing discussion.

I also learned from him (we share an office), after I made my last comment here, that this incompatibility with atomate exists. I will try to take a look in the next few days if I can get to it, so if you are lucky, this issue will be solved by the time you come back from holiday :slight_smile:.

I looked through @firaty's PR chain that you linked and I’m thrilled to see the progress! That sounds exactly like the problem I’m encountering, and I’m really eager to try the newest version to see whether it resolves my errors. My only issue now is that I’m unable to use the most up-to-date version of custodian, because atomate continues to call on MaxForceErrorHandler, which has been removed from custodian. I checked whether there were any atomate updates, but there don’t appear to be any past version 1.0.3. Do you know anything about progress being made in this regard? Regardless, I’m very excited about the progress made in custodian, and I’m grateful for all the work you all have put in to fix this bug!

Hi @jamesgil,
thanks for the praise, which really belongs to @firaty!

I am using a custom atomate fork, so I kind of forgot to make a PR for the MaxForceHandler issue on the main repo. I will look a bit more closely to see whether there is more to it than just removing all references to it, and then I will try to make a PR today. I think this should be fairly trivial compared to the custodian stuff.
Cheers, Michael