Debugging a Firework that defuses children

I have a workflow that does the following:

  1. OptimizationFW with custodian job type full_optimization
  2. OptimizationFW with custodian job type normal
  3. OptimizationFW with custodian job type normal

Sometimes the first Firework reaches state COMPLETED, yet defuses the next Firework, and I’m not sure why. I tried running the same structure again and reproduced the behavior, but a different structure did not.
I thought this might be related to the number of optimizations performed by custodian, but even with more relaxations in the second structure I tried, I cannot reproduce it.
The main difference, it seems, besides the structures themselves, is the time the first Firework takes (~5 hours) compared to ~30 minutes in the second test case.
I’m not running into any walltime issues that I’m aware of.

This is partly a question of how to debug this.
Below is the -d more output of the full-optimization Firework that is COMPLETED but for some reason defused the next one.
Is my understanding correct that the PassCalcLocs Firetask (the fourth task in my Firework, right after the RunVaspCustodian task) caused this?

What next steps can I take to find out what’s going on?
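One debugging step I found useful: parse the Firework document (the -d more dump below) and check which launch’s action actually set defuse_children. A minimal stdlib-only sketch, assuming the document has the shape shown below (function name is mine, not a FireWorks API):

```python
def launches_that_defused(fw_doc):
    """Return the launch_ids whose returned action set defuse_children.

    fw_doc is a parsed Firework document, e.g. json.loads() of a
    "-d more" dump like the one below.
    """
    return [
        launch["launch_id"]
        for launch in fw_doc.get("launches", [])
        if launch.get("action", {}).get("defuse_children")
    ]

# Tiny inline example with only the fields that matter:
doc = {"launches": [{"launch_id": 3, "action": {"defuse_children": True}}]}
print(launches_that_defused(doc))  # -> [3]
```

Running this over the dump below points at launch_id 3, so the defuse came from that launch’s returned action rather than from anything external.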

{
  "name": "MgCu-structure optimization",
  "launches": [
    {
      "fworker": {
        "category": "",
        "query": "{}",
        "name": "ACI",
        "env": {
          "scratch_dir": "/storage/home/bjb54/work/atomate-scratch",
          "vasp_cmd": "mpirun vasp_std",
          "db_file": "/storage/home/bjb54/work/atomate/config/db.json",
          "incar_update": {
            "ncore": 4
          }
        }
      },
      "trackers": [],
      "ip": "10.102.101.223",
      "fw_id": 7,
      "state": "COMPLETED",
      "host": "comp-bc-0223.acib.production.int.aci.ics.psu.edu",
      "launch_dir": "/storage/work/bjb54/test-full-opt/launcher_2017-10-12-16-29-43-799766",
      "action": {
        "defuse_workflow": false,
        "update_spec": {},
        "mod_spec": [
          {
            "_push_all": {
              "calc_locs": [
                {
                  "path": "/storage/work/bjb54/test-full-opt/launcher_2017-10-12-16-29-43-799766",
                  "name": "structure optimization",
                  "filesystem": null
                }
              ]
            }
          }
        ],
        "stored_data": {
          "task_id": 237
        },
        "exit": false,
        "detours": [],
        "additions": [],
        "defuse_children": true
      },
      "launch_id": 3,
      "state_history": [
        {
          "checkpoint": {
            "_task_n": 4,
            "_all_update_spec": {},
            "_all_mod_spec": [
              {
                "_push_all": {
                  "calc_locs": [
                    {
                      "path": "/storage/work/bjb54/test-full-opt/launcher_2017-10-12-16-29-43-799766",
                      "name": "structure optimization",
                      "filesystem": null
                    }
                  ]
                }
              }
            ],
            "_all_stored_data": {}
          },
          "updated_on": "2017-10-12T20:23:02.070220",
          "state": "RUNNING",
          "created_on": "2017-10-12T16:29:44.013708"
        },
        {
          "state": "COMPLETED",
          "created_on": "2017-10-12T20:23:02.138567"
        }
      ]
    }
  ],
  "fw_id": 7,
  "state": "COMPLETED",
  "created_on": "2017-10-12T16:27:26.194861",
  "updated_on": "2017-10-12T20:23:02.415661"
}

My guess is that the database insertion step defused it due to an unsuccessful run; see:

atomate/vasp/firetasks/parse_outputs.py:101
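For context, a minimal stdlib-only sketch of the pattern I mean (function name and fields are mine, not the actual atomate code): the outputs are parsed and inserted, the Firetask itself finishes without error, but an unsuccessful task document makes the returned action defuse the children. That matches the dump above, where the Firework is COMPLETED yet the action carries "defuse_children": true.

```python
def action_for_parsed_run(task_state):
    """Sketch of the decision at the end of the db-insertion Firetask.

    The task finishes cleanly (so the Firework ends up COMPLETED), but
    an unsuccessful task document makes the returned action defuse all
    child Fireworks.
    """
    return {"exit": False, "defuse_children": task_state != "successful"}

print(action_for_parsed_run("unsuccessful"))
# -> {'exit': False, 'defuse_children': True}
```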

···

On Thu, Oct 12, 2017 at 2:46 PM, Brandon B [email protected] wrote:



To view this discussion on the web visit https://groups.google.com/d/msgid/atomate/bdaffbce-9f3f-4a2f-b2e9-c5dbda2c4d0c%40googlegroups.com.


Best,
Anubhav

I am not sure what happened to the message, but Brandon replied:

···

========

I saw that as well. It seems like the only place that could come from.

Is it the correct behavior for jobs to be marked as COMPLETED rather than FIZZLED in this case? Is the rationale that I should try parsing those outputs by hand?

========

It is up to us to define what the correct behavior is. The “defuse” action happens when the VASP run completes OK, but something is still wrong. For example, say you hit the maximum number of ionic steps and didn’t actually converge to your desired tolerance. The job didn’t “fail”, and the VASP outputs would look normal at first glance, but we also didn’t succeed in getting what we need to proceed to the next step.

Two options are:

  1. Call that particular run COMPLETED but defuse remaining runs. This (the current behavior) reflects the idea that the run did in fact complete without errors, but that we wouldn’t want to continue with the workflow.

  2. Call that particular run FIZZLED; this would alert a user that this is perhaps a run that needs some kind of “fixing”.

I’m happy to discuss with some of the others if you think (2) is the better way forward. Let me know.
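To make the contrast concrete, a hedged sketch (function name is hypothetical): option 1 corresponds to returning an action that defuses the children, while option 2 corresponds to raising instead, since in FireWorks an uncaught exception in a Firetask marks the Firework FIZZLED.

```python
def finish_run(task_state, fizzle_on_failure=False):
    """Contrast the two options for a clean-but-unsuccessful run.

    Option 1 (current behavior): return an action that defuses the
    children; the Firework itself still shows COMPLETED.
    Option 2: raise instead; the uncaught exception makes the Firework
    FIZZLED, flagging it for manual attention.
    """
    if task_state != "successful":
        if fizzle_on_failure:  # option 2
            raise RuntimeError("run finished but was marked %s" % task_state)
        return {"defuse_children": True}  # option 1
    return {"defuse_children": False}
```

Either way, the information that the run needs attention is preserved; the difference is only where it surfaces (a defused child vs. a FIZZLED parent).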

Best,

Anubhav

On Thursday, October 12, 2017 at 2:46:06 PM UTC-7, Brandon B wrote:
