Question about Builder

Hello,

Our group is trying to build a database using atomate.

I have a question about Builders and the timing of finding duplicate materials.

To my understanding, Builders have to run AFTER some workflows (e.g., wf_dielectric_constant, wf_bandstructure_hse, etc.) finish; they then reduce duplicate materials with StructureMatcher and update the “materials” collection.

However, we are now trying to write a RunBuilderTask (a FireTask) that lets Builders run StructureMatcher and update the “materials” collection at the end of OptimizationFW, i.e., BEFORE the next Fireworks run.

This seems more efficient, because the Workflow can exit via FWAction when a duplicate structure is found, so subsequent unnecessary Fireworks (e.g., LepsFW, HSEBSFW, etc.) will NOT run.

Moreover, this way we will not forget to run the Builder.

So I wonder why atomate doesn’t find duplicate structures and update the “materials” collection INSIDE Workflows.

If conflicting updates are the problem, how about using a safe write option, like “coll.update(dict1, dict2, safe=True)”?

And if such frequent updating is a bad idea, how often should we run the Builder?

If we should run the Builder after workflows finish, we will run it automatically via crontab.

Or should we run the Builder manually and check for errors with our own eyes?

I would like to hear your advice and ideas.

Thanks,

Akira Takahashi

Hi Akira,

If I understand correctly, there are two concerns for which you prefer to run builders directly after the jobs finish:

  • duplicate matching

  • remembering to run the builder

For duplicate matching, one could insert a custom duplicate-matching Firetask as the first step of a workflow, e.g., by coding some kind of DuplicateTask. This task would do something like the following before starting the workflow:

  • search the tasks database for anything with the same formula AND task_label (since we do want to allow running the same structure many times, so long as the task type is different)

  • for all matching tasks with the same formula and task_label, use the StructureMatcher to confirm that the structure is actually the same (and not just the formula)

  • if both are true, use FWAction to exit the workflow.
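The three checks above can be sketched as a small helper. To keep it self-contained, everything here is a simplified stand-in: in a real atomate setup the candidate documents would come from the tasks collection, and `match` would be pymatgen's `StructureMatcher.fit`.

```python
# Sketch of the duplicate check described above (hypothetical names).
# Real candidates would come from the tasks collection; `match` would
# be pymatgen's StructureMatcher.fit rather than a plain callable.

def find_duplicate(new_doc, existing_docs, match):
    """Return the task_id of an existing duplicate, or None.

    Step 1: cheap pre-filter on formula AND task_label, so the same
            structure can still run under a different task type.
    Step 2: confirm true structural equivalence with `match`
            (e.g. StructureMatcher.fit), not just the same formula.
    """
    for doc in existing_docs:
        if (doc["formula"] == new_doc["formula"]
                and doc["task_label"] == new_doc["task_label"]
                and match(new_doc["structure"], doc["structure"])):
            return doc["task_id"]
    return None
```

Inside the hypothetical DuplicateTask, a non-None result would translate into something like `FWAction(defuse_children=True)` so that the rest of the workflow is skipped.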

There are two problems with the above strategy:

(i) If we check for duplicates against the task (or materials) database, it does not do duplicate checking against other things scheduled to be run. e.g., if we enter the same job 1000 times into FireWorks, and the first 10 start running at the same time, none of them will be aware that the others are the same job. In my opinion, this is a minor concern.

(ii) Often, one wants to run two workflows on the same material that share the same task_label. For example, one might want to run both a band structure and dielectric constant of the same material. In this case, both workflows share a structure optimization task, and if we simply quit when detecting the duplication we will miss out on being able to run both kinds of workflows.

To mitigate these problems, we used to use the built-in duplicate checking feature of FireWorks, which we designed to handle this kind of problem (https://materialsproject.github.io/fireworks/duplicates_tutorial.html), together with a custom “DupeFinder” object.

Still, we found this somewhat cumbersome to manage, and also found it difficult to define exactly what made a job a duplicate. For example, two structure optimization tasks on the same material might differ if the second one has a stricter convergence tolerance than the first. In that case, one would not want to skip the second job, since the user might specifically be trying to run the same thing again with a stricter tolerance. We had solutions for this as well (e.g., add a “user_label” like “strict_convergence” to the job, and have duplicate checking treat jobs with different user_labels as distinct). However, it got a little complex.

So, in the end, we felt that with atomate it was best simply to err on the side of doing too many computations and let the builder figure out how to assemble those tasks into a coherent structure. That is why we stopped doing duplicate checks, although some kind of high-level duplicate checking (e.g., before submitting a workflow to the FireWorks database, check the FireWorks or tasks databases to see whether similar calculations already exist, and do not submit if anything matches) is something we would currently do in some of our projects where duplication may be an issue.

As for the frequency of building, you could probably build as frequently as you’d like, with the caveat that I am not sure the builders are designed to be safe against two processes of the same builder running in parallel. I think it would be best to have just one instance of a particular builder running at a time, or at least to inspect the code to make sure two simultaneous builders would not cause problems. Note that for conflicts, the problem is not really the integrity of the write (which can be handled with the “safe write” option), but rather race conditions that might exist.
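One simple way to enforce the single-instance rule, especially for cron-driven builds, is a non-blocking exclusive file lock around the build. This is an assumption on my part, not an atomate feature; the function and lockfile path below are hypothetical:

```python
# Guard a builder run with an exclusive, non-blocking file lock so that
# a second invocation (e.g. from an overlapping cron job) exits at once
# instead of racing the first. Unix-only (uses fcntl.flock).
import fcntl

def run_exclusively(build, lockfile="/tmp/atomate_builder.lock"):
    """Run build() only if no other process holds the lock.

    Returns True if the build ran, False if another instance held the lock.
    """
    with open(lockfile, "w") as fh:
        try:
            fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return False  # another builder is already running; skip this run
        build()  # the lock is released when the file handle is closed
        return True
```

A cron entry can then invoke a script that calls `run_exclusively(run_all_builders)` without worrying about overlapping runs.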

We ourselves typically use a cron job to initiate the building. Depending on the project, it could be once every hour or once every day.

I hope this answers your questions. Certainly, there is room for improvement within atomate and many times the reason a feature is not present is simply that we did not have time to do a good job in coding it, even though the feature could be very useful.


Best,
Anubhav

Hi Anubhav,

Thanks for the very fast and detailed reply. It is exactly what we wanted to know.

-duplicate matching

I discussed this with my boss.

Because our project is relatively simple, involves only a few members, and will probably not add new task types like “strict_convergence”, we concluded that the problems you pointed out pose relatively little risk to our project.

Moreover, our project may include HSE calculations, which take a very long time, so we would like to reduce duplicate calculations as much as possible.

So we decided to write a FireTask that exits when a duplicate structure is found.

We will also consider DupeFinder, or a more customized Firetask (one that can not only exit but also pass along the structure and duplicate-check result, read the task_label, etc.), if necessary.

But anyway, thank you for your advice.

-remembering to run the builder

Following your advice, I ran two builders simultaneously and found that something strange can happen.

For example, when I execute

“python run_builders.py & python run_builders.py”

(the first run_builders.py runs in the background, so the two builders run almost simultaneously),

tasks_id becomes [“t-36”, “t-38”, “t-36”, “t-38”], whereas a single builder produces [“t-36”, “t-38”].

That is probably due to a race condition, as you say, i.e.:

  1. 1st Builder reads task_id from db

  2. 2nd Builder reads task_id from db

  3. 1st Builder adds task_id to db

  4. 2nd Builder (which does not know the 1st Builder already added task_id) adds task_id to the db.
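The four-step interleaving above can be reproduced deterministically in plain Python (no MongoDB needed; the dict below stands in for the collection, and the names are illustrative only):

```python
# Deterministic reproduction of the race: both builders read the
# collection before either writes, so each one appends the full id list.
db = {"tasks_id": []}
new_ids = ["t-36", "t-38"]

# steps 1 and 2: both builders read (and both see an empty list)
seen_by_1 = set(db["tasks_id"])
seen_by_2 = set(db["tasks_id"])
to_add_1 = [t for t in new_ids if t not in seen_by_1]
to_add_2 = [t for t in new_ids if t not in seen_by_2]

# steps 3 and 4: both builders write, the second unaware of the first
db["tasks_id"] = db["tasks_id"] + to_add_1
db["tasks_id"] = db["tasks_id"] + to_add_2

print(db["tasks_id"])  # ['t-36', 't-38', 't-36', 't-38']
```

With a real MongoDB collection, an atomic update operator such as `$addToSet` (with `$each` for a list) would sidestep this particular read-modify-write race, though running only one builder at a time remains the simpler guarantee.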

There are probably several ways to avoid this (locking the DB while a builder runs, or perhaps more standard/elegant approaches), but we decided to run run_builders.py via crontab to reduce the human cost and the risk of mistakes.

Again, we really appreciate your kind help.

Thanks,

Akira Takahashi

On Sunday, November 18, 2018 at 18:57:42 UTC+9, Anubhav Jain wrote:

···
