Hi Akira,
If I understand correctly, there are two concerns behind your preference for running the builders directly after a job finishes: duplicate matching and how frequently the builders run.
For duplicate matching, one could insert a custom duplicate-matching Firetask as the first step of a workflow, e.g., by coding some kind of DuplicateTask. Before the rest of the workflow starts, this task would do something like the following:
- Search the tasks database for anything with the same formula AND task_label (we do want to be able to run the same structure many times, so long as the task type is different).
- For all matching tasks with the same formula and task_label, use the StructureMatcher to confirm that the structure is the same (and not just the formula).
- If both are true, use the FWAction to exit.
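The matching logic in these steps can be sketched in plain Python. This is only an illustration, not atomate code: the function name `find_duplicate`, the document keys (`formula`, `task_label`, `structure`), and the `structures_match` callable are all hypothetical. In a real Firetask, `existing_tasks` would come from a query on the tasks database, `structures_match` could be something like pymatgen's `StructureMatcher().fit`, and a hit would lead to returning an FWAction that exits.

```python
def find_duplicate(existing_tasks, formula, task_label, structure, structures_match):
    """Return the first task that duplicates the requested job, or None.

    existing_tasks   -- iterable of task documents (hypothetical dict layout)
    structures_match -- callable(a, b) -> bool, a stand-in for a real
                        structure comparison
    """
    for task in existing_tasks:
        # Step 1: cheap filter on formula AND task_label.
        if task["formula"] != formula or task["task_label"] != task_label:
            continue
        # Step 2: confirm the structure itself matches, not just the formula.
        if structures_match(task["structure"], structure):
            return task
    # No duplicate found: the workflow should proceed as usual.
    return None


# Toy data: strings stand in for real structure objects.
tasks = [
    {"formula": "SiO2", "task_label": "structure optimization", "structure": "quartz"},
    {"formula": "SiO2", "task_label": "static", "structure": "quartz"},
]
same = lambda a, b: a == b  # placeholder for a real structure matcher

hit = find_duplicate(tasks, "SiO2", "structure optimization", "quartz", same)
miss = find_duplicate(tasks, "SiO2", "structure optimization", "cristobalite", same)
```

Here `hit` is the first task document, while `miss` is `None` because the formula and task_label agree but the structure does not.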
There are two problems with the above strategy:
(i) If we check for duplicates against the task (or materials) database, we do not check against other jobs that are scheduled to run but have not finished. e.g., if we enter the same job 1000 times into FireWorks, and the first 10 start running at the same time, none of them will be aware that the others are the same job. In my opinion, this is a minor concern.
(ii) Often, one wants to run two workflows on the same material that share the same task_label. For example, one might want to run both a band structure workflow and a dielectric constant workflow on the same material. In this case, both workflows share a structure optimization task, and if we simply quit when detecting the duplication, we will miss out on being able to run both kinds of workflows.
To mitigate these problems, we used to use the duplicate checking feature built into FireWorks, which was designed to handle this kind of problem, together with a custom “DupeFinder” object: https://materialsproject.github.io/fireworks/duplicates_tutorial.html
But still, we found this somewhat cumbersome to manage, and also found it difficult to define exactly what it meant for a job to be a duplicate. For example, two structure optimization tasks on the same material might differ if the second one has a stricter convergence tolerance than the first. In this case, one would not want to skip the second job, since the user might specifically be trying to run the same thing again with a stricter tolerance. We had solutions for this as well (e.g., add a “user_label” such as “strict_convergence” to the job, and make sure that duplicate checking only applied to jobs without any user_labels). However, it got a little complex.
So, in the end we felt that with atomate it is best to err on the side of doing too many computations and to have the builder figure out how to assemble those tasks into a coherent structure. That is why we stopped doing duplicate checks. That said, some kind of high-level duplicate checking (e.g., before submitting a workflow to the FireWorks database, check the FireWorks or tasks database to see if similar calculations already exist, and do not submit if we see anything that’s the same) is something we would still do in some of our projects where duplication may be an issue.
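As a rough illustration of such a high-level check before submission, here is a minimal sketch. Everything in it is an assumption made for the example: the function name `filter_new_jobs`, the dict keys, and the rule that any job carrying a `user_label` is exempt from duplicate checking. A real check would also compare structures rather than rely on the formula alone.

```python
def filter_new_jobs(candidates, existing):
    """Return the candidate jobs that should still be submitted.

    Jobs are plain dicts with hypothetical keys "formula", "task_label"
    and an optional "user_label". A job with a user_label is treated as
    an intentional rerun and is never considered a duplicate.
    """
    # Keys of unlabeled jobs that are already in the database.
    seen = {
        (job["formula"], job["task_label"])
        for job in existing
        if not job.get("user_label")
    }
    to_submit = []
    for job in candidates:
        if job.get("user_label"):
            # e.g. "strict_convergence": always submit.
            to_submit.append(job)
            continue
        key = (job["formula"], job["task_label"])
        if key in seen:
            continue  # an equivalent unlabeled job already exists
        seen.add(key)  # also removes duplicates within this batch
        to_submit.append(job)
    return to_submit


existing = [{"formula": "Si", "task_label": "structure optimization"}]
candidates = [
    {"formula": "Si", "task_label": "structure optimization"},
    {"formula": "Si", "task_label": "structure optimization",
     "user_label": "strict_convergence"},
    {"formula": "GaAs", "task_label": "structure optimization"},
    {"formula": "GaAs", "task_label": "structure optimization"},
]
new_jobs = filter_new_jobs(candidates, existing)
```

With this toy input, `new_jobs` keeps the labeled Si rerun and a single GaAs job. Because the check runs once over the whole batch before submission, it also catches repeats among jobs that have not started running yet.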
As for frequency of building, you could probably build as frequently as you’d like, with the caveat that I am not sure the builders are designed to be safe when two processes of the same builder run in parallel. I think it would be best to have just one instance of a particular builder running at a time, or at least to inspect the code to make sure there would be no problems from two builders running simultaneously. Note that for conflicts, the problem is not really the integrity of the write (which can be handled with the “safe write” option), but rather race conditions that might exist.
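One simple way to keep a single instance of a builder running at a time is a lock file that is created atomically. The sketch below is a generic pattern and not part of atomate; the names `single_instance` and `BuilderAlreadyRunning` are hypothetical. Note that if the process is killed hard, a stale lock file can be left behind and would need to be removed by hand.

```python
import os
from contextlib import contextmanager


class BuilderAlreadyRunning(RuntimeError):
    """Raised when the lock file for this builder already exists."""


@contextmanager
def single_instance(lock_path):
    """Hold an exclusive lock file for the duration of the with-block."""
    try:
        # O_CREAT | O_EXCL makes creation atomic: it fails if the file exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise BuilderAlreadyRunning(lock_path) from None
    try:
        # Record the owner's PID to help with debugging stale locks.
        os.write(fd, str(os.getpid()).encode())
        yield
    finally:
        # Release the lock on normal exit and on exceptions alike.
        os.close(fd)
        os.remove(lock_path)
```

A script started from cron could wrap the builder call in `with single_instance("/some/path/builder.lock"):` and simply exit when `BuilderAlreadyRunning` is raised, so that a slow build is never overlapped by the next scheduled one.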
We ourselves typically use a cron job to initiate the building. Depending on the project, it could be once every hour or once every day.
I hope this answers your questions. Certainly, there is room for improvement within atomate and many times the reason a feature is not present is simply that we did not have time to do a good job in coding it, even though the feature could be very useful.
Best,
Anubhav