I’m running lots of simulations (~35 at once) on an HPC cluster. Each simulation takes far longer than the node’s wall-time limit, so I have to submit a new Slurm job to restart the simulation. Up until now, I’ve just been using a series of bash and Python scripts to do this, but they don’t handle unexpected events well, like when a job fails.
I was curious to see what tools others in the LAMMPS community have used to solve this kind of problem. I know Atomate has a LAMMPS module, but I’ve heard it’s not super well maintained. Currently, I’m looking into generic workflow packages like Airflow and Luigi.
I have been using the “job dependencies” feature built into the batch system, e.g. Torque, LoadLeveler.
There you submit a sequence of jobs up front (so they are already waiting and gaining priority in the queue), each held while the current job runs; on successful completion the next one is released, and so on for chains of 10+ jobs. There is also an option to execute a different script (or the same script with different flags) when a job fails instead of completing successfully, but I have never used that. It is simpler to create a lock file at the beginning of each job and remove it as the last step: the next job can then detect whether it has to recover from a failed run or can just continue from a successful one.
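The chain-plus-lock-file scheme above can be sketched roughly as follows. This is a minimal illustration, not a drop-in script: the file names (`run.sh`, `sim.lock`), the chain length, and the recovery logic are all placeholders you would adapt to your own setup.

```shell
#!/bin/bash
# Submit a chain of jobs, each held until its predecessor succeeds
# (Slurm syntax; Torque/LoadLeveler have equivalent dependency options):
#
#   jobid=$(sbatch --parsable run.sh)
#   for i in $(seq 2 10); do
#       jobid=$(sbatch --parsable --dependency=afterok:$jobid run.sh)
#   done
#
# Inside run.sh, a lock file records whether the previous leg died mid-run:

LOCKFILE="sim.lock"

if [ -e "$LOCKFILE" ]; then
    # predecessor never reached its last step: clean up / roll back first
    echo "lock file present: recovering from a failed job"
    MODE=recover
else
    echo "no lock file: continuing from a successful job"
    MODE=continue
fi

touch "$LOCKFILE"    # mark this leg as running

# ... run LAMMPS here, e.g. reading the appropriate restart file
#     depending on $MODE ...

rm -f "$LOCKFILE"    # very last step: signal clean completion
```

Note that with `afterok` a failed job leaves the rest of the chain pending forever unless you also clean up with `scancel`, which is another reason the lock-file check is handy.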
I have been flirting with workflow languages for a while, but still haven’t settled on one particular solution. Automating complex workflows is a big challenge and probably a borderline topic here. However, I believe that sharing experiences in this regard could benefit the LAMMPS community, as it opens the way to batch testing of big libraries of compounds consistently and repeatably. Here are a few projects that I know of:
My current interest is managing LAMMPS and Moltemplate with a workflow manager. I assemble complex structures outside LAMMPS (e.g. combining a solid surface with a liquid to make an interface) and render the output structure and force field with Moltemplate. I find this approach very practical for handling complex DATA files, which are otherwise very hard to create with scripting. The next step is to automate the post-processing and data analysis, and to extend the workflow to other software (e.g. MD → DFT).
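As a concrete example of the Moltemplate → LAMMPS step described above, one stage of such a workflow might look like the sketch below. The input names (`system.lt`, `run.in`) are hypothetical, and the LAMMPS binary name varies between installs (`lmp`, `lmp_serial`, `lmp_mpi`); the guard just skips the stage when the tools are not on `PATH`.

```shell
#!/bin/bash
set -e

STATUS="skipped"
if command -v moltemplate.sh >/dev/null 2>&1 && command -v lmp >/dev/null 2>&1; then
    # moltemplate.sh reads system.lt and writes system.data plus the
    # system.in.init / system.in.settings files that run.in "include"s
    moltemplate.sh system.lt

    # run.in reads the generated DATA file via read_data and runs the MD
    lmp -in run.in
    STATUS="done"
fi
echo "moltemplate stage: $STATUS"
```

A workflow manager would wrap each such stage as a task, so the post-processing and any follow-up DFT step can declare a dependency on the files this stage produces.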