I am running a large number of simulations in parallel using the universe variable and -partition flag with the 29 Oct 2020 distribution of LAMMPS. A sample input file is shown below.
variable d universe list of paths to individual simulations
shell cd $d
log log.lammps
include simulation.file
clear
shell cd /path/to/input.file
next d
jump input.file
Typically, I am running with 4 or 8 cores per partition and somewhere in the range of 100-400 partitions. An example submission command to our cluster, where I was allocated 1344 cores across 14 nodes, is below. In this case I had 768 total simulations to run, i.e. my list of paths to individual simulations had length 768.
srun /path/to/29oct2020 -in input.file -partition 168x8 -plog none -pscreen none
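As a sanity check on the numbers in that command, here is a minimal Python sketch of the partition arithmetic. The function name partition_plan is hypothetical, and the per-partition counts assume every simulation takes about the same time; in practice each partition picks up the next unused value of the universe variable whenever it finishes, so the real distribution depends on timing.

```python
def partition_plan(n_partitions, cores_per_partition, n_simulations):
    """Summarize how a universe-variable list maps onto partitions.

    Assumes equal-length simulations, which is only an approximation.
    """
    total_cores = n_partitions * cores_per_partition
    # Every partition runs at least `full_rounds` simulations;
    # `remainder` partitions pick up one extra.
    full_rounds, remainder = divmod(n_simulations, n_partitions)
    return {
        "total_cores": total_cores,
        "full_rounds": full_rounds,
        "remainder": remainder,
    }


plan = partition_plan(n_partitions=168, cores_per_partition=8, n_simulations=768)
print(plan)
```

For -partition 168x8 and 768 simulations this gives 1344 cores, with every partition running 4 simulations and 96 partitions running a fifth.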
A usual submission protocol for me is as follows. For each system, I run a short simulation for minimization and a quick equilibration, after which I run another very short simulation from which I estimate the efficiency to approximate the total time to run all of the longer equilibration/production simulations. I tried to attach scripts but was told I can’t. The important point here is that the input scripts of the simulation for the efficiency check and the long production simulation are identical except for the number of timesteps.
When using the -partition flag and universe variable for the shorter simulations (minimization and efficiency check), I run into no issues and get output as expected. Notably, tmp.lammps.variable iterates up to the total number of simulations in the variable list and the job finishes when all simulations are done running. However, when running the longer simulations, tmp.lammps.variable continues to iterate beyond the total number of simulations in the variable list (right now, with 768 total simulations, tmp.lammps.variable is at 843), and some of the final simulations appear to freeze, with no error message in their log files. I’m not sure what to make of this. Any ideas about what might be going on or how I can test what is going on?
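One check I can run in the meantime is to scan each simulation directory's log.lammps and see which simulations actually completed. Below is a minimal Python sketch of that idea. It relies on LAMMPS writing a line starting with "Loop time of" after each completed run or minimize command; the function names, the expected_runs parameter, and the three status labels are my own assumptions, not anything provided by LAMMPS.

```python
from pathlib import Path


def count_completed_runs(log_path, marker="Loop time of"):
    """Return how many run/minimize commands finished, or None if no log exists."""
    try:
        text = Path(log_path).read_text(errors="replace")
    except FileNotFoundError:
        return None
    return sum(1 for line in text.splitlines() if line.startswith(marker))


def classify_simulations(sim_dirs, expected_runs=1, log_name="log.lammps"):
    """Sort simulation directories by how far their log file got.

    expected_runs is an assumption: the number of run/minimize commands
    in simulation.file.
    """
    status = {"finished": [], "incomplete": [], "not_started": []}
    for sim_dir in sim_dirs:
        n_done = count_completed_runs(Path(sim_dir) / log_name)
        if n_done is None:
            status["not_started"].append(str(sim_dir))
        elif n_done >= expected_runs:
            status["finished"].append(str(sim_dir))
        else:
            status["incomplete"].append(str(sim_dir))
    return status
```

Calling classify_simulations on the same list of paths given to the universe variable would show whether the "incomplete" directories cluster at the end of the list, which is where the apparent freezes occur.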