I know that uloop can be used to run simulations 1 to N. But is there a way to run simulations N1 to N2 instead? I know that loop can do that, but loop does not run simulations concurrently on multiple partitions. I am trying to run 300 simulations, but I cannot run all of them together, so I need to run three batches of jobs, each running 100 simulations, i.e. job1 runs 1 to 100, job2 runs 101 to 200, and job3 runs 201 to 300.
Also, please check in with your local cluster administrator and get their advice. Running a large number of parallel short jobs is not easy to do right in general, and on a shared machine (i.e. a cluster) it is even more important to get things right. What happens if you launch a set of 100 jobs and, just because one or two of them crash, the other 90+ jobs also abort? Or if you have 100 processes trying to access a lock file or output file at once?
You need to get proper computing advice in this situation, and preferably advice suited to your cluster.
Please note that I fully endorse the advice by @srtee. Using some kind of workflow manager is the best and most efficient solution.
That said, nothing stops you from defining an equal style variable that adds an offset to the uloop style variable. That offset could be provided as an index style variable which can also be overridden from the command line.
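For example, a minimal sketch of the input-script side (the variable names and the loop length of 100 are just placeholders) could look like this:

variable offset index 0              # ignored when the job overrides it with -var offset ... on the command line
variable i      uloop 100            # per-job loop index, distributed over the partitions
variable n      equal v_i+v_offset   # global simulation number = uloop index + offset

# ... set up and run simulation number ${n} here (file names, seeds, data files, ...) ...

clear
next i
jump SELF

Since an index-style variable given with -var on the command line takes precedence over the definition in the script, the same input file can then be submitted as three jobs with -var offset 0, -var offset 100, and -var offset 200 to cover simulations 1-100, 101-200, and 201-300.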
Thanks @srtee and @akohlmey. I was able to work around it by passing the offset value to LAMMPS. Now I can run concurrent simulations in batches of size batch_size. These are the commands in my job file:
export simFile=$(find . -type f -name "*.in" -print0 | xargs -0)  # assumes there is exactly one *.in file in this directory
export n_sim_total=100  # total number of simulations
export batch_size=20    # simulations (partitions) per batch
export loop_size=$((n_sim_total/batch_size-1))  # batches are numbered 0..loop_size
export n_proc_per_partition_per_batch=$((SLURM_NPROCS/batch_size))  # cores per simulation (partition)
export n_proc_total_per_batch=$((batch_size*n_proc_per_partition_per_batch))  # cores per batch
echo "Running $n_sim_total total number of MD simulations on $((loop_size+1)) batches where each batch"\
"runs $batch_size simulations." \
"Each batch reserves $n_proc_total_per_batch total number of processors with" \
"each simulation in the batch runnning on $n_proc_per_partition_per_batch processors"
module load lammps
for n in $(seq 0 $loop_size)
do
echo "processing the batch number $((n+1))"
export upper_rng=$(((n+1)*batch_size))  # index of the last simulation in this batch
export offset=$((n*batch_size))         # shift added to the uloop index inside the LAMMPS input
mpirun -n ${n_proc_total_per_batch} lmp -in $simFile \
-var RANDOM ${RANDOM} \
-var n_sim_total ${n_sim_total} \
-var batch_size ${batch_size} \
-var offset ${offset} \
-partition ${batch_size}x${n_proc_per_partition_per_batch}
done
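To make the resource arithmetic concrete, here is how the numbers work out for a hypothetical 80-core allocation:

# example: SLURM_NPROCS=80, n_sim_total=100, batch_size=20
# loop_size = 100/20 - 1 = 4, i.e. batches 0..4 (5 batches of 20 simulations)
# n_proc_per_partition_per_batch = 80/20 = 4 cores per simulation
# n_proc_total_per_batch = 20*4 = 80 cores per mpirun call, i.e. -partition 20x4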
However, this code cannot handle situations where the total number of simulations (n_sim_total) is not divisible by the batch size (batch_size). I tried to break the loop using if "${ff_file_number}<${n_sim_total}" then "jump SELF break" somewhere in my LAMMPS input file, but I got the error below:
ERROR: Label wasn't found in input script
I know that there are some limitations with jump SELF break, and that is why I pass my LAMMPS script with the -in $simFile switch. I then changed -in to -var fname but got the error below:
ERROR: Must use -in switch with multiple partitions (src/lammps.cpp:455)
I guess there is nothing that can be done at this moment, correct?
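For what it is worth, one shell-side workaround I can think of (an untested sketch, using the same variable names as my job file above) would be to append one extra, smaller batch after the loop whenever there is a remainder:

remainder=$((n_sim_total % batch_size))
if [ "$remainder" -ne 0 ]; then
    export offset=$((n_sim_total - remainder))
    mpirun -n $((remainder*n_proc_per_partition_per_batch)) lmp -in $simFile \
           -var RANDOM ${RANDOM} \
           -var n_sim_total ${n_sim_total} \
           -var batch_size ${remainder} \
           -var offset ${offset} \
           -partition ${remainder}x${n_proc_per_partition_per_batch}
fi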
Also, I realized that every time a batch runs, the numbering of the log files resets to 0. Is there any way I can prevent this, so that I end up with as many log files as the total number of simulations and not just the number of simulations in each batch?
UPDATE: I was able to solve the last problem by using the -plog command-line switch.
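In case it helps others, the idea is to give each batch its own base name for the partition logs, e.g. by adding something like the following to the lmp command inside the loop (the name itself is arbitrary; -plog just sets the base name for the per-partition log files):

-plog log.batch$((n+1))

That way batch 1 writes log.batch1.0, log.batch1.1, ..., batch 2 writes log.batch2.0, log.batch2.1, ..., and so on, instead of every batch overwriting the default log.lammps.N files.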
Please ask your local cluster administrator for help. If you are too afraid of them / unfriendly with them, then please get advice from a local mentor or supervisor who is familiar with using HPC resources.
There are lots of ways for your script to break and lots of ways it can be improved, which I won’t go into because this is a LAMMPS forum and not a Bash (or shell) forum, but what you’re saying here is especially jarring:
Think about what it would mean if you had succeeded. You would have booked (let's say) 80 processors for your simulations, and if your total set of simulations was (let's say) 90, what would have happened? You'd have run the last 10 simulations on 4 cores each and let the other 40 cores just … sit there doing nothing for the duration of an entire run? Isn't that a thoughtless use of shared computing resources?
I do understand this kind of situation: I once worked on a cluster which only accepted full-node allocations of 24 cores, and my problem definitely did not scale to that size. I had to use -partition 4x6 and similar settings, like you are trying now. But I did not try to script multiple batches into a single submission as you do, and I adjusted my batch sizes (for example, going to 3x8 or 2x12 as needed) to make sure I was using all the resources I asked for.