How to prevent simulation crash across all partitions

Priyanka_Iyer · May 31, 2021, 8:39am

Hello everyone,
I am trying to run 4 simulations on one node (with 128 cores) with a 4x32 partition. At some point one of them crashes (which is fine), but I want the other simulations, which haven’t crashed, to continue running. However, if one crashes, all of them seem to get terminated. Any suggestions on how I can prevent this?
(I have tried using ‘thermo_modify lost warn lost/bond warn flush yes’, but this doesn’t help much.)

akohlmey · May 31, 2021, 3:09pm

When one of the simulations crashes, LAMMPS uses the MPI communicator to terminate the calculation on all parallel processes. At that point all parallel partitions are stopped.

The only way to prevent that is to prevent LAMMPS from crashing.

Priyanka_Iyer · June 1, 2021, 9:36am

Many thanks for your reply. I found another way to run the programs in parallel without partitions, using & and wait in my batch script:

srun -n 32 job1 … &
srun -n 32 job2 … &
srun -n 32 job3 … &
srun -n 32 job4 … &
wait

Would this make the jobs independent of each other, so that the crash of one does not affect the others?

Thanks in advance
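
For reference, a minimal sketch of the same background-and-wait pattern that also records the exit status of each job separately, so a crash of one is at least visible in the batch log. The helper name run_independent is hypothetical and not part of Slurm or LAMMPS. The sketch only reports what happened. Whether the remaining job steps really keep running after one fails still depends on how the cluster is configured.

```shell
#!/bin/bash
# Sketch only: run several commands in the background and report the exit
# status of each one separately. run_independent is a hypothetical helper.
run_independent() {
    local pids=() cmd pid status i failed=0
    for cmd in "$@"; do
        # Each command string is started as its own background process.
        bash -c "$cmd" &
        pids+=("$!")
    done
    i=1
    for pid in "${pids[@]}"; do
        # 'wait PID' returns the exit status of that specific process.
        if wait "$pid"; then
            echo "job $i: ok"
        else
            status=$?
            echo "job $i: failed with status $status"
            failed=$((failed + 1))
        fi
        i=$((i + 1))
    done
    # The return value is the number of failed jobs (0 means all succeeded).
    return "$failed"
}
```

In the batch script, each argument would be one of the `srun -n 32 …` command lines, passed as a quoted string.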

akohlmey · June 1, 2021, 11:09am

This is a question you have to ask your HPC admins, as it depends on how processor affinity is configured. In the worst case, all four calculations would run on the same 32 CPU cores. However, it is not possible to give a definitive answer remotely.
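
One way to gather information before talking to the admins is to look at the affinity mask each task actually receives. The following is a rough, Linux-specific sketch. The helper name show_affinity is hypothetical, and it assumes /proc is available on the compute node.

```shell
#!/bin/bash
# Sketch only (Linux-specific): print which CPU cores the current process is
# allowed to run on. show_affinity is a hypothetical helper name.
show_affinity() {
    local cpus=""
    if [ -r /proc/self/status ]; then
        # awk inherits the affinity mask of this shell, so the
        # Cpus_allowed_list it reads reflects the cores granted to the task.
        cpus=$(awk '/^Cpus_allowed_list:/ {print $2}' /proc/self/status)
    fi
    # Fall back to "unknown" if /proc is missing or the field was not found.
    cpus=${cpus:-unknown}
    echo "host=$(uname -n) pid=$$ cpus=${cpus}"
}
```

Saved in a small script that calls show_affinity, this could be launched with `srun -n 32` in place of each of the four jobs. If all four steps report the same cpus range, that would match the worst case described above.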

Priyanka_Iyer · June 1, 2021, 11:43am

Thanks for your help, I’ll follow it up with the HPC admins.