I am trying to run 4 simulations on one node (with 128 cores) using a 4x32 partition. At some point one of them crashes (which is fine), but I want the other simulations, which haven’t crashed, to continue running. However, if one crashes, all of them seem to get terminated. Any suggestions on how I can prevent this?
(I have tried using ‘thermo_modify lost warn lost/bond warn flush yes’, but this doesn’t help much.)
When one of the simulations crashes, LAMMPS uses the MPI communicator to terminate the calculation on all MPI processes. Since all partitions belong to the same MPI job, all of them are stopped at that point.
The only way to prevent that is to prevent LAMMPS from crashing.
Many thanks for your reply. I found another way to run the programs in parallel in my batch script, without partitions, using & and wait:
srun -n 32 job1 … &
srun -n 32 job2 … &
srun -n 32 job3 … &
srun -n 32 job4 … &
wait
Would this make the jobs ‘independent’ of each other, such that the crash of one does not affect the others?
Thanks in advance
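For illustration, the & and wait pattern from the question can be sketched as below. This is only a minimal sketch: run_case is a hypothetical stand-in for one "srun -n 32 …" line, and the second job is made to fail on purpose. It shows that, at the shell level, each background job is a separate process whose exit status can be collected on its own. It says nothing about how Slurm places the job steps on cores.

```shell
#!/bin/sh
# Hypothetical stand-in for one "srun -n 32 ..." line.
# $1 = label, $2 = exit code to simulate (non-zero imitates a crash).
run_case() {
    sleep 1
    return "$2"
}

# Start all four jobs in the background and remember their process IDs.
run_case job1 0 & pid1=$!
run_case job2 1 & pid2=$!   # this one "crashes" for demonstration
run_case job3 0 & pid3=$!
run_case job4 0 & pid4=$!

# Wait for each job separately so that one failure does not hide the others.
summary=""
for entry in "job1:$pid1" "job2:$pid2" "job3:$pid3" "job4:$pid4"; do
    label=${entry%%:*}
    pid=${entry##*:}
    if wait "$pid"; then
        summary="$summary $label=ok"
    else
        summary="$summary $label=failed"
    fi
done
echo "Summary:$summary"
```

In this sketch the failure of job2 does not stop job1, job3 or job4, since nothing ties the four processes together. Whether the same holds for real srun job steps on a given cluster is a matter of the site's Slurm configuration.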
That is a question you have to ask your HPC admins, since it depends on how processor affinity is configured. In the worst case, all four calculations would run on the same 32 CPU cores. It is not possible to give a definitive answer remotely.
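One way to get a first impression before asking the admins is to print the CPU list each process is allowed to use. The snippet below is a rough sketch that assumes a Linux node, where /proc/self/status contains a Cpus_allowed_list entry. The function name show_affinity is made up for this example.

```shell
#!/bin/sh
# Print the CPU list this process is allowed to run on (Linux only).
show_affinity() {
    if [ -r /proc/self/status ]; then
        # /proc/self/status here describes the awk process, which inherits
        # the affinity mask of the shell that started it.
        awk '/^Cpus_allowed_list:/ {print $2}' /proc/self/status
    else
        # Assumption: no /proc file system available, so nothing to report.
        echo "unknown"
    fi
}

cpus=$(show_affinity)
echo "host=$(uname -n) pid=$$ allowed_cpus=$cpus"
```

Saved under a hypothetical name such as show_affinity.sh, the script could be launched in place of each simulation (srun -n 32 sh show_affinity.sh &, four times, followed by wait). If the four job steps report the same CPU lists, they would likely be competing for the same cores. Slurm's srun also has a --cpu-bind=verbose option that reports the binding it applied, though whether binding is enabled at all depends on the site configuration.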
Thanks for your help, I’ll follow it up with the HPC admins.