MPI dynamic processes, adaptive processes

Lammps users:

Does anyone have any experience with schedulers that can dynamically and adaptively change job size? An example of what I want to do: (1) submit a large MPI LAMMPS job that runs for weeks, spawning MPI processes and taking up the whole cluster (or some maximum) if the resources are available; (2) if, in that time, another job of equal or higher priority is submitted, the first job contracts (spawned processes are terminated) but continues. Has anyone done this? I read the SLURM scheduler documentation at http://slurm.schedmd.com/faq.html#job_size

but it is very brief and I am not sure if this is the right direction.

Andrew Petersen

Dear Andrew,

I think this is a harder problem to solve than just changing the scheduler. Presumably, when tasks are taken away, LAMMPS would need to handle the kill signal, re-decompose the system, and then migrate all of the atoms to the new processors. Similarly for the addition of new tasks.

Niall is correct. What MPI version supports adding or subtracting MPI processes in a running simulation? Not v1 or v2, so far as I know.

And LAMMPS (or any application) would have to be coded to communicate with MPI and the scheduler appropriately.

Steve


Since internal checkpointing (via restart files) in LAMMPS is easy and efficient, there is very little need to do it like this.
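A minimal sketch of what the restart-based approach looks like on the LAMMPS side. The file names in.increment and restart.latest, and the variable nsteps, are placeholders; the very first increment would use read_data instead of read_restart:

```
read_restart  restart.latest        # continue from the previous increment
restart       10000 tmp.a tmp.b     # alternate intermediate restart files every 10000 steps
run           ${nsteps} upto        # "upto" counts from the restarted timestep
write_restart restart.latest        # checkpoint for the next increment
```

Here ${nsteps} would be passed in from the job script, e.g. via lmp -var nsteps 21600000.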

What I and other people have done in the past is to run the simulation in increments, using a scheduler that allows specifying a range of resources, e.g. 200-400 nodes, and a range for the maximum walltime. The scheduler then uses the "best fit"; in the job submission script one can compute the resources that were actually granted and adjust the number of time steps so that this specific increment of the simulation fits within the provided reservation.
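The increment-sizing idea above can be sketched as a SLURM batch script. This is only a sketch under assumptions: it takes a node range, assumes roughly linear scaling, and uses a hypothetical calibrated throughput (STEPS_PER_NODE_HOUR) that you would measure with a short benchmark on your own system; in.increment is a placeholder input file.

```shell
#!/bin/bash
#SBATCH --nodes=200-400
#SBATCH --time=24:00:00

# How many nodes the scheduler actually granted (default for testing outside SLURM).
NODES=${SLURM_JOB_NUM_NODES:-200}
WALL_HOURS=24
STEPS_PER_NODE_HOUR=5000   # hypothetical calibrated throughput, steps per node-hour

# Size this increment to fit the reservation, with a 10% margin for I/O and startup.
NSTEPS=$(( NODES * WALL_HOURS * STEPS_PER_NODE_HOUR * 9 / 10 ))

echo "running $NSTEPS steps on $NODES nodes"
if command -v srun >/dev/null 2>&1; then
    srun lmp -in in.increment -var nsteps "$NSTEPS"
fi
```

Chaining such increments (each one reading the restart file written by the previous one) gives the same weeks-long trajectory without ever holding a weeks-long reservation.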

Running a single calculation for weeks is practically never a good idea. Since the cost of restarting/checkpointing is negligible in all but the most extreme cases, that is the way to go. All these other things fall into the category "good from far, but far from good", IMNSHO.

axel.