Dear Axel,
thank you for your thoughts on this. Here is the Slurm log file for the
run I reported, in case it helps:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 187
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 2339032.0 ON n16-043 CANCELLED AT
2016-03-28T01:12:1
slurmstepd: error: *** JOB 2339032 ON n16-043 CANCELLED AT
2016-03-28T01:12:13 D
srun: got SIGCONT
srun: forcing job termination
Surprisingly, the same thing happens even when read_restart cannot find
the input file. With 256 cores (16 nodes x 16 cores), as in the PPPM
example, the job did not terminate correctly and hung. The same problem
occurred with 32 cores (2 nodes x 16 cores). However, LAMMPS terminated
as expected when I reran the script with 16 cores on a single node.
I tried to reproduce the problem with the small program below (source:
http://mpitutorial.com/tutorials/mpi-hello-world/), but everything
worked as expected. Do you think this is a case for our system
administrator, or could it be LAMMPS-related?
Best wishes,
Peter
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>
int main(int argc, char **argv) {
    // Initialize the MPI environment
    MPI_Init(NULL, NULL);

    // Get the number of processes
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Get the rank of the process
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Get the name of the processor
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    // Print a hello-world message
    printf("Hello world from processor %s, rank %d"
           " out of %d processors\n",
           processor_name, world_rank, world_size);

    // Try to abort the job; if the abort works, nothing below should run
    MPI_Abort(MPI_COMM_WORLD, 1);

    // Sleep so that a hung job is easy to spot
    sleep(600);

    // Finalize the MPI environment (never reached if the abort succeeds)
    MPI_Finalize();
    return 0;
}
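In case the exact launch setup matters, the job script for the 32-core case looked roughly like this (the binary name is a placeholder; the program was compiled beforehand with mpicc, and the partition/account options are omitted):

```shell
#!/bin/bash
# 2 nodes x 16 cores = 32 MPI ranks, matching the failing LAMMPS case
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16

srun ./mpi_hello_abort
```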