Out of range atoms (PPPM) | LAMMPS is hanging instead of terminating

Dear LAMMPS developers,

I understand that the error "Out of range atoms - cannot compute PPPM"
can, for example, be caused by insufficient precision of the kspace
solver or incorrect "neigh_modify" settings, as discussed here:

http://lammps.sandia.gov/threads/msg06996.html
http://lammps.sandia.gov/threads/msg41759.html

If LAMMPS encounters this problem and raises an error, I would expect it
to terminate. However, in one of my recent simulations of SPC/E water
(14Mar16 version) this was not the case: LAMMPS simply hung, with the
last line in the log file being

"ERROR on proc 187: Out of range atoms - cannot compute PPPM
(../pppm.cpp:1918)"

until the job ran out of computing time.

I'm not sure if this is the expected behaviour and would appreciate some
advice.

Best regards,

Peter

This should not happen. If there is a hang, it would have to be caused by
either the MPI library or something else. Line 1918 of pppm.cpp calls
error->one(), which in turn only calls MPI_Comm_rank() and MPI_Abort().

The two scenarios I can think of where this could hang would be either a
multi-partition run, where the MPI library only terminates the processes
associated with the faulting partition, or the other processes being
stuck in some incorrectly programmed collective MPI call before being
terminated.
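
For illustration only, here is a minimal sketch of that pattern (not the
actual LAMMPS Error::one() code; the message text and error handling are
simplified assumptions): one faulting rank prints the error and calls
MPI_Abort(), while the remaining ranks sit in a collective call and rely
on the abort to tear them down.

#include <mpi.h>
#include <stdio.h>

/* Sketch of a "one-rank" fatal error: report from the faulting rank,
   then ask MPI to terminate every process in the communicator. */
static void error_one(const char *msg)
{
    int me;
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    fprintf(stderr, "ERROR on proc %d: %s\n", me, msg);
    MPI_Abort(MPI_COMM_WORLD, 1);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int me;
    MPI_Comm_rank(MPI_COMM_WORLD, &me);

    /* Only one rank hits the error path, as in the PPPM case. */
    if (me == 0)
        error_one("Out of range atoms - cannot compute PPPM");

    /* The other ranks wait in a collective; a working MPI_Abort()
       must still terminate them, otherwise the job hangs here. */
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}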

axel.

Dear Axel,

Thank you for your thoughts on this. Here is the Slurm log file for the
run I reported, in case it might help:

application called MPI_Abort(MPI_COMM_WORLD, 1) - process 187
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 2339032.0 ON n16-043 CANCELLED AT
2016-03-28T01:12:1
slurmstepd: error: *** JOB 2339032 ON n16-043 CANCELLED AT
2016-03-28T01:12:13 D
srun: got SIGCONT
srun: forcing job termination

Surprisingly, the same thing even happens if read_restart can't find the
input file. With 256 cores (16 nodes x 16 cores), as in the PPPM
example, the process did not terminate correctly and hung. The same
problem occurred with 32 cores (2 nodes x 16 cores). However, LAMMPS
terminated as expected when I reran the script using 16 cores on a
single node.

I tried to reproduce the problem with the small programme below (source:
http://mpitutorial.com/tutorials/mpi-hello-world/), but everything
worked as expected. Do you think this is a case for our system
administrator, or could it be LAMMPS-related?

Best wishes,

Peter

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(NULL, NULL);

    // Get the number of processes
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Get the rank of the process
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Get the name of the processor
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    // Print off a hello world message
    printf("Hello world from processor %s, rank %d"
           " out of %d processors\n",
           processor_name, world_rank, world_size);

    // Try to abort job
    MPI_Abort(MPI_COMM_WORLD, 1);

    // Wait
    sleep(600);

    // Finalize the MPI environment.
    MPI_Finalize();
    return 0;
}

In your test, you are calling MPI_Abort() from all MPI ranks. LAMMPS calls
MPI_Abort() in the scenarios you describe from only one MPI rank.

Please try again after changing:

MPI_Abort(MPI_COMM_WORLD, 1);

to:

if (world_rank == 0) {
    MPI_Abort(MPI_COMM_WORLD, 1);
}

axel.

Dear Axel,

With your suggested modification, the programme no longer terminates
correctly when I run it on two or more nodes, so the modified script
below reproduces the odd MPI behaviour I encountered earlier with LAMMPS.
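
For reference, the modified reproducer looks like this (reconstructed
here from the hello-world test above, with the abort restricted to rank
0 as suggested):

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(NULL, NULL);

    // Get the number of processes and the rank of this process
    int world_size, world_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Print a hello world message
    printf("Hello world from rank %d out of %d processors\n",
           world_rank, world_size);

    // Abort from a single rank only, as LAMMPS does via error->one()
    if (world_rank == 0) {
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    // The remaining ranks idle here; a correct MPI_Abort() should
    // still terminate them instead of leaving the job hanging
    sleep(600);

    // Finalize the MPI environment (never reached if the abort works)
    MPI_Finalize();
    return 0;
}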

After some further testing I found that everything works fine with
OpenMPI 1.10 instead of the Intel MPI library (version 5.0 Update 3
Build 20150128, build id 11250). So it seems to be entirely an MPI
issue, and I've contacted our system administrators. As soon as I know
what caused the problem, I'll post it on the list for completeness.

Thank you for your help!

Best wishes,

Peter

Dear Axel,

Our system administrator looked into this issue and found that the
problem is indeed caused by a bug in the Intel MPI 5.0.3.048 library.
With the fixed version, 5.0.3.049, everything works as expected.

Thanks again for your help!

Peter