MPI_Sendrecv error

Dear all,

After several hours of an MD run I get an error:

Fatal error in MPI_Sendrecv:
Message truncated, error stack:
MPIDI_CH3U_Receive_data_found(255): Message from rank 9 and tag 0 truncated; 11112 bytes received but buffer size is 4

srun: error: n207: task 20: Exited with exit code 1
srun: Terminating job step 1589472.0
slurmd[n3]: *** STEP 1589472.0 KILLED AT 2012-06-01T02:40:04 WITH SIGNAL 9 ***
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
slurmd[n206]: *** STEP 1589472.0 KILLED AT 2012-06-01T02:40:04 WITH SIGNAL 9 ***
slurmd[n4]: *** STEP 1589472.0 KILLED AT 2012-06-01T02:40:04 WITH SIGNAL 9 ***
slurmd[n9]: *** STEP 1589472.0 KILLED AT 2012-06-01T02:40:04 WITH SIGNAL 9 ***
slurmd[n16]: *** STEP 1589472.0 KILLED AT 2012-06-01T02:40:04 WITH SIGNAL 9 ***

If I continue the run from the point where the calculation stopped, I get the same error again after a certain time. It can happen at any random step. It seems I should not look for the cause in the input script, but I have attached it anyway.

I first thought it was a problem with the compilation, but I compiled the latest version of LAMMPS on two different clusters and the error appears on both, so I guess nothing is wrong with the compilation. I compiled one with mpiCC and mvapich2 and the other with mpicxx and mvapich2.

Any ideas where to look for the reason?

Best wishes,
Manana Koberidze

in.phase (1.3 KB)

this is just a guess, but are you by any chance
using too aggressive neighbor list settings?
why the "check no"?

neigh_modify every 10 delay 0 check no

perhaps you are losing atoms without noticing.
i would worry that some of the analysis that
you are running will not be able to handle that.
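
as a rough sketch of a more conservative setting (an assumption, since it only makes sense if nothing else in the input depends on the 10-step interval), something like the following would let LAMMPS test on every step whether a rebuild is needed, and stop with an error as soon as atoms go missing instead of continuing silently:

```
# check on every step whether any atom has moved far enough
# to require a neighbor list rebuild
neigh_modify every 1 delay 0 check yes

# stop with an error if atoms are lost; only matters if the
# input currently sets this to "ignore" or "warn"
thermo_modify lost error
```

if the crash goes away or turns into a "lost atoms" error with these settings, that would point to the neighbor list rather than to MPI or the compilation.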

axel.

I would also print out thermo info every step
near where the breakdown occurs. Your system
could be blowing up.
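
A minimal sketch of what that could look like, assuming the run is continued from a restart file written shortly before the crash. The file name and the run length are placeholders and are not taken from the attached input:

```
read_restart restart.before_crash   # hypothetical file name

# ... same pair_style, fixes, computes etc. as in the original input ...

thermo 1                            # print thermo output every step
thermo_style custom step temp press pe ke etotal atoms
thermo_modify flush yes             # flush output so the last lines survive a crash
run 2000                            # placeholder: long enough to reach the failure
```

The thermo_modify line comes after thermo_style because thermo_style resets the thermo_modify options. A sudden jump in temp, press or pe, or a change in the atoms count in the last steps before the MPI error, would suggest that the system is blowing up.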

Steve