[lammps-users] PRD - MPI_Bcast problem in prd.cpp

Michael_Rieger · November 20, 2009, 2:38pm

Dear lammps-users,

In addition to my first post on the PRD implementation in LAMMPS, I ran into another MPI-related issue with it.
For the attached system I get an MPI_BCAST error, "Message truncated".

It happens reproducibly when the PRD algorithm tries to broadcast its images to the other members, more precisely when it broadcasts MPI_INT data. Searching prd.cpp for those calls narrowed down the number of candidates. I then added printf debug lines right before each MPI_Bcast call in PRD::replicate() to narrow it down further.

In detail, I added something like "1. MPI_Bcast.." at the first block:

if (commflag == 0) {
     // debug output added for this report
     printf(" - %d %d before 1. MPI_Bcast(atom->image.. and sizeof(atom->image) gives %d \n", ireplica, me, sizeof(atom->image));
     MPI_Bcast(atom->image,atom->nlocal,MPI_INT,ireplica,comm_replica);
     MPI_Bcast(atom->x[0],3*atom->nlocal,MPI_DOUBLE,ireplica,comm_replica);

and "2. MPI_Bcast ..." at the second block:

if (me == 0) {
     // debug output added for this report
     printf(" - %d %d natoms: %d before 2. MPI_Bcast(atom->image.. and sizeof(atom->image) gives %d \n", ireplica, me, natoms, sizeof(natoms));
     MPI_Bcast(tagall,natoms,MPI_INT,ireplica,comm_replica);
     MPI_Bcast(imageall,natoms,MPI_INT,ireplica,comm_replica);
     MPI_Bcast(xall[0],3*natoms,MPI_DOUBLE,ireplica,comm_replica);
}

and finally "3. MPI_Bcast" before the last call of that kind in that routine in prd.cpp.

From the corresponding output:

LAMMPS (7 Jul 2009)
Running on 2 partitions of processors
Setting up PRD ...
Step Clock Event Correlated Replica
  - 0 0 before 1. MPI_Bcast(atom->image.. and sizeof(atom->image) gives 4
  - 0 0 before 1. MPI_Bcast(atom->image.. and sizeof(atom->image) gives 4
10186 0 0 0 0
  - 0 0 before 1. MPI_Bcast(atom->image.. and sizeof(atom->image) gives 4
1 - MPI_BCAST : Message truncated
[1] Aborting program !
[1] Aborting program!
  - 0 0 before 1. MPI_Bcast(atom->image.. and sizeof(atom->image) gives 4
  - 0 0 before 1. MPI_Bcast(atom->image.. and sizeof(atom->image) gives 4
p1_19158: p4_error: : 14
rm_l_1_19169: (770.439696) net_send: could not write to fd=5, errno = 32
p1_19158: (770.440279) net_send: could not write to fd=5, errno = 32

I conclude that the truncation must happen at some stage during the MPI communication there. My suspicion is that it might be caused by the static_cast of the double natoms to an int natoms in the prd.cpp routine.

Apart from a LAMMPS binary, I prepared a tar archive of a system for which this MPI error occurs on different platforms and MPI implementations, also with the recently patched version from 21Nov.

Any help is appreciated,
Greets, Michael

testcase.tgz (581 KB)

sjplimp · November 30, 2009, 6:19pm

I reproduced this problem, but I think it is a problem with your
simulation, not with PRD per se. I assume you are running
on 2 replicas, of one proc each. One of the replicas has lost
an atom. There is a WARNING: lost atoms message in
the log file for one of the replicas. So when they try to share
coords via the MPI_Bcast(), things get messed up.

PRD assumes that all the
replicas have identical numbers of atoms. It doesn't check for
this (probably should), but I don't think it makes sense to run
a problem where this isn't the case. The question is: why
did one of your replicas lose an atom?
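
A check of the kind mentioned above could compare the per-replica atom counts before any broadcast. The following is only a minimal sketch of that comparison in plain C++, not code from prd.cpp. It assumes the counts have already been collected across replicas (for example with an MPI_Allgather, omitted here), and the names ReplicaCheck and check_replica_counts are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical result type for illustration; not part of LAMMPS.
struct ReplicaCheck {
  bool consistent;        // true if every replica reports the same atom count
  std::size_t first_bad;  // index of the first replica that differs from
                          // replica 0 (equals counts.size() if none differs)
};

// counts[i] is assumed to hold the atom count reported by replica i,
// gathered beforehand; the MPI communication itself is not shown.
ReplicaCheck check_replica_counts(const std::vector<long>& counts) {
  for (std::size_t i = 1; i < counts.size(); ++i) {
    if (counts[i] != counts[0]) return {false, i};
  }
  return {true, counts.size()};
}
```

If the result is not consistent, the run could stop with a clear error message instead of failing later inside MPI_Bcast with "Message truncated".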

Steve

Michael_Rieger · December 1, 2009, 2:07pm

Thanks a lot!

I changed the boundaries in the z-direction from "ff" to "p" and it works now.
I also checked all my other trial PRD runs: there were no such warning messages in them, but changing the boundary style fixed those as well.
So I assume that writing the lost-atoms warning to the log file was a bit slow in those cases.

The atoms got lost simply because my bottom layer started at z=0 and some atoms apparently moved too far downwards during my MD run.

Could one use the Thermo::lost_check() routine for checking for lost atoms within PRD()?
My idea would be to check for lost atoms after quenching the replicas and throw an error in case of lost atoms.
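
To make the idea concrete, here is a rough sketch of the bookkeeping such a check needs: sum the atoms each processor currently owns and compare with the expected total. It is plain C++ with the per-processor counts passed in directly (in LAMMPS the sum would come from an MPI reduction, which is omitted here), and count_lost_atoms is a hypothetical helper, not an existing LAMMPS function.

```cpp
#include <numeric>
#include <vector>

// Hypothetical helper for illustration; not part of LAMMPS.
// nlocal_per_proc[i] is assumed to be the number of atoms owned by
// processor i of one replica; natoms_expected is the total the system
// should contain. Returns how many atoms are missing (0 if none).
long count_lost_atoms(const std::vector<long>& nlocal_per_proc,
                      long natoms_expected) {
  long ntotal = std::accumulate(nlocal_per_proc.begin(),
                                nlocal_per_proc.end(), 0L);
  return natoms_expected - ntotal;
}
```

Called after the quench on each replica, a nonzero return value would be the condition for throwing the error.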

Michael

sjplimp · December 1, 2009, 3:05pm

Your script had a line like thermo_modify lost warn/ignore. The default
is to throw an error when LAMMPS loses an atom.

Steve