Dear lammps-users,
In addition to my first post on the PRD implementation in LAMMPS, I ran into another MPI-related issue with it.
For the attached system, I get an MPI_BCAST error "Message truncated".
It happens reproducibly when the PRD algorithm tries to broadcast its images to the other members, more precisely when it broadcasts MPI_INT data. Searching prd.cpp for such calls narrowed down the number of candidates. I then added printf debug lines right before each MPI_Bcast call in PRD::replicate() to narrow it down further.
In detail, I added something like "1. MPI_Bcast.." at the first block:
if (commflag == 0) {
// debug added by [email protected]...
printf(" - %d %d before 1. MPI_Bcast(atom->image.. and sizeof(atom->image) gives %d \n", ireplica, me, sizeof(atom->image));
MPI_Bcast(atom->image,atom->nlocal,MPI_INT,ireplica,comm_replica);
MPI_Bcast(atom->x[0],3*atom->nlocal,MPI_DOUBLE,ireplica,comm_replica);
and "2. MPI_Bcast ..." at the second block:
if (me == 0) {
// debug added by [email protected]...
printf(" - %d %d natoms: %d before 2. MPI_Bcast(atom->image.. and sizeof(atom->image) gives %d \n", ireplica, me, natoms, sizeof(natoms));
MPI_Bcast(tagall,natoms,MPI_INT,ireplica,comm_replica);
MPI_Bcast(imageall,natoms,MPI_INT,ireplica,comm_replica);
MPI_Bcast(xall[0],3*natoms,MPI_DOUBLE,ireplica,comm_replica);
}
and finally "3. MPI_Bcast" in front of the last call of that kind in that routine in prd.cpp.
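One caveat about these debug lines: sizeof applied to a pointer returns the size of the pointer, and sizeof(natoms) returns the size of an int, so neither shows how many elements MPI_Bcast is asked to transfer. Below is a minimal sketch of a more informative debug line, assuming atom->image is a plain int pointer. The helper names bcast_debug_line and pointer_size are hypothetical and not part of prd.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper (not part of prd.cpp): formats a debug line from the
// element count that is actually handed to MPI_Bcast.
std::string bcast_debug_line(int call_id, int ireplica, int me, int count) {
    char buf[128];
    std::snprintf(buf, sizeof(buf),
                  " - %d %d before %d. MPI_Bcast, count = %d",
                  ireplica, me, call_id, count);
    return std::string(buf);
}

// sizeof on a pointer yields the size of the pointer itself; it does not
// depend on how many elements the pointer refers to.
std::size_t pointer_size(const int *p) {
    (void)p;  // only the static type matters here
    return sizeof(p);
}
```

Printing atom->nlocal (first block) or natoms (second block) in this way on every rank would show whether the sender and the receivers pass the same count.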
From the corresponding output:
LAMMPS (7 Jul 2009)
Running on 2 partitions of processors
Setting up PRD ...
Step Clock Event Correlated Replica
- 0 0 before 1. MPI_Bcast(atom->image.. and sizeof(atom->image) gives 4
- 0 0 before 1. MPI_Bcast(atom->image.. and sizeof(atom->image) gives 4
10186 0 0 0 0
- 0 0 before 1. MPI_Bcast(atom->image.. and sizeof(atom->image) gives 4
1 - MPI_BCAST : Message truncated
[1] Aborting program !
[1] Aborting program!
- 0 0 before 1. MPI_Bcast(atom->image.. and sizeof(atom->image) gives 4
- 0 0 before 1. MPI_Bcast(atom->image.. and sizeof(atom->image) gives 4
p1_19158: p4_error: : 14
rm_l_1_19169: (770.439696) net_send: could not write to fd=5, errno = 32
p1_19158: (770.440279) net_send: could not write to fd=5, errno = 32
I conclude from this that the truncation must happen at some stage of the MPI communication there. My suspicion is that it is caused by the static_cast of the double natoms to an int natoms in prd.cpp.
Apart from a LAMMPS binary, I prepared a tar archive of a system for which this MPI error occurs on different platforms and with different MPI implementations, also with the recently patched version from 21 Nov.
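To test this suspicion in isolation, a small check like the following could be used. It is only a sketch: count_fits_int is a hypothetical helper, not a LAMMPS function, and it only reports whether a double-valued atom count converts to int without loss.

```cpp
#include <cassert>
#include <climits>
#include <cmath>

// Hypothetical helper: returns true if a double-valued atom count can be
// converted to int and back without changing its value.
bool count_fits_int(double natoms) {
    if (!std::isfinite(natoms)) return false;   // reject NaN and infinity
    if (natoms < 0.0 || natoms > static_cast<double>(INT_MAX)) {
        return false;                           // outside the range of int
    }
    // Inside the range, the cast only drops a fractional part, if any.
    return static_cast<double>(static_cast<int>(natoms)) == natoms;
}
```

If the check passes for the natoms value of the attached system, the cast itself is unlikely to be the cause, and a count that differs between the sending and the receiving ranks would be the next thing to look at.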
Any help is appreciated,
Greets, Michael
testcase.tgz (581 KB)