MPI_Allreduce errors (?)

I rebuilt LAMMPS last night and started getting strange errors when using
a Tersoff/ZBL potential. I can confirm that using eam/fs pair styles does
not reproduce the phenomenon, but then again, neither does using two MPI
processes instead of 4. I managed to reproduce it using the SiC potential
that comes with LAMMPS as follows:

# file in.breaklammps :
units metal
lattice diamond 5.43
region simbox block 0 5 0 5 -3 7
create_box 1 simbox
region mbox block INF INF INF INF 0 5
create_atoms 1 region mbox
pair_style tersoff/zbl
pair_coeff * * SiC.tersoff.zbl Si
mass 1 28
run 0

Now run with:
    mpiexec -n 4 lmp_openmpi -i in.breaklammps
The error says something like
    *** An error occurred in MPI_Allreduce
    *** on communicator MPI_COMM_WORLD
    *** MPI_ERR_TRUNCATE: message truncated
    *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort

I thought perhaps it was an error related to the packages, so I've done
the full package-update, etc., and it still did it. I then blew away all
the .cpp and .h files, re-downloaded from the Subversion database, and
tried again, so now I'm here.

Karl D. Hammond
University of Tennessee, Knoxville
[email protected]

"You can never know everything, and part of what you know is always
   wrong. A portion of wisdom lies in knowing that. A portion of courage
   lies in going on anyway."
"Nothing ever goes as you expect. Expect nothing, and you will not be
   surprised."

Without the input files, I have no idea.
Can you post a simple, small script that
shows the problem quickly and on a minimal

of procs?

Steve

I thought I already posted everything someone else would need to reproduce
it; let's try again.

Input file:
        # call this file in.breaklammps
        units metal
        lattice diamond 5.43
        region simbox block 0 5 0 5 -3 7
        create_box 1 simbox
        region mbox block INF INF INF INF 0 5
        create_atoms 1 region mbox
        pair_style tersoff/zbl
        pair_coeff * * SiC.tersoff.zbl Si
        mass 1 28
        run 0

Now run that with:
    mpiexec -n 4 lmp_openmpi -i in.breaklammps

The potential file, SiC.tersoff.zbl, is the one that comes with LAMMPS
(potentials directory).

I can confirm that the problem is reproducible after the 12Mar14 patch
(it works fine with 10Mar14, but not with 12Mar14).

-Karl

I thought I already posted everything someone else would need to reproduce
it; let's try again.

you did and i can reproduce it.

just talked to steve and we have an idea where it is coming from, but
not yet found the smoking gun.

axel.

sorry, I didn’t realize the script lines you included were
your entire script.

Axel has nearly figured out the issue I think. The 12Mar patch
altered some logic to accommodate a future package.
So it probably broke something accidentally.

Axel may post a new file soon - if not I’ll post
a patch early next week.

Steve

sorry for the late reply. the LAMMPS workshop and symposium had been
quite demanding and i needed some rest.

sorry, I didn't realize the script lines you included were
your entire script.

Axel has nearly figured out the issue I think. The 12Mar patch
altered some logic to accommodate a future package.
So it probably broke something accidentally.

yup. backing those changes out, silences the error. it is a rather
complex change, so it will take a little bit of effort to pinpoint the
exact cause.

Axel may post a new file soon - if not I'll post
a patch early next week.

for now a workaround is to use "post no" with run, which should avoid
the tainted code.

axel.

please find attached a patch which undoes the change in the march 12
patch that causes the reported problems.

lammps-undo-kokkos-stats.diff.gz (389 Bytes)

ok - I think I fixed this bug - it was confined to finish.cpp and related
to full neighbor lists and the new Kokkos checks

Steve

ok - I think I fixed this bug - it was confined to finish.cpp and related
to full neighbor lists and the new Kokkos checks

i think this time, i have to give you the "i don't understand this"
comment. you change results in code that looks like:

if (condition) break;
else break;

what is the point of that? you can just use "break;"

axel.

It’s really just a placeholder in the logic. At some point
there may be different tests put there for Kokkos vs regular
neighbor lists.

Steve

It's really just a placeholder in the logic. At some point
there may be different tests put there for Kokkos vs regular
neighbor lists.

fair enough, but the result is now a different logic than what was
there before the kokkos related changes were added on march 12th. here
is one example of the total change:

@@ -565,19 +573,23 @@ void Finish::end(int flag)
     // find a non-skip neighbor list containing full pairwise interactions
     // count neighbors in that list for stats purposes

- for (m = 0; m < neighbor->old_nrequest; m++)
+ for (m = 0; m < neighbor->old_nrequest; m++) {
       if (neighbor->old_requests[m]->full &&
- neighbor->old_requests[m]->skip == 0) break;
+ neighbor->old_requests[m]->skip == 0) {
+ if (lmp->kokkos && lmp->kokkos->neigh_list_kokkos(m)) break;
+ else break;
+ }
+ }

     nneighfull = 0;
     if (m < neighbor->old_nrequest) {
- if (neighbor->lists[m]->numneigh > 0) {
+ if (neighbor->lists[m]) {
         int inum = neighbor->lists[m]->inum;
         int *ilist = neighbor->lists[m]->ilist;
         int *numneigh = neighbor->lists[m]->numneigh;
         for (i = 0; i < inum; i++)
           nneighfull += numneigh[ilist[i]];
- }
+ } else if (lmp->kokkos) nneighfull = lmp->kokkos->neigh_count(m);

       tmp = nneighfull;
       stats(1,&tmp,&ave,&max,&min,10,histo);

you are losing the test for "neighbor->lists[m]->numneigh > 0" (where
the >0 is not needed, since numneigh is a pointer). now, if i read the
code correctly, with kokkos you can have a neighbor->lists[m] that is
a null pointer and instead you need to query kokkos for neighbor list
stats with the same index, so with that in mind, you'd need to add on
top of what is there currently the following change:

diff --git a/src/finish.cpp b/src/finish.cpp
index e9c6d3d..67c2085 100644
--- a/src/finish.cpp
+++ b/src/finish.cpp
@@ -583,13 +583,14 @@ void Finish::end(int flag)

     nneighfull = 0;
     if (m < neighbor->old_nrequest) {
- if (neighbor->lists[m]) {
+ if (neighbor->lists[m] && neighbor->lists[m]->numneigh)) {
         int inum = neighbor->lists[m]->inum;
         int *ilist = neighbor->lists[m]->ilist;
         int *numneigh = neighbor->lists[m]->numneigh;
         for (i = 0; i < inum; i++)
           nneighfull += numneigh[ilist[i]];
- } else if (lmp->kokkos) nneighfull = lmp->kokkos->neigh_count(m);
+ } else if (!neighbor->lists[m] && lmp->kokkos)
+ nneighfull = lmp->kokkos->neigh_count(m);

       tmp = nneighfull;
       stats(1,&tmp,&ave,&max,&min,10,histo);

and similarly at another location a little bit higher up add this
change to recover the original logic and protect against empty list
pointers:

diff --git a/src/finish.cpp b/src/finish.cpp
index e9c6d3d..713f02a 100644
--- a/src/finish.cpp
+++ b/src/finish.cpp
@@ -536,8 +536,10 @@ void Finish::end(int flag)
            neighbor->old_requests[m]->gran ||
            neighbor->old_requests[m]->respaouter ||
            neighbor->old_requests[m]->half_from_full) &&
- neighbor->old_requests[m]->skip == 0) {
- if (lmp->kokkos && lmp->kokkos->neigh_list_kokkos(m)) break;
+ neighbor->old_requests[m]->skip == 0 &&
+ neighbor->lists[m] && neighbor->lists[m]->numneigh) {
+ if (!neighbor->lists[m] && lmp->kokkos &&
+ lmp->kokkos->neigh_list_kokkos(m)) break;
         else break;
       }
     }

I think the current logic is fine and the old logic was overkill.

Namely, if inum > 0, numneigh will always be non-NULL.

Steve