Missing bond contributions from compute stress/mop

Hello all,

While performing stress calculations with compute stress/mop, I noticed that I would get statistically different results when running the same simulation on a single core vs. multiple cores. Specifically, the average stresses would decrease as the number of cores increased.

I managed to pinpoint the difference to the bond contributions. To demonstrate this I used the input file (attached) from the relevant pull request: Inclusion of bond & angle contributions to "compute stress/mop" by evoyiatzis · Pull Request #3792 · lammps/lammps · GitHub. The file contains two particles near the box center. In the first case they interact solely through a harmonic pair potential, while in the second through a harmonic bond. Running on a single core (essentially mpiexec -n 1 lmp -in in.test_bonds_mop.txt) and using compute stress/mop on the midplane between them, identical stresses are calculated for both cases, as expected.

Splitting the simulation into 2 MPI domains (mpiexec -n 2 lmp -in in.test_bonds_mop.txt) places the particles in different domains. In this case, the contribution from the harmonic pair potential is calculated as before, but the harmonic bond contribution is completely missing. This leads me to believe that the implementation of bond contributions to compute stress/mop ignores bonds that cross computational domains, even if they are within the communication cutoff (which is ensured through the pair style zero cutoff value).

This post is essentially a continuation of an older post of mine (Difference between compute pressure & stress/mop) that fell into obscurity. I decided to bring it up again since I cannot see how this could be the intended behavior, and there are no relevant comments in the documentation. I am using LAMMPS version 29Aug2024.

in.test_bonds_mop.txt (1.2 KB)

Perhaps @evoyiatzis has some comments on this.

Could you check if either newton off or newton on changes the behaviour?

Differences between runs on 1 vs. many MPI procs often come down to errors in ghost atom handling.

From a quick glance through, I suspect the error is here: lammps/src/EXTRA-COMPUTE/compute_stress_mop.cpp at 59bbc5bcc1104bdb4fb45107cd65b5d4d76dbc00 · lammps/lammps · GitHub

(where maybe atom1 needs to run up to nlocal+nghost depending on newton?)

or here:

(where the multitest may erroneously still double-count some bond pairs?).
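In case it helps to see that concern spelled out, below is a minimal, self-contained C++ sketch (hypothetical, not the actual compute_stress_mop.cpp source; the toy num_bond array only mimics LAMMPS' per-atom bond storage). A loop that stops at nlocal never visits a bond stored on a ghost atom, whereas extending the loop to nlocal+nghost finds it but then needs a tie-breaking test so the bond is not counted twice, which is exactly the newton / double-counting question raised above.

```cpp
// Toy illustration of the loop-range question (not LAMMPS source).
// Two atoms share one bond; atom 0 is owned by this proc (index < nlocal)
// and atom 1 is a ghost copy (index >= nlocal).  In this example the bond
// happens to be stored on the ghost atom, mimicking what can occur with
// newton_bond off.
#include <cstdio>

int main()
{
  const int nlocal = 1;              // atoms owned by this proc
  const int nghost = 1;              // ghost copies of atoms owned elsewhere
  const int nall   = nlocal + nghost;

  // Number of bonds stored per atom (the real code also stores the bond
  // partner and type); here the single bond is stored on the ghost atom.
  int num_bond[nall] = {0, 1};

  int tallied_local = 0, tallied_all = 0;

  // Loop only over owned atoms: the bond stored on the ghost is missed.
  for (int atom1 = 0; atom1 < nlocal; atom1++)
    tallied_local += num_bond[atom1];

  // Loop over owned + ghost atoms: the bond is found, but a real tally
  // would then need a test to avoid counting the same bond on two procs.
  for (int atom1 = 0; atom1 < nall; atom1++)
    tallied_all += num_bond[atom1];

  std::printf("bonds seen up to nlocal: %d, up to nlocal+nghost: %d\n",
              tallied_local, tallied_all);
  return 0;
}
```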

I forgot to mention this in my original post: both newton on and newton off give the same result.

Regarding the areas of the source code that you pointed out, sadly I don’t think I have the programming knowledge to provide feedback. But from what I can tell from the tests I ran, this issue is indeed a bug and not a mistake in my setup.

Thanks – your tests are helping tremendously.

Could you try a 2-proc run, but with both beads forced into the same MPI proc? You can accomplish this (for example) by forcing the split to be along the z-axis with

processors 1 1 2

and then generating the particles at z = 4 and 6 instead of -1 and 1. (For good measure, we should also try z = -6 and -4.)

Ran some more trials following your suggestions. First of all, I decided to isolate the effect to the z-axis, so both particles’ x and y coordinates are set to zero. MPI procs are split with the processors 1 1 2 command, and newton is always set to off. Let’s name the particles 1 & 2, following the order in which they were created. This is a summary of my results:

  • z_1<0 (belongs to the bottom subdomain) while z_2>0. Bond contributions are actually calculated correctly.
  • z_1>0 while z_2<0. NO bond contributions. So just switching the domains of the particles gives different results.
  • z_1>z_2>0 or z_2>z_1>0. NO bond contributions.
  • z_1<z_2<0 or z_2<z_1<0. Correct bond contributions.

It looks like bond contributions will be accounted for when both particles belong to the bottom subdomain, but will be ignored when both belong to the top subdomain. There is also the peculiar case where the particles belong to different subdomains. This will only give correct results when particle 1 occupies the bottom subdomain.

Following that, I tested with 4 subdomains along the z-axis and the results were similar. Bond contributions were calculated correctly only when both particles occupied the ‘lowest’ domain, and for the peculiar case where particle 1 occupied the lowest and particle 2 the second-lowest domain.

I hope this helps.


Yes, it is a bug. The serial version gives the correct results while the parallel version does not.

I suggest either adding a check so that only serial execution is allowed, or reverting the PR for the bond / angle / dihedral contributions to stress/mop. I am almost certain that the same problem will appear for the angle / dihedral terms.

@chrisp can you also check whether erroneous results are obtained in parallel execution when angle / dihedral interactions are present?

Done in commit: Collected small changes and fixes by akohlmey · Pull Request #4357 · lammps/lammps · GitHub


I think I’ve found the bug:

In init(), bondflag and other flags are set inside the if (comm->me == 0) scope.

This means that all procs other than proc 0 have all flags set to zero. Thus, in compute_vector, calculations of all molecular contributions on procs > 0 are skipped.

This is consistent with @chrisp’s results, where the only determinant triggering the bug was whether the bond was “owned” by the first proc (proc 0 in C++ numbering), regardless of whether the bond crossed subdomains (very good testing!!).
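To make the failure mode concrete, here is a minimal, self-contained MPI sketch (plain MPI, not the LAMMPS class; the variable name bondflag is only borrowed for illustration): a flag assigned only inside the rank-0 block stays zero on every other rank, so whatever work it guards is silently skipped there. Setting the flag on all ranks, or broadcasting rank 0’s value, restores consistent behaviour.

```cpp
// Minimal illustration of the "flag set only on rank 0" bug pattern.
// Build/run (assumed): mpicxx flag_bug.cpp -o flag_bug && mpiexec -n 2 ./flag_bug
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // Buggy pattern: the flag is only ever set on rank 0, e.g. because the
  // assignment sits inside a block meant to print a warning exactly once.
  int bondflag = 0;
  if (rank == 0) bondflag = 1;

  // Possible fix: decide the flag on every rank, or broadcast rank 0's value:
  // MPI_Bcast(&bondflag, 1, MPI_INT, 0, MPI_COMM_WORLD);
  int bondflag_fixed = 1;

  std::printf("rank %d: buggy bondflag = %d, fixed bondflag = %d\n",
              rank, bondflag, bondflag_fixed);

  MPI_Finalize();
  return 0;
}
```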

I will note that the “addressing” approach (all the m tests) in compute_vector and throughout the compute feels like there’s room for optimisation. Low-hanging fruit: the various local calculations should be done and tallied into values_local first, and then one big MPI_Allgather should be done at the end, to reduce MPI comms (and imbalance).
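As a rough illustration of that restructuring (a generic sketch, not the current compute_stress_mop code): each rank accumulates all of its contributions into values_local and issues exactly one collective at the end. It is sketched here with a sum-type MPI_Allreduce as the single collective; whether an allgather or a reduce is the right choice depends on how the per-plane values are laid out.

```cpp
// "Tally locally, communicate once" sketch (plain MPI, dummy data).
#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  const int nvalues = 3;                        // stand-in for the per-plane values
  std::vector<double> values_local(nvalues, 0.0);
  std::vector<double> values_global(nvalues, 0.0);

  // ... pair, bond, angle, ... contributions would all be tallied here ...
  for (int i = 0; i < nvalues; ++i)
    values_local[i] += rank + 1.0;              // dummy per-rank contribution

  // One collective at the end instead of several scattered communications.
  MPI_Allreduce(values_local.data(), values_global.data(), nvalues,
                MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  if (rank == 0)
    std::printf("values_global[0] = %g\n", values_global[0]);

  MPI_Finalize();
  return 0;
}
```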
