Apparently inconsistent forces before/after dump

Switching to -O1 doesn’t fix it for me. Switching to gcc 10.3.0 and -O1 doesn’t fix it either. This is all Python 3.6.8 (the CentOS 7 built-in). My first idea is to track down the neighbor list differences; I’m currently looking in NPairHalfSizeBinNewtonTri.

Sorry, the deleted posts were me misunderstanding. The neighbor lists returned by the sample Python code refer to ghost atoms, and I wasn’t checking the underlying real atom index (tag in the C++ source). Is there any way to get that information (the global ID a ghost atom refers to) when parsing the neighbor list in Python?

You can use

atomid = lmp.extract_atom("id")

or

atomid = lmp.numpy.extract_atom("id")

to get the atom ID property indexed by the local index, and then

id = atomid[i]

to convert the local atom index to the atom ID.
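
Putting it together, here’s a minimal sketch (my own, not from the docs; `in.setup` is a hypothetical input script, and depending on the LAMMPS version `extract_global` may need an explicit type argument):

```python
from lammps import lammps

lmp = lammps()
lmp.file("in.setup")   # hypothetical script that builds the system

nlocal = lmp.extract_global("nlocal")
nghost = lmp.extract_global("nghost")

# Request nlocal+nghost entries so ghost atoms are covered too;
# by default only the nlocal owned atoms are returned.
atomid = lmp.numpy.extract_atom("id", nelem=nlocal + nghost)

# Any local or ghost index i from a neighbor list now maps to the
# global atom ID (the "tag" in the C++ source) via atomid[i].
```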


OK - I think I finally understand at least a bit of what’s going on, and why it’s so delicate. When I manually construct a neighbor list for my system, I always get 12 neighbors (as expected for an fcc first-neighbor cutoff). However, LAMMPS sometimes ends up with only 11 neighbors. In particular, there’s supposed to be a neighbor pair between atoms 6 and 186 (1-based indices), and LAMMPS doesn’t find it, because the pair fails the spatial ordering test. The output below is from a modified NPairHalfBinNewtonTri::build that prints the positions it compares to enforce that ordering. It finds the relevant pair twice, once with one atom as the local, once with the other. However, both times the pair fails the ordering test.

The first time, the z and y values are exactly equal and the x ordering is clearly wrong, so it goes to the continue (line 93 of npair_half_bin_newton_tri.cpp). The second time, the z values are exactly equal, but the y values are off infinitesimally, in the direction (x[j][1] == 4.17433876999299968702 < ytmp == 4.17433876999300146338) that makes it fail the ordering that time as well. As a result, the pair is rejected both times.
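
For reference, here is my reading of that ordering test as a Python paraphrase (not the actual source, which is C++):

```python
def keep_half_pair(i, j, xi, xj):
    """Paraphrase of the half-list pair-ordering test in
    NPairHalfBinNewtonTri::build(): keep a candidate pair only from
    the side where j is 'above' i, comparing z first, then y, then x,
    then the index as a final tie-break."""
    if xj[2] < xi[2]:
        return False                       # j below i in z: skip
    if xj[2] == xi[2]:
        if xj[1] < xi[1]:
            return False                   # z tied, j below in y: skip
        if xj[1] == xi[1]:
            if xj[0] < xi[0]:
                return False               # z and y tied, j below in x: skip
            if xj[0] == xi[0] and j <= i:
                return False               # all equal: break the tie by index
    return True
```

The test is supposed to accept each pair from exactly one side; here the infinitesimal y disagreement between the two ghost images makes both sides reject it.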

The only way I can think of this happening is some sort of math imprecision when generating the periodic ghost atom positions. I don’t know why it’s happening for me and for some random ASE user, but not for you, @rberger. This is with gcc 10.3.0 and -O1, so nothing terribly aggressive.

CHECK i 5 tagi 6 j 572 tagj 186 [xyz]tmp 15.90636704279138768925 19.20195834196779571812 11.80681300492785368306   x[j][012] 13.01430030773842183578 19.20195834196779571812 11.80681300492785368306  rsq 8.364050
CHECK i 0 tagi 0 j 210 tagj 1 fail x
CHECK i 185 tagi 186 j 662 tagj 6 [xyz]tmp 21.69050051289736202875 4.17433876999300146338 11.80681300492785368306   x[j][012] 24.58256724795032965858 4.17433876999299968702 11.80681300492785368306  rsq 8.364050
CHECK i 0 tagi 0 j 210 tagj 1 fail y

Can anyone point me at the place in the code where the periodic ghost atom positions are calculated, so I can investigate further?

Thanks for the extract_atom("id") syntax. I had to add nelem=nlocal+nghost, otherwise it only gave me nlocal, but it’s working nicely now.

I’ve traced the problem one more step back. The two atoms whose interaction is missing start out with exactly the same y values. However (perhaps due to the triclinic box and the values of the tilts), they end up with slightly different values after the Verlet::setup call to x2lamda, namely -0.00000000000000011102 and -0.00000000000000005551. The next thing setup does is call domain->pbc(). It detects both of those as being < lo[1] == 0.0, and adds period[1] == 1.0 to them. For one of them the result is exactly 1.0 (the initial negative value is so small that the sum rounds up). Then, when domain->pbc() checks for >= hi[1] == 1.0, it subtracts 1.0 again, and the final scaled position of that atom is exactly 0.0 (even though it started out at -0.00000000000000005551). The other atom has a slightly larger-magnitude initial negative scaled position, and its value is not rounded when period[1] == 1.0 is added to it.
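
The rounding step is easy to reproduce in plain Python (IEEE-754 doubles; lo/hi/period and the two scaled y values are the ones quoted above):

```python
lo, hi, period = 0.0, 1.0, 1.0   # scaled (lamda) box bounds along y

for y in (-0.00000000000000011102, -0.00000000000000005551):
    s = y
    if s < lo:
        s += period   # wrap into the box, as domain->pbc() does
    if s >= hi:
        s -= period   # fires only when the addition rounded up to exactly 1.0
    print(f"{y:.20f} -> {s:.20f}")
```

The smaller-magnitude value rounds to exactly 1.0 on the first wrap and ends up at exactly 0.0; the larger-magnitude one stays just below 1.0.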

As a result, after the whole process that creates the ghost atom positions (namely x2lamda → borders → lamda2x), one of the atoms’ ghost images gets its y position moved infinitesimally and the other does not. So when the neighbor list loop goes through them, they are not exactly reciprocal (exactly equal y for one version of the pair, slightly different y for the other), and both pairs are excluded by the spatial ordering.
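
To see why that round trip need not be exact, here is a sketch of the transforms as I understand Domain::x2lamda and Domain::lamda2x (lamda = h_inv * (x - boxlo), x = h * lamda + boxlo), using a made-up triclinic cell rather than my actual box:

```python
import numpy as np

# Hypothetical upper-triangular triclinic cell matrix and origin;
# h_inv is computed numerically here, while LAMMPS stores its own copy.
h = np.array([[20.0,  2.5,  0.7],
              [ 0.0, 18.0,  1.3],
              [ 0.0,  0.0, 16.0]])
h_inv = np.linalg.inv(h)
boxlo = np.zeros(3)

x = np.array([21.690500512897362, 4.1743387699930015, 11.806813004927854])
lamda = h_inv @ (x - boxlo)    # x2lamda
x_back = h @ lamda + boxlo     # lamda2x
print(x - x_back)              # generally not exactly zero
```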

This is all an internally consistent explanation of what’s happening for me, except that I don’t understand why it’s not happening when @rberger runs the same script. Would you (@rberger, or anyone else) mind running this modified script eval_no_ase_with_nn.py (6.9 KB) and posting the output? It prints the neighbor list to many digits, so maybe I can track down what’s different between our two systems.

OK - I’ve reproduced the problem with pure LAMMPS executable runs, no Python needed. It really looks like a neighbor list bug, though I suppose it could be a compiler bug. I’m on CentOS 7 with gcc 10.3.0, compiling with -O0 (it was the same with -O1); very similar behavior was seen earlier with gcc 7, although I haven’t run exactly the same test there yet.

Here’s the initial script two_atoms.in (856 Bytes), which sets the atom positions to exactly what my Python code ended up setting them to. When I run LAMMPS, it finds no neighbors (even though the two atoms are within the cutoff):

> lmp_serial -in two_atoms.in  | grep neigh
  max neighbors/atom: 2000, page size: 100000
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
Total # of neighbors = 0
Ave neighs/atom = 0.0000000

If I instead run an evaluation that reads from the data file created by the first script, using this script two_atoms_from_data.in (80 Bytes), it does find a neighbor pair, as I expect. Presumably the coordinates get wrapped/rounded enough on the way through the data file to avoid the instability that affects the first run.

> lmp_serial -in two_atoms_from_data.in  | grep neigh
  max neighbors/atom: 2000, page size: 100000
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
Total # of neighbors = 1
Ave neighs/atom = 0.50000000
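
One plausible mechanism, purely as an illustration (I haven’t checked the precision the data file is actually written with): a finite-precision text round trip can drop the last bit of a double, on top of the remapping into the box that happens when the data file is written.

```python
# A 16-significant-digit text round trip need not preserve the last
# bit of a double; 17 significant digits always would.
y = 0.1 + 0.2            # == 0.30000000000000004
s = f"{y:.15e}"          # 16 significant digits: '3.000000000000000e-01'
print(float(s) == y)     # False: the last bit was lost
```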

Can anyone reproduce this behavior in this simple test case?

With your two_atoms.in I see 0 neighbors. Adding newton off changes this to 2 neighbors.

Here is some additional debug output from NPairHalfBinNewtonTri::build()

ibin=3086, i=0
rsq: 8.36404999999992 <= 25
i = 0, j = 5  x[i] = (15.906367042791388 19.201958341967796 11.806813004927854) x[j] = (13.014300307738422 19.201958341967796 11.806813004927854)
IGNORED due to x[j][0] < xtmp
ibin=2939, i=1
rsq: 8.36404999999993 <= 25
i = 1, j = 7  x[i] = (21.690500512897362 4.1743387699930015 11.806813004927854) x[j] = (24.58256724795033 4.174338769993 11.806813004927854)
IGNORED due to x[j][1] < ytmp

Thanks, that does look consistent with what I saw when I added my own debugging output (although I found that printing more digits helped me resolve some confusion, since these differences are all right at the level of roundoff).

@akohlmey do you think this is a sufficiently simple demonstration of the issue that it could be used for fixing the underlying problem? I’d be happy to open a GitHub issue.