Apparently inconsistent forces before/after dump

I have a very simple input file, see below. Is there any plausible way for retest_0.dump and retest_1.dump to be different? Atoms seem to move (perhaps only by multiples of the periodic cell vectors; I'm still checking), and the forces are definitely noticeably different. It's an unperturbed fcc supercell, and the initial forces are all essentially 0 (~1e-14), but the second time they are not. This seems so basic that I can't imagine it's actually a LAMMPS bug (I'm using the latest GitHub version), but I can't figure out why it's happening.

read_data manual.lammps-data

pair_style eam/alloy
pair_coeff * * CuAg.eam.alloy Cu Ag

dump f0 all custom 1 retest_0.dump id type x y z fx fy fz
run 0

read_dump retest_0.dump 0 type x y z fx fy fz

dump f1 all custom 1 retest_1.dump id type x y z fx fy fz
run 0

By default, dump files write out positions with 6 digits of precision, while double precision floating-point numbers carry 15-16 significant digits. So any position that is not "exact" in 6 digits is truncated on output, and re-reading those truncated coordinates can produce quite significant forces.

try adding:

dump_modify f0 format float %20.15g

before the run 0

There will still be some minor differences because floating-point numbers are stored internally in a base-2 exponential format, and not all of those numbers can be represented exactly in base 10, and vice versa (the classic example is 0.1).
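The effect of the default precision is easy to see in plain Python. This is a standalone sketch; the `%g` format here stands in for the default dump float format, and the sample coordinate is made up:

```python
# A coordinate with full double precision (made-up value for illustration).
x = 12.3456789012345

# Formatting with %g keeps only 6 significant digits, like a default dump.
x_dump = float("%g" % x)
print(x_dump)            # 12.3457
print(abs(x - x_dump))   # ~2.1e-5: a real displacement once re-read

# With the %20.15g format suggested above, 15 digits survive the round trip.
x_hi = float("%20.15g" % x)
print(abs(x - x_hi))     # 0.0 for this value
```

A displacement of ~2e-5 per atom is more than enough to break the perfect fcc symmetry and produce visible forces.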


Thanks, that fixed it. Now I need to understand why something very similar seems to be happening in ASE lammpslib as well. I’ll follow up if the problem there is real.

Should it ever be possible to create a dump file with forces that are not consistent with the atomic positions in the same file?

I have a dump file created from within a call to ASE's lammpslib.py, in high precision using your dump_modify suggestion. That file (ase_eval.lammps-dump-text (28.3 KB)) has quite significant forces (and indeed the lammpslib call returns those non-zero forces). When I read that file with a LAMMPS executable and re-evaluate it (recalc.in (437 Bytes), CuAg.eam.alloy.bz2 (497.8 KB)), the forces come out small again.

I strongly suspect that either I'm doing something wrong or there's a bug in lammpslib, but if I can't reproduce the forces that LAMMPS computes when run via lammpslib, even with what appear to be identical positions, I don't know how to debug it. As an additional data point, the problem shows up with this EAM potential, but I haven't been able to reproduce it with lj/cut so far.

If you have any suggestions as to what might be wrong or what I could check, I’d appreciate it.

First off, I would not transfer a dump file; use either a binary restart or a data file as the baseline check.
Just issue a write_data or write_restart from within ASE, then transfer the resulting file to the standalone LAMMPS executable and redo the dump to see if there is an inconsistency. From within ASE you should also be able to read the data/restart file back and reproduce consistent forces.
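As a sketch of that round trip (file names are placeholders; the pair settings are copied from the original input):

```
# From within ASE/lammpslib, write a full-precision baseline:
write_data      baseline.data

# Then, in a standalone LAMMPS input, re-read and re-evaluate:
read_data       baseline.data
pair_style      eam/alloy
pair_coeff      * * CuAg.eam.alloy Cu Ag
dump            f all custom 1 recheck.dump id type x y z fx fy fz
dump_modify     f format float %20.15g
run 0
```

Unlike a dump file, the data file stores full-precision coordinates, so this avoids the truncation issue from the first part of this thread entirely.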

As for possible issues with ASE, I suggest getting confirmation from the ASE developers about which version of LAMMPS they last verified the interface against. There have been some cleanups, bug fixes, and changes to the library interface (and thus the Python module), and those may result in incompatibilities if the ASE interface was not properly updated.

I'll check. In any case, I'd like to be able to use the latest version in general, so for all intents and purposes I am the "ASE developers" in this case, although it would probably still be useful to have a working version to compare against.

I have confirmed that I get the same error with both the version ASE's CI uses (03 March 2020) and the latest. Now I've also been able to reproduce it with lj/cut.

To summarize: I can invoke LAMMPS from ASE lammpslib and, within the Python script, issue a dump command and a write_data command. The Python call returns incorrect forces (>> 0 for a perfect-symmetry fcc supercell), and the dump command issued from Python writes a file with the same wrong forces. This happens for one particular supercell size (6x6x6 primitive fcc cells), but not for other sizes (2^3 … 7^3).

If I then run a conventional lammps executable rereading from either the dump output or the write_data output, I get sensible forces, ~1e-12, for the atom that has the wrong force in python.

I can’t prove or disprove a bug in lammpslib yet, but it’s definitely true that a dump file generated by lammps shows large forces and if you read that back in and re-evaluate you get 0 forces. Is there any way to figure out why lammps is ending up with a non-zero force for that atom?

This looks like it requires good old-fashioned debugging, checking every bit of data before and after and at every step in between. This is not something where I can give you a simple "do this, not that" kind of recommendation. I have a hard time even coming up with a plausible explanation for such a situation.

Unfortunately, that's what I suspected. Out of curiosity, what would it take to entice you or one of the other developers who are familiar with the Python/library interface internals to tackle this? An ASE reproducer script? Reproducing it with Python but without any ASE, just lammps.lammps?

I can’t talk for others. In my case a lot. My “dance card” is overfull already. And I don’t use ASE and never have and don’t have an interest to learn.

Is there a particular developer who is the primary contact for the Python interface?

No. I've recently been doing a large amount of the refactoring work on the library interface and, by extension, the Python module. There is also my colleague Richard Berger, who is the primary author of the PyLammps code that sits on top of it. Beyond that there is Steve, who started it.

If you can make it happen with just Python and lammps.lammps I’ll take a look.

Thanks, I’ll see if I can do that.

Meanwhile, I discovered that it works OK with neighbor style nsq, but not nbin. I'm willing to do some digging, especially if someone can point me to where in the code I could add some debugging prints in the loop that actually iterates over bins and checks distances, to see whether the neighbor list is the first thing that goes wrong.

Here is one way to extract neighbor list information:

from lammps import lammps

L = lammps()

# TODO: create system

# extract neighbor list information
idx = L.find_pair_neighlist("lj/cut", reqid=0)
mylist = L.numpy.get_neighlist(idx)
nlocal = L.extract_global("nlocal")
nghost = L.extract_global("nghost")
mass = L.numpy.extract_atom("mass")
atype = L.numpy.extract_atom("type", nelem=nlocal+nghost)
x = L.numpy.extract_atom("x", nelem=nlocal+nghost, dim=3)
v = L.numpy.extract_atom("v", nelem=nlocal+nghost, dim=3)
f = L.numpy.extract_atom("f", nelem=nlocal+nghost, dim=3)

for iatom, neighs in mylist:
    print("- {}".format(iatom), x[iatom], v[iatom], f[iatom], " : ",  len(neighs), "Neighbors")
    for jatom in neighs:
        if jatom < nlocal:
            print("    *  ", jatom, x[jatom], v[jatom], f[jatom])
        else:
            print("    * [GHOST]", jatom, x[jatom], v[jatom], f[jatom])

Reference: https://lammps.sandia.gov/doc/Python_module.html#lammps.lammps.find_pair_neighlist

I’ll try that out.

For now, here is a Python script without any ASE (a bit cumbersome, since I had to copy some bits from ASE; the convert and wrap code in particular seems to be essential). It sets up the calculation and extracts the forces, and two of them end up quite large (atoms 112 and 184 in Python indices, I believe). I add a dump to the Python commands passed to LAMMPS, and there is also a LAMMPS input file that re-calculates from the dump file created by the Python code and gives what I believe are the correct forces.

eval_no_ase.py (4.4 KB) manual_dump.in (356 Bytes)

While investigating, I discovered another symptom: if, in the above Python (no ASE) script, I call "run 0" multiple times, I get different potential energies (printed to the log and returned by extracting the pe variable through the Python interface). This is correlated with getting different numbers of atom pairs in the printed neighbor list. Also, whether or not I call write_data before the first run 0 makes a big difference. The output below comes from multiple calls to L.command('run 0') interspersed with calls to extract_variable('pe'), gather_atoms('f', ...), and the various gather calls in the suggested neighbor-list extraction, followed by a count of the pairs where iatom or jatom matches the atom with the wrong force or another atom.

Here is some output with the write_data call. Note that the PE keeps changing, although by tiny amounts, in later calls. Also, the max force is only large on the first call.

yes write_data
pe 2076.188168616551
f[112] = [-2.32886949e-01  5.98987526e-12 -2.51709764e-12]
npairs {111: 110, 112: 108}

pe 2075.959112876054
f[29] = [ 1.14539489e-11 -7.85149723e-12 -1.78204118e-11]
npairs {28: 70, 29: 69}

pe 2075.9591128760558
f[29] = [ 1.28181910e-11 -7.64899255e-12 -1.97388772e-11]
npairs {28: 70, 29: 69}

pe 2075.9591128760558
f[29] = [ 1.33439926e-11 -7.65254526e-12 -1.97388772e-11]
npairs {28: 70, 29: 69}

pe 2075.9591128760585
f[29] = [ 1.38413725e-11 -7.65254526e-12 -1.97388772e-11]
npairs {28: 70, 29: 69}

And here it is without write_data. The PE also changes (by tiny amounts) even after multiple calls, and the largest force is >> 0 for the first two calls, not just one.

no write_data
pe 2072.7411060432914
f[5] = [-4.62703984e+01  3.19055893e-12 -7.52709006e-12]
npairs {4: 97, 5: 90}

pe 2076.188168616551
f[112] = [-2.32886949e-01  5.98987526e-12 -2.51709764e-12]
npairs {111: 110, 112: 108}

pe 2075.959112876054
f[29] = [ 1.14539489e-11 -7.85149723e-12 -1.78204118e-11]
npairs {28: 70, 29: 69}

pe 2075.9591128760558
f[29] = [ 1.28181910e-11 -7.64899255e-12 -1.97388772e-11]
npairs {28: 70, 29: 69}

pe 2075.9591128760558
f[29] = [ 1.33439926e-11 -7.65254526e-12 -1.97388772e-11]
npairs {28: 70, 29: 69}

pe 2075.9591128760585
f[29] = [ 1.38413725e-11 -7.65254526e-12 -1.97388772e-11]
npairs {28: 70, 29: 69}

I can't reproduce it on my system. I executed your Python script and used manual_dump.in to re-evaluate the forces. Nothing in particular stands out. I made some minor adjustments to manual_dump.in to get the same potential energy as the source script.

With write_data:

$ python eval_no_ase.py 
f[112] [ 4.48352466e-12  8.10729262e-12 -2.87836421e-12]
max f 1.6659421106747266e-11
pe:  2075.959112876054
pe:  2075.9591128760553
pe:  2075.9591128760594
pe:  2075.9591128760585
pe:  2075.95911287606
pe:  2075.9591128760635
pe:  2075.959112876064
pe:  2075.9591128760635
pe:  2075.959112876063
pe:  2075.9591128760635

Without write_data:

$ python eval_no_ase.py 
f[112] [ 4.82458518e-12  7.44648787e-12 -2.20423679e-12]
max f 1.4540129276255389e-11
pe:  2075.959112876055
pe:  2075.959112876054
pe:  2075.9591128760553
pe:  2075.9591128760594
pe:  2075.9591128760585
pe:  2075.95911287606
pe:  2075.9591128760635
pe:  2075.959112876064
pe:  2075.9591128760635
pe:  2075.959112876063

And this is what I get after run 0 with manual_dump.in (corrected: I added a few missing commands from the source script):

$ lmp -in manual_dump.in_modified
pe: 2075.9591128761635446

All the forces from the Python script and the re-evaluation are on the order of 1e-11, 1e-12, 1e-13; nothing larger than 1e-11. Here is the file I get from your script.
ase_eval.lammps-dump-text (28.3 KB)

So from what I can tell, those small fluctuations are just floating-point noise. You might have too-aggressive compiler and/or optimization settings in your local / CI builds.

This was done with the latest master (09684b77) using a Release cmake build. I also didn’t see much difference with Debug (both gcc 10.2.0, Python 3.9.4).

Btw, you don’t have to do extract_variable("pe"). You can just call get_thermo("pe").
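The size of the pe fluctuations above is consistent with ordinary floating-point summation-order effects. A minimal pure-Python illustration (the per-atom energy values here are made up, and the count of 864 is just a stand-in for a supercell's worth of atoms):

```python
import math
import random

random.seed(12345)
# Hypothetical per-atom energy contributions, chosen so the total is
# roughly the same magnitude as the pe values in the output above.
vals = [random.uniform(2.0, 3.0) for _ in range(864)]

s_fwd = sum(vals)            # accumulate left to right
s_rev = sum(reversed(vals))  # same numbers, opposite order
s_ref = math.fsum(vals)      # correctly rounded reference sum

# The three totals agree to many significant digits but can differ in
# the last few: the same kind of noise seen in the repeated pe values.
print(s_fwd, s_rev, s_ref)
```

Any change in accumulation order (different neighbor list ordering, different ghost-atom layout, vectorized vs. scalar code paths) can shift the total in the last digits, even on a single MPI task.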

Thanks for checking. I do know there’s at least one user on the ASE mailing list who was able to reproduce this, so it’s definitely not just my setup, but I can certainly believe that it’s some subtle optimization thing. It’s easy enough to try with less aggressive optimization.

Can you explain why even your runs give slightly different answers at each call? What is non-deterministic about the process? There isn't even any non-associativity from different parallel sum-reduction orders (at least not in my case, since it's only 1 MPI task).