unexpected behavior with fix npt after restart (UNCLASSIFIED)

CLASSIFICATION: UNCLASSIFIED

Hello,

I'm trying to debug a complicated simulation and have created the following reduced testcase:

---BEGIN FILE---
units lj
atom_style atomic
lattice fcc 0.8442
region box block 0 4 0 4 0 4
create_box 1 box
create_atoms 1 box
mass 1 1.0
velocity all create 0.1 87287

pair_style lj/cut 2.5
pair_coeff 1 1 1.0 1.0 2.5
neigh_modify every 1 delay 0 check no
#fix 1 all nvt temp 0.1 0.8 1.0
fix 1 all npt temp 0.1 0.8 1.0 iso 0.0 0.0 10.0
#fix 1 all nve
dump 1 all custom 1000 dump.melt.* id type xs ys zs
dump_modify 1 sort id

run 1000
unfix 1
#fix 1 all nvt temp 0.8 0.8 1.0
fix 1 all npt temp 0.8 0.8 1.0 iso 0.0 0.0 10.0
#fix 1 all nve
run 1000
write_restart tmp.restart.*
run 4000

clear
read_restart tmp.restart.2000
neigh_modify every 1 delay 0 check no
#fix 1 all nvt temp 0.8 0.8 1.0
fix 1 all npt temp 0.8 0.8 1.0 iso 0.0 0.0 10.0
#fix 1 all nve
dump 1 all custom 1000 dump.meltR.* id type xs ys zs
dump_modify 1 sort id

run 4000
---END FILE---

The input creates a simple crystal lattice of Lennard-Jones particles, runs for 1000 steps heating them up, runs for another 1000 steps at constant T, writes a restart file, runs for 4000 steps, clears the simulation, reads the restart, re-applies fix npt, and runs for another 4000 steps. It writes some dumps along the way. Neighbor lists are rebuilt every single time step. I ran this with both the latest stable LAMMPS (June 2019) and a February 2018 version, built with 'make serial' in either case, on a RHEL7 machine with gcc.

My expectation is that at timestep 6000, the files dump.melt.6000 and dump.meltR.6000 will be identical. Yet they are not. On my system, I see a difference (in box dimensions) in the step 4000 dump files, and many differences in the step 6000 dump files. If I use the commented-out fix nvt or nve in place of all the npt fixes, then the step 6000 dump files are identical. If I use fix npt in each place but comment out the two neigh_modify commands, the results at step 6000 are also identical. A significantly shorter test case does not show the same error -- a certain number of time steps (or perhaps neighbor builds...) needs to occur before the two series of dump files show a difference. Appending "start 1000 stop 6000 pre no" to the last 3 run commands does not make the step 6000 dump files identical.

I'm wondering what the cause of the observed difference between the original run and the restarted run is, and whether it can be fixed. My understanding (and my limited inspection of fix_nh.cpp) is that the restart should pick up the internal state of the thermostat/barostat from the restart file, given the same fix ID. And fix nvt works fine. I tried tinkering with some fix npt options, but none had an effect on the example problem. Is it perhaps related to how the simulation cell dimensions are stored in the restart file? That would surprise me, I guess, but I'm having a hard time seeing why fix nvt would not display this issue while fix npt does (if the barostat's internal variables are restored from the restart file as anticipated). I searched the lammps-users archives but did not quickly find a use case identical to what I'm describing above (the most relevant thread was about continuing a single run with changing temperature).

Thanks,

Brian


it is not as unexpected as you might think. there are multiple factors at play here.

the major issue is floating point math. it is not associative, and thus the order of operations can yield slightly different results. while you are controlling the neighbor list builds (which result in a reordering of atoms), you don't control the atom sorting (atoms are periodically reordered in memory so that physical proximity of atoms correlates with proximity of their data in memory, which improves cache efficiency). this issue could only be avoided by using fixed point math.
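to illustrate (a minimal sketch in Python, since the effect is a property of IEEE-754 double precision math, not of LAMMPS or any particular code):

```python
# floating point addition is not associative: the same three values,
# summed in a different order, give different IEEE-754 double results
a = (0.1 + 0.2) + 0.3   # 0.6000000000000001
b = 0.1 + (0.2 + 0.3)   # 0.6
print(a == b)           # False
```

so whenever atom sorting (or a changed neighbor list order) changes the order in which per-atom contributions are accumulated, the last bits of the result can change.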

the second issue is that pressure/stress at normal conditions is a sum of numbers of varying magnitude and alternating sign. this means a loss of precision in the computed result; in short, a property like pressure is subject to more numerical noise than, say, temperature. this is made worse by how the variable-cell propagation in fix npt is performed: all atoms are converted to fractional coordinates, then the cell dimensions are propagated/adjusted, and then all atoms are converted back to absolute coordinates. the consequence is that all the numerical noise in the computed pressure is propagated into all coordinates, which in turn leads to larger noise in the positions (compared with fix nvt and fix nve) and thus even larger noise in the pressure.
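the cancellation effect can be sketched in a few lines (a deliberately extreme caricature in Python, not LAMMPS' actual virial code: large, nearly cancelling contributions of alternating sign swallow the small residual that carries the physics):

```python
import math

# stand-in for pressure-like contributions: large magnitude, alternating
# sign, with a small physically meaningful residual (here: 1.0)
terms = [1e100, 1.0, -1e100]

naive = sum(terms)        # left-to-right: the 1.0 is absorbed and lost
exact = math.fsum(terms)  # compensated summation keeps it
print(naive, exact)       # 0.0 1.0
```

a positive-definite quantity like the kinetic energy (and hence temperature) has no such cancellation, which is one way to see why fix nvt is less sensitive.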

so, to me, it is quite expected to see a divergence eventually, and to see that this divergence is significantly more pronounced with fix npt.
thus i am doubtful whether it would be possible with LAMMPS to have a perfect continuation after a restart compared to an uninterrupted run.

axel.


Hi Axel and all,

Fair enough. I'm very aware of floating point math issues, but I suppose I had hoped that for the same run, on a single CPU core, things could be processed in a deterministic manner such that results could be reproduced. I admit I did not test compiler flags (e.g. optimization levels) before writing my original mail; I can do that soon. If I run that test case twice, the respective dumps at step 6000 from the two different invocations of LAMMPS are identical. So we know that LAMMPS results can, in principle, be exactly reproduced. The particular case here involves reproducing a run when initializing from a restart file written during the same invocation of LAMMPS.

If the consensus is that LAMMPS will never be able to address my test case, then I can work around it and move on. But if there are any differing opinions I would be curious to hear them.

Thanks,

Brian

as i mentioned before, if you also turn off atom sorting (atom_modify sort 0 0.0) and run on a single CPU, then there is a chance that starting from a restart will give the exact same trajectory as the uninterrupted run. comparing an uninterrupted run vs. a restarted trajectory, i would still expect some divergence eventually, and faster with fix npt than with others. whether that divergence happens can depend on encountering an intermediate result that is what is called a "denormalized number", i.e. a floating point number that is extremely close to zero, but not quite zero. that is one way that noise, i.e. random bits, can enter into computations.
this is rare, but it is more likely for 32-bit x86 executables, where the standard (non-SSE) x87 floating point unit is used by default, which performs extended precision floating point operations. this behavior can be changed with compiler flags (gcc has an option -ffloat-store that forces values held in extended precision FP registers to be rounded to 64-bit precision by storing them to and re-reading them from memory after every use, i.e. with a performance penalty). for 64-bit x86 executables the SSE FPU is used, which operates at 64-bit precision and thus offers fewer opportunities to generate numerical noise.
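for illustration, this is what a denormalized (subnormal) number looks like — a quick check in Python, whose floats are IEEE-754 doubles. note this only shows the 64-bit case; the x87 extended precision behavior described above cannot be reproduced portably from Python:

```python
import sys

tiny = sys.float_info.min   # smallest *normal* double, about 2.2e-308
sub = tiny / 4.0            # below the normal range: a subnormal number

print(sub > 0.0)            # True -- "extremely close to zero, but not quite"
print(sub < tiny)           # True -- smaller than any normal number
print(sub * 4.0 == tiny)    # True -- powers of two keep this example exact
```

the powers of two are chosen so every operation here is exact; with arbitrary values, arithmetic on subnormals loses precision, which is how the "random bits" mentioned above can appear.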

axel.