Argument N of dump command affects results

Dear all,

Earlier I ran a simulation using LAMMPS (10 Feb 2015) with an input
containing a line like the following:

dump 7 all atom 30000 out.lammpstrj

As the interval between dumps is quite large, I wanted to run the same
simulation, but with a smaller interval between dumps. I modified the
line to read

dump 7 all atom 300 out.lammpstrj

However, I now get results that diverge from the original simulation (at
the beginning the atom positions etc. are similar, but the differences
keep growing). If I rerun the simulation using the original dump
interval, I get exactly the same result as in the original simulation.
Is this expected behaviour? I'm surprised that just changing the dump
interval has an effect on the simulation itself. Is there any way to
avoid this effect?

I'm running on a Linux system using 24 cores (4 by 1 by 6 MPI processor
grid).

are you sure it is only the dump frequency that you have changed, and
not also things like the number of processors, LAMMPS version, compiler
version, or hardware?
as has been discussed regularly on this mailing list, MD simulations
are chaotic, i.e. the tiniest change will eventually lead to an
exponential divergence of trajectories. since LAMMPS uses
floating-point math, which is not associative, any change in the order
of operations can lead to such small differences. this order can be
altered by changing the number of processors, by small changes in the
source code, by compiler optimizations, and sometimes even by tiny
differences between different hardware.
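
A minimal standalone C++ illustration of the non-associativity point
(the particular numbers are arbitrary; the only claim is that the two
summation orders round differently):

// Floating-point addition is not associative, so the order in which
// per-atom or per-processor contributions are summed can change the
// last bits of the result.
#include <cstdio>

int main() {
    double a = 1.0e16, b = -1.0e16, c = 1.0;

    double left  = (a + b) + c;   // cancellation first, then c survives
    double right = a + (b + c);   // c is absorbed into b and lost

    std::printf("(a + b) + c = %.17g\n", left);   // prints 1
    std::printf("a + (b + c) = %.17g\n", right);  // prints 0
    return 0;
}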

if you look this up in a proper MD and/or stat mech textbook, you
should see that comparing such diverging trajectories is actually a
good thing, as you should obtain the same statistical mechanical
averages, and thus the same thermodynamic data, from either of those
trajectories. there are methods like PRD (parallel replica dynamics)
that take advantage of this and actively decorrelate trajectories via
randomization; they reduce the time to solution by running multiple
decorrelated MD trajectories side by side and thus improve the sampling
of the statistical mechanical ensemble.

axel.
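
The combination of "tiny perturbation, exponential divergence, same
averages" can be illustrated with a toy chaotic system; the sketch
below uses a 1D logistic map, so it is neither MD nor the actual PRD
algorithm, and all parameters are arbitrary choices:

// Toy illustration with a chaotic 1D map (not molecular dynamics):
// two trajectories whose initial conditions differ by 1e-12 separate
// exponentially fast, yet their long-run time averages agree closely.
#include <cmath>
#include <cstdio>

// Logistic map in a chaotic parameter regime.
static double step(double x) { return 3.9 * x * (1.0 - x); }

int main() {
    double x1 = 0.3;
    double x2 = 0.3 + 1e-12;      // tiny perturbation, e.g. a different rounding
    double sum1 = 0.0, sum2 = 0.0;
    const long nsteps = 1000000;

    for (long n = 1; n <= nsteps; ++n) {
        x1 = step(x1);
        x2 = step(x2);
        sum1 += x1;
        sum2 += x2;
        if (n % 10 == 0 && n <= 60)   // watch the separation grow
            std::printf("step %4ld  |x1 - x2| = %.3e\n", n, std::fabs(x1 - x2));
    }
    // The individual trajectories are completely decorrelated by now,
    // but the time averages (the "observable") agree statistically.
    std::printf("average 1 = %.4f\naverage 2 = %.4f\n",
                sum1 / nsteps, sum2 / nsteps);
    return 0;
}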

Thanks. I re-ran the simulation a few times with identical input files,
LAMMPS version, etc., and it seems that the result is indeed random, so
it's not the dump command that affects the result. However, from a
scientific point of view it would be nice to have reproducible
simulations. Is there any way to achieve this easily, even if it means
slowing down the simulations somewhat?

some suggestions:

a) i'd claim that you are mixing up reproducible with repeatable. even
if you don't get identical trajectories, you should reproduce the same
thermodynamic properties in the ensemble average. that said, under
perfectly identical circumstances, you should get identical
trajectories.

b) please try with the very latest patch level. there is an ongoing
effort to improve the code quality and to catch causes of
inconsistencies like the one you describe that stem from programming
oversights. since last summer, we have squashed literally hundreds of
small inconsistencies and coding oversights.

c) if this persists, please post a small(!) and easy-to-reproduce test
case, so that we can investigate further. without knowing the details
of your input it is difficult to say how repeatable your trajectories
should be.

axel.

If you’re running the same executable on the same # of procs on the
same machine, and not using any command which introduces randomness
(there are a few), then you shouldn’t get non-deterministic results. If
you think you meet those criteria, then please post your input script.

Steve

I'm using a cluster, so in fact sometimes the job was spread over
several nodes, and it turns out there are even two different types of
nodes (with different CPUs). So this may explain the difference. But it
seems that there is only a small number of different possible
trajectories, so I restarted the simulation a few times until I ended
up with the same trajectory as the original simulation (since the
divergence starts very early, it's quick to see whether you have the
correct trajectory or not).

From now on I will try to record the exact configuration each simulation
ends up running on, so that it's easier to repeat a simulation if
needed. Would it perhaps make sense to add this kind of information to
the default LAMMPS output, or at least print it optionally? The MPI
processor grid is already printed out; maybe optionally print out the
names of the actual MPI nodes used, software versions, etc.?
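
The node names can also be recorded outside of LAMMPS. As a sketch
(standard MPI calls only; it assumes an MPI C++ compiler such as mpicxx
and that it is launched inside the same job allocation), a small helper
that reports which host each rank runs on:

// Sketch: print which host each MPI rank runs on, so that the hardware
// layout of a job can be recorded alongside the simulation output.
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, nprocs = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char name[MPI_MAX_PROCESSOR_NAME];
    int len = 0;
    MPI_Get_processor_name(name, &len);

    std::printf("rank %d of %d on host %s\n", rank, nprocs, name);

    MPI_Finalize();
    return 0;
}

Launched from the same batch script as the LAMMPS run, it records the
actual node layout of each job.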

it is difficult to comment on this, as your description remains vague
and you don't provide any inputs and matching differing outputs to
compare. as steve already mentioned, certain flags and settings in
LAMMPS can promote or cause divergence of trajectories, e.g. when
using a different number of CPUs. often the divergence of trajectories
is accelerated by aggressive settings for the simulation parameters
(e.g. the time step) and/or by loose convergence settings for kspace
or constraint solvers.

some of what you ask for is possible already using the command
"info config" in your LAMMPS input, some is better done from within
your job submit script where the information is readily available, and
some could be added (e.g. compiler version and flags). however, it
will never be complete, as there are far too many permutations of
factors that may have an impact. if you want repeatable simulations
beyond, say, several thousand time steps, you'd need to write an MD
code that uses fixed-point math (e.g. through scaled integers).

axel.
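
A minimal sketch of the scaled-integer idea (the scale factor and the
example numbers are arbitrary choices): converting contributions to
64-bit integers at a fixed scale makes the accumulation exact and
therefore independent of summation order, unlike the double-precision
sum.

// Sketch of fixed-point accumulation via scaled 64-bit integers: integer
// addition is associative (absent overflow), so the total does not depend
// on the order in which contributions are summed.
#include <cstdint>
#include <cstdio>
#include <cmath>

// Fixed scale: 2^32 fractional bits (an arbitrary choice for this sketch;
// it trades some precision for bit-for-bit reproducible sums).
static const double SCALE = 4294967296.0;

static std::int64_t to_fixed(double x)        { return (std::int64_t) std::llround(x * SCALE); }
static double       to_double(std::int64_t i) { return (double) i / SCALE; }

int main() {
    double contrib[3] = { 1.0e8, -1.0e8, 1.0e-6 };   // toy "contributions"

    // Double-precision sums depend on the order of operations ...
    double d1 = (contrib[0] + contrib[1]) + contrib[2];
    double d2 = contrib[0] + (contrib[1] + contrib[2]);

    // ... while the fixed-point sums are identical in either order.
    std::int64_t f1 = (to_fixed(contrib[0]) + to_fixed(contrib[1])) + to_fixed(contrib[2]);
    std::int64_t f2 = to_fixed(contrib[0]) + (to_fixed(contrib[1]) + to_fixed(contrib[2]));

    std::printf("double:      %.17g vs %.17g\n", d1, d2);
    std::printf("fixed-point: %.17g vs %.17g\n", to_double(f1), to_double(f2));
    return 0;
}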

Thanks, I didn't know about this command. There seems to be a bug in the
documentation, as on the page
http://lammps.sandia.gov/doc/Section_commands.html the "info" command is
only listed under "3.5. Individual commands" but not under "3.4.
Commands listed by category".

I have observed similar issues when changing the thermo output frequency
from a restart. I will try to build a small test case to send in.

Nigel

This issue is really about non-deterministic behavior. In general, LAMMPS simulations are deterministic. As Steve said “If you’re running the same executable on the same # of procs on the same machine, and not using any command which introduces randomness (there are a few), then you shouldn’t get non-deterministic results.” Even simulations that introduce (pseudo)randomness normally do so in a deterministic way, by using a user-specified PRNG seed.

So, if anyone is observing non-deterministic behavior, this is most likely due to non-deterministic hardware e.g. a compute environment that allocates jobs to two different kinds of processors, as was already mentioned. Other more arcane possibilities exist, such as non-deterministic storage allocation combined with numerical precision that depends on storage location. And then there is flat-out buggy hardware or system software. Only if these possibilities can be eliminated should you start considering the possibility that LAMMPS itself contains some non-deterministic code e.g. a race condition between MPI processes.

Aidan

The category section is not all-inclusive; just representative of
different kinds of commands.

Steve

Section 3.4 does start by stating "This section lists all LAMMPS commands, grouped by category. The next section lists the same commands alphabetically."

There seem to be two problems:

  1. Section 3.4 does not list all LAMMPS commands (as is claimed in the text)

  2. The next section does not list the same commands (as is claimed in the text)

Perhaps the text at the beginning of section 3.4 should be reworded as “This section lists most LAMMPS commands, grouped by category. The next section provides a complete alphabetical list.”

The text at the beginning of section 3.5 also needs to be modified so that it no longer claims that section 3.4 lists the same commands.