Axel, thanks for your suggestion! While I had considered that my system was
just reaching different local minima, I didn't think of sampling space as
you suggest.
For peace of mind, allow me to reword my last question. As you point out, I
have seen similar questions raised in the past, but my question refers to
different output when every "input" is the same (computer, compilation, proc
grid, input file, etc).
I understand that floating-point truncation can cause differences to arise
(especially across computers), but does it also explain why identical runs
yield different outputs? At the risk of being redundant, but for the sake of
clarity, here is what I do (all within the same Cray HPC):
* make LAMMPS with Intel compilers, creating lmp_exec
* run lmp.in using lmp_exec with N processors (e.g. aprun -n N lmp_exec <
lmp.in) and produce dump.1.out
* change the output naming variable, keeping everything else within lmp.in
the same
* run lmp.in using lmp_exec with N processors and produce dump.2.out
Comparing the dumps I see that dump.1.out != dump.2.out.
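To make "dump.1.out != dump.2.out" more concrete, here is a minimal sketch that reports how large the numeric differences are, rather than only whether the files differ. It assumes plain-text dumps with whitespace-separated columns and the same atom ordering in both files; the function name max_numeric_diff is hypothetical and not part of LAMMPS.

```python
import math


def max_numeric_diff(lines_a, lines_b):
    """Return the largest absolute difference between matching numeric tokens.

    Returns math.inf if the two inputs differ in structure (line count,
    token count, or a non-numeric token that does not match).
    """
    if len(lines_a) != len(lines_b):
        return math.inf
    worst = 0.0
    for line_a, line_b in zip(lines_a, lines_b):
        tokens_a, tokens_b = line_a.split(), line_b.split()
        if len(tokens_a) != len(tokens_b):
            return math.inf
        for tok_a, tok_b in zip(tokens_a, tokens_b):
            try:
                diff = abs(float(tok_a) - float(tok_b))
            except ValueError:
                # Non-numeric tokens (e.g. header keywords) must match exactly.
                if tok_a != tok_b:
                    return math.inf
                continue
            worst = max(worst, diff)
    return worst


if __name__ == "__main__":
    import sys

    # Usage (file names as in the steps above):
    #   python compare_dumps.py dump.1.out dump.2.out
    if len(sys.argv) == 3:
        with open(sys.argv[1]) as fa, open(sys.argv[2]) as fb:
            print(max_numeric_diff(fa.readlines(), fb.readlines()))
```

A result near machine precision early in the run and a growing value in later snapshots would point to accumulated round-off rather than a different setup.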
If I understood correctly, this is to be expected, and I should not count
on identical output under any circumstances? Or at least not where local
minima with very small transition barriers are present? The only other
explanation I can think of is that the Intel compilers truncate at a lower
precision than LAMMPS expects?
it is really difficult to make specific yes/no statements based on
such vague descriptions.
essentially, anything that changes the order of execution of floating
point operations can have (small) effects.
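as a small illustration of the order-of-operations effect (a generic sketch in Python, not LAMMPS code; the helper name sum_left_to_right is made up for this example): floating point addition is not associative, so summing the same numbers in a different order can change the result.

```python
def sum_left_to_right(values):
    """Add the values strictly in the order given, in double precision."""
    total = 0.0
    for v in values:
        total += v
    return total


# Grouping matters: the two sides differ in the last bit.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False

# Ordering matters: the same three numbers, summed in two different orders.
print(sum_left_to_right([1e16, 1.0, -1e16]))  # 0.0
print(sum_left_to_right([1e16, -1e16, 1.0]))  # 1.0
```

in a parallel run, a similar reordering can happen whenever contributions are accumulated in a different sequence from one run to the next.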
but there are other possible causes, too, that can be triggered by
such changes (i.e. operations that change the order in which data is
stored in memory).
these are usually bugs in the code, e.g. variables that are read
before they are initialized. when they are assigned to freshly
allocated memory, that memory will usually be initialized to all zeros.
however, when they are assigned to memory that has been used and freed
before, it may contain different byte patterns.
or those are bugs in the compiler, i.e. the compiler creates broken
executables, often when extremely high optimization levels are used
(features like IPO are often very problematic). sometimes, it may also
be due to broken hardware, as during testing, you may always get
assigned to the same node.
the problem here is that correlation doesn't always mean causation.
we often find bugs by switching compilers and certain compilers are
more likely to generate code that triggers crashes due to bugs in the
code. on the other hand, certain compilers have a reputation for being
broken more often than others. certain compiler versions are known to
be sensitive to certain code constructs. however, the latter usually
happens with "unusual" code, e.g. that makes heavy use of "modern"
features, like KOKKOS or that uses OpenMP or vectorization. these
usually become less of an issue with newer compiler releases as those
features mature and more regressions are reported.
Lastly, to add to the confusion, I want to point out that I do observe
dump.1.out==dump.2.out for:
* non-Cray HPCs
* Cray+gnu compilers (instead of Intel) and
* my local workstation.
and only see dump.1.out != dump.2.out for:
* Cray+Intel compilers
as explained above, you cannot look at this in this abstract fashion.
first you need to find out (through stability/sensitivity analysis)
whether your starting input is in a divergent or in a stable section
of your available phase space.
assuming that you are in a stable area, then all
compilers/processor/hardware combinations should lead to pretty much
the same results (within the limitations of the model and floating
point restrictions). if not, all bets are off and you have to rethink
whether your calculations will lead to meaningful results in the first
place.
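to illustrate the difference between a stable and a divergent regime, here is a toy sketch that uses the logistic map instead of an MD system. the function names, the starting value, and the size of the perturbation are arbitrary choices for this example and have nothing to do with LAMMPS itself. the idea is the same as in a sensitivity analysis: perturb the starting point by a round-off sized amount and watch whether the two trajectories stay together.

```python
def logistic_trajectory(x0, r, steps):
    """Iterate x -> r * x * (1 - x) and return the list of visited values."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


def max_separation(x0, r, steps=200, eps=1e-15):
    """Largest gap between trajectories started at x0 and at x0 + eps."""
    a = logistic_trajectory(x0, r, steps)
    b = logistic_trajectory(x0 + eps, r, steps)
    return max(abs(p - q) for p, q in zip(a, b))


# r = 2.5: both trajectories settle onto the same fixed point (stable).
print("stable regime:   ", max_separation(0.3, 2.5))

# r = 4.0: chaotic regime, the tiny perturbation grows to order one.
print("divergent regime:", max_separation(0.3, 4.0))
```

in the stable case the separation stays at round-off level; in the divergent case it grows until the two trajectories are unrelated, even though the "input" differs only in the last digits.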
if you still get divergent results, but only with some compilers, you
should make certain that you have the very latest development code
(best from the git/svn repo) and test with that (if you are on the
path to expose a bug in LAMMPS, it will only be fixed based on that
source code version). then you should see whether you can get access
to different compiler versions from the same vendor. sometimes an
update can make compiler-based issues go away.
if you still have reproducible problems, try to reduce your input to
the absolute minimal size that still reproduces them (ideally, you
can trigger the situation with runs that take no more than 1 min on a
10 core workstation) and provide the input here on the mailing list,
or post it as an issue on the github issue tracker for LAMMPS.
then we'll try reproducing and evaluating the problem.
the situation is a bit tricky, since people with limited experience in
debugging often make wrong assumptions, and most problems reported
here that were attributed to the code tend to be issues in input or
parameters. and even within input parameters, people have a tendency
to assume peripheral reasons rather than fundamental and elementary
ones. yet, every once in a while, the same - easily dismissed -
symptoms can be an indication of a real bug (be it in the source code,
the compiler, or in support libraries).
especially for a code of the size and complexity of LAMMPS it is
difficult to guarantee it is bug free. this is all the more true for
less used and less tested contributed code components.
axel.