Minimization serial vs parallel

Hello all

I'm trying to determine the lowest-energy configuration for a precipitate-matrix system (precipitate radius r_p ~ 1 nm; matrix ~7 nm x 7 nm x 7 nm).

In serial I get an ordered precipitate, while in parallel the precipitate is disordered. After playing with input parameters, trying different potentials (adp and eam), min_styles (cg, fire, sd), etc., I noticed that increasing the tolerance for the parallel minimization yields a structure similar to the serial run.

I ran several cases where the only thing that changed was the tolerances: from 1e-15 (as in the serial case) to 1e-8. Judging by the displacements, the parallel run that best matches the serial run is the one with etol=ftol=1e-13, i.e. the parallel tolerance is two orders of magnitude higher than the serial one. Further, if etol=ftol > 1e-14 I do get an ordered precipitate.

My questions are:
Is this behavior to be expected? I would have expected to obtain similar behavior (ordering vs. disordering) for the same input regardless of whether it was run in serial or parallel.

Are the tolerances unnecessarily low? I would imagine they are still within reason.

I assume that some accuracy is lost in the inter-processor communication, but is it enough to reach two very different structures? If so, is it recommended that I continue using the higher tolerance threshold that best matched the serial run?

Lastly, I wanted to mention that I came across an unusual issue on Cray systems (on at least 3 HPCs). I was compiling LAMMPS (latest devel and stable distributions) with Intel compilers. By happenstance, I noticed that if I ran the same input (literally identical in every way except for a comment line and an output-naming variable), I could end up with two different structures. After several emails with HPC support they suggested I use GNU compilers. That seemed to work, but what I find fascinating is that some of the “random” output from the Intel compilation matched the serial runs very well. Comments?

Thanks

Efrain

> Hello all
>
> I'm trying to determine the lowest-energy configuration for a
> precipitate-matrix system (precipitate radius r_p ~ 1 nm; matrix
> ~7 nm x 7 nm x 7 nm).
>
> In serial I get an ordered precipitate, while in parallel the precipitate is
> disordered. After playing with input parameters, trying different potentials
> (adp and eam), min_styles (cg, fire, sd), etc., I noticed that increasing
> the tolerance for the parallel minimization yields a structure similar to
> the serial run.
>
> I ran several cases where the only thing that changed was the tolerances:
> from 1e-15 (as in the serial case) to 1e-8. Judging by the displacements,
> the parallel run that best matches the serial run is etol=ftol=1e-13,
> i.e. the parallel tolerance is two orders of magnitude higher than the
> serial one. Further, if etol=ftol > 1e-14 I do get an ordered precipitate.

actually, those numbers are *smaller* not larger.

> My questions are:
> Is this behavior to be expected? I would have expected to obtain similar
> behavior (ordering vs. disordering) for the same input regardless of
> whether it was run in serial or parallel.

this very much depends on your initial structure. from what you
describe, it seems that your initial structure may sit on a saddle
point between two kinds of minima, and then even the tiniest
differences can push your system toward either of the two options.
since LAMMPS does floating-point math, which is not associative, the
exact values of the per-atom forces depend on the order in which the
forces are summed, and that order can differ for a variety of reasons.
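
to illustrate the non-associativity point, here is a minimal sketch
(the variable names are arbitrary and chosen only for this example)
that can be pasted into any LAMMPS input; both sums contain the same
three numbers, only grouped differently, yet the printed values differ
in the last digits:

  variable a equal 0.1
  variable b equal 0.2
  variable c equal 0.3
  variable lhs equal (v_a+v_b)+v_c     # add a and b first
  variable rhs equal v_a+(v_b+v_c)     # add b and c first
  variable lhsf format lhs %.17g
  variable rhsf format rhs %.17g
  print "(a+b)+c = ${lhsf}"
  print "a+(b+c) = ${rhsf}"

the per-atom force sums in a parallel run are affected in exactly the
same way: a different processor count or grid changes the summation
order and therefore the last bits of the forces.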

you should also keep in mind that minimization algorithms are not
guaranteed to find the global minimum and that, for a sufficiently
rugged and high-dimensional potential hypersurface, the minimizer may
easily get "stuck" in different minima. the only way to address this
would be to generate a sufficiently large number of decorrelated
initial configurations that are well above the potential barriers and
then compare the results. the problem is: it is nearly impossible to
know what a sufficiently large number of starting configurations is
and what is sufficiently high in energy. this is a general problem of
minimizing high-dimensional systems.

> Are the tolerances unnecessarily low? I would imagine they are still
> within reason.

some people even set them to 0.0. please note that there are multiple
criteria that cause a minimization to terminate, and the tolerance is
only one of them.
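
for reference, a minimization in LAMMPS is controlled by a line like
the sketch below (the numbers are only illustrative, not a
recommendation). it stops when the relative energy change drops below
etol, when the length of the global force vector drops below ftol,
when maxiter iterations are done, or when maxeval force evaluations
are done; the minimizer can also stop for other reasons, e.g. when the
line search cannot lower the energy any further. setting etol to 0.0
effectively disables the energy criterion:

  min_style cg
  # minimize etol ftol maxiter maxeval
  minimize  0.0  1.0e-10  10000  100000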

> I assume that some accuracy is lost in the inter-processor communication,
> but is it enough to reach two very different structures? If so, is it
> recommended that I continue using the higher tolerance threshold that best
> matched the serial run?

it is not really a matter of the minimization tolerance. the
correlation you see is not causation.

> Lastly, I wanted to mention that I came across an unusual issue on Cray
> systems (on at least 3 HPCs). I was compiling LAMMPS (latest devel and
> stable distributions) with Intel compilers. By happenstance, I noticed that
> if I ran the same input (literally identical in every way except for a
> comment line and an output-naming variable), I could end up with two
> different structures. After several emails with HPC support they suggested
> I use GNU compilers. That seemed to work, but what I find fascinating is
> that some of the "random" output from the Intel compilation matched the
> serial runs very well. Comments?

see above. the intel compilers apply different optimizations, e.g.
they are usually more thorough at applying vectorization, which leads
to floating-point truncation effects, and those produce the same kind
of small variations in the forces as you get from parallel vs. serial
runs. this whole topic is discussed rather frequently and is no
surprise to people with some knowledge of floating-point math and of
solving high-dimensional systems of coupled differential equations.

what i would recommend is a "stability analysis", i.e. redo your
minimizations after applying several different (small) random
displacements, e.g. via the displace_atoms command (each with a
different RNG seed!), applied particularly *before* starting the
minimization, but also after it. this will allow you to assess how
much of the result is influenced by any symmetries in your initial
(or final) configuration.
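
a minimal sketch of such a test, assuming metal units, an eam/alloy
potential, and a data file containing your starting structure (the
file names, element names, displacement magnitude, and number of
seeds are placeholders to be replaced with your actual setup), could
look like this:

  variable s loop 5
  label loop
    clear
    units           metal
    read_data       precipitate.data           # placeholder starting structure
    pair_style      eam/alloy                  # placeholder potential style
    pair_coeff      * * alloy.eam.alloy A B    # placeholder file and elements
    # small random perturbation, with a different RNG seed on every pass
    displace_atoms  all random 0.01 0.01 0.01 ${s} units box
    min_style       cg
    minimize        1.0e-13 1.0e-13 10000 100000
    write_dump      all custom dump.seed${s}.out id type x y z
  next s
  jump SELF loop

if all runs relax to essentially the same structure, your starting
point is stable; if they scatter between ordered and disordered
precipitates, the serial/parallel differences you see are just the
same sensitivity showing up in another way. note that a loop using
jump SELF generally requires passing the script via the -in switch
rather than redirecting it from stdin.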

axel.

Axel, thanks for your suggestion! While I had considered that my system was just reaching different local minima, I hadn't thought of sampling the space as you suggest.

For peace of mind, allow me to reword my last question. As you point out, I have seen similar questions raised in the past, but my question refers to getting different output when every “input” is the same (computer, compilation, proc grid, input file, etc.).

I understand that floating-point truncation can cause differences to arise (especially across computers), but does it also explain why identical runs yield different outputs? At the risk of being redundant, but for the sake of clarity, here is what I do (all on the same Cray HPC):

  • make LAMMPS with the Intel compilers, creating lmp_exec

  • run lmp.in using lmp_exec on N processors (e.g. aprun -n N lmp_exec < lmp.in), producing dump.1.out

  • change the output-naming variable, keeping everything else in lmp.in the same

  • run lmp.in using lmp_exec on N processors again, producing dump.2.out

Comparing the dumps I see that dump.1.out != dump.2.out.

If I understood correctly, this is to be expected and I should not expect identical output under any circumstances? Or at least not when local minima with very small transition barriers are present? The only thing I can think of is that the Intel compilers are truncating at a lower precision than LAMMPS expects?

Lastly, to add to the confusion, I want to point out that I do observe dump.1.out==dump.2.out for:

  • non-Cray HPCs
  • Cray+gnu compilers (instead of Intel) and
  • my local workstation.

and only see dump.1.out != dump.2.out for:

  • Cray+Intel compilers

Thanks for any clarifications!

Efrain

> Axel, thanks for your suggestion! While I had considered that my system was
> just reaching different local minima, I hadn't thought of sampling the
> space as you suggest.
>
> For peace of mind, allow me to reword my last question. As you point out, I
> have seen similar questions raised in the past, but my question refers to
> getting different output when every "input" is the same (computer,
> compilation, proc grid, input file, etc.).
>
> I understand that floating-point truncation can cause differences to arise
> (especially across computers), but does it also explain why identical runs
> yield different outputs? At the risk of being redundant, but for the sake
> of clarity, here is what I do (all on the same Cray HPC):
>
> * make LAMMPS with the Intel compilers, creating lmp_exec
> * run lmp.in using lmp_exec on N processors (e.g. aprun -n N lmp_exec <
> lmp.in), producing dump.1.out
> * change the output-naming variable, keeping everything else in lmp.in
> the same
> * run lmp.in using lmp_exec on N processors again, producing dump.2.out
>
> Comparing the dumps I see that dump.1.out != dump.2.out.
>
> If I understood correctly, this is to be expected and I should not expect
> identical output under any circumstances? Or at least not when local minima
> with very small transition barriers are present? The only thing I can think
> of is that the Intel compilers are truncating at a lower precision than
> LAMMPS expects?

it is really difficult to make specific yes/no statements based on
such vague descriptions.
essentially, anything that changes the order of execution of floating
point operations can have (small) effects.

but there are other possible causes, too, that can be triggered by
such changes (i.e. by operations that change the order in which data
is stored in memory). these are usually bugs in the code, e.g.
uninitialized variables that are accessed before they are set. when
such variables land in freshly allocated memory, that memory is
initialized to all zeros; however, when they land in memory that has
been used and freed before, it may contain arbitrary byte patterns.
or they are bugs in the compiler, i.e. the compiler creates broken
executables, often when extremely high optimization levels are used
(features like IPO are often very problematic). sometimes it may also
be due to broken hardware, since during testing you may always get
assigned to the same node.

the problem here is that correlation doesn't always mean causation.
we often find bugs by switching compilers, and certain compilers are
more likely to generate code that triggers crashes due to bugs in the
code. on the other hand, certain compilers have a reputation for being
broken more often than others, and certain compiler versions are known
to be sensitive to certain code constructs. however, the latter
usually happens with "unusual" code, e.g. code that makes heavy use of
"modern" features like KOKKOS, or that uses OpenMP or vectorization.
these issues usually become less of a problem with newer compiler
releases as those features mature and more regressions are reported.

> Lastly, to add to the confusion, I want to point out that I do observe
> dump.1.out == dump.2.out for:
> * non-Cray HPCs
> * Cray + GNU compilers (instead of Intel), and
> * my local workstation
>
> and only see dump.1.out != dump.2.out for:
> * Cray + Intel compilers

as explained above, you cannot look at this in such an abstract
fashion. first you need to find out (through a stability/sensitivity
analysis) whether your starting input is in a divergent or in a stable
region of the available phase space.
assuming that you are in a stable region, all
compiler/processor/hardware combinations should lead to pretty much
the same results (within the limitations of the model and of floating
point math); if not, all bets are off and you have to rethink whether
your calculations will lead to meaningful results in the first place.
if you still get divergent results, but only with some compilers, you
should make certain that you have the very latest development code
(best from the git/svn repo) and test with that (if you are on the
path to exposing a bug in LAMMPS, it will only be fixed based on that
source code version). then you should check whether you can get access
to different compiler versions from the same vendor; sometimes an
update makes compiler-related issues go away.

if you still have reproducible problems, try to reduce your input to
the absolute minimal size that still reproduces them (ideally, the
situation can be triggered with runs that take no more than 1 min on a
10-core workstation) and provide the input here on the mailing list,
or post it as an issue on the github issue tracker for LAMMPS.
then we'll try to reproduce and evaluate it.

the situation is a bit tricky, since people with limited experience
in debugging often make wrong assumptions, and most problems reported
here that are attributed to the code turn out to be issues with the
input or its parameters. and even within input parameters, people have
a tendency to assume peripheral reasons rather than fundamental and
elementary ones. yet, every once in a while, the same - easily
dismissed - symptoms can be an indication of a real bug (be it in the
source code, the compiler, or a support library).
especially for a code of the size and complexity of LAMMPS it is
difficult to guarantee that it is bug free. this is all the more true
for less used and less tested contributed code components.

axel.

If you are running the same script on the same # of procs on the same
machine, you should normally get identical answers, assuming no bugs
in the code, as Axel mentioned.

The only exception I can think of at the moment is if you are using a
command that invokes what LAMMPS calls “irregular communication”.
Examples are the load balancer (fix balance) or fix deform (when it
flips the box). For this operation, each proc can send its atoms to a
variety of other procs, and all procs do this at the same time. The
order in which received atoms arrive can be randomly different on each
proc, which means that in two runs atoms might end up ordered
differently on a particular step. This will lead to subsequent
round-off divergence of the trajectories.

However, I doubt you are using either of those commands in a
minimization.

Steve