Image flags in restart files

sjplimp · November 6, 2013, 3:07pm

Please post to the list, not just to me.

I don’t know why this is happening, since I can’t reproduce it.
Maybe Axel will have an idea. You are running this on
a single proc, on what machine? Is there anything different
about the machine, e.g. is its endian order for binary data
different than most other machines?

Steve

akohlmey · November 6, 2013, 3:26pm

> I don't know why this is happening, since I can't reproduce it.

neither can i.

> Maybe Axel will have an idea. You are running this on

i would suspect a compiler issue. if you can install an alternate GCC
compiler, it would be helpful to know whether the issue carries over.

> a single proc, on what machine? Is there anything different

it would be helpful to have the output of

uname -a
cat /etc/issue
gcc -v 2>&1 | tail -1

Chris_Brackley · November 7, 2013, 2:41pm

I think I've now found a way to resolve the problem. It does seem to be a compilation issue, and I've found that changing some of the compiler options fixes the problem.

I've been running on an intel x86_64 machine; output of requested commands:
$ uname -a
Linux frontend04 2.6.32-220.23.1.el6.x86_64 #1 SMP Mon Jun 18 09:58:09 CDT 2012 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/issue
Scientific Linux release 6.2 (Carbon)
Kernel \r on an \m
$ gcc -v 2>&1 | tail -1
gcc version 4.4.6 20110731 (Red Hat 4.4.6-3) (GCC)

and I've been compiling using the provided Makefile.openmpi without any modification. I also tried compiling with gcc 4.4.7, and with MPICH2 instead of OpenMPI (and combinations of those), and always got the same bug. Compiling with the Intel C++ compiler and MPICH using Makefile.linux does not reproduce the error; compiling without MPI using Makefile.serial with the same gcc versions also does not reproduce it.

I noticed that there are several compiler optimisation options in the provided Makefile.openmpi, and after some experimentation found that removing the -funroll-loops option resolved the problem. I don't know much about these optimisation options, so I don't have any idea why this could cause the problem, or whether it's a general issue or specific to my machine set-up. Also, this makefile worked fine for earlier versions of LAMMPS (I tried 21Feb2013).

Thanks for your help,
Chris

akohlmey · November 7, 2013, 2:52pm

yes. this is the kind of issue that i was suspecting, and that is
why i was asking for the details. it only affects the gcc 4.4.x that
is shipped with RHEL 6.x or CentOS 6.x, and only when you use -O2
with several additional flags, particularly -funroll-loops.

it seems to miscompile this piece of code:

buf[m] = 0.0; // for valgrind
*((tagint *) &buf[m++]) = image[i];

in the various AtomVec classes. the older version, which was working
correctly, didn't have the line with '// for valgrind'.

i suspect we'll be better off to have valgrind complain and compilers
not miscompile this until somebody has the time to re-implement this
piece using a proper union that will remove the need for the
legal-but-somewhat-unusual code in the second line.

axel.
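
For readers wondering what the union-based approach mentioned above might look like, here is a minimal sketch. It is not the LAMMPS implementation: the names `IntOrDouble`, `pack_int` and `unpack_int` are hypothetical, and the sketch assumes the integer is widened to 64 bits so that every byte of the buffer slot is written, which would make the separate zeroing line unnecessary. Reading the union member that was not written last is documented as supported by GCC, but ISO C++ does not guarantee it.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not LAMMPS code: overlays a 64-bit integer on a
// double so an integer can travel inside a double-typed pack buffer.
union IntOrDouble {
  double d;
  int64_t i;
  explicit IntOrDouble(double arg) : d(arg) {}
  explicit IntOrDouble(int64_t arg) : i(arg) {}
};

static_assert(sizeof(double) == sizeof(int64_t),
              "sketch assumes 8-byte double and 8-byte integer");

// Store `value` in slot m of buf and return the next free slot.
// All 8 bytes of the slot are written, so no prior "buf[m] = 0.0" is needed.
inline int pack_int(double *buf, int m, int64_t value) {
  buf[m] = IntOrDouble(value).d;
  return m + 1;
}

// Recover the integer previously stored in slot m of buf.
inline int64_t unpack_int(const double *buf, int m) {
  return IntOrDouble(buf[m]).i;
}
```

With helpers like these, the two lines quoted above would collapse into something like `m = pack_int(buf, m, image[i]);`, with no pointer cast. A `std::memcpy` between the integer and the buffer slot would be the strictly portable alternative if relying on the compiler's union behaviour is not acceptable.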