Building LAMMPS on NEC-SX

Dear all,

I was wondering if anyone has experience building LAMMPS for an NEC SX supercomputer, because I have been having problems with it. Essentially, I haven't had any problems with other pair potentials, but with the AIREBO potential the force calculations eventually become NaN when atoms are bonded in chains longer than two atoms. I am slowly attempting to weed out the problem, but there is no interactive login capability on the SX nodes; I have to submit a job to the developer queue and wait for it to print something. I could not get Valgrind to compile for this system, but I could coerce FFTW and gmake to build by patching their configure scripts.

The problem is likely due to overzealous optimizations, but peculiarly, all optimization options lead to this behavior except "no-optimization", which also leaves me with a 150 MB binary executable (it barely fits in my home directory). That means even "safe" optimizations cause this strange behavior.

Interestingly, this does not cause LAMMPS to crash: it continues to run all time steps, so perhaps it is simply a conversion to print format that is going wrong instead. I had expected LAMMPS to crash with NaN forces, but perhaps this behavior is expected. The version of LAMMPS I am using is a git checkout from Jan 15, 2013 (0c292ed533831f2b4298c2656f587a65b9b596e5).
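
For localizing where the first NaN shows up, I have something like the following in mind (a crude debugging aid of my own, not anything from LAMMPS itself; f and nlocal are the usual per-atom force array and local atom count in a pair style):

    #include <cstdio>

    // portable NaN test: NaN is the only value that compares unequal to
    // itself (beware: aggressive "fast math" flags can break this test)
    static inline bool is_nan(double x) { return x != x; }

    // debugging aid: report the first local atom whose force is NaN
    static void find_first_nan(double **f, int nlocal, const char *where)
    {
      for (int i = 0; i < nlocal; i++)
        if (is_nan(f[i][0]) || is_nan(f[i][1]) || is_nan(f[i][2])) {
          std::printf("%s: first NaN force on local atom %d\n", where, i);
          return;
        }
    }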

I have also noticed that if I build LAMMPS with gcc on the interactive front-end (that binary does not run on the SX system), it runs faster than the cross-compiled version on the supercomputer, and of course it doesn't produce NaN forces.

I just wanted to know if anyone has experience patching LAMMPS to work on an unfriendly system like this, or if you have any suggestions on where I may not yet have looked. I will gladly contribute any patches if I figure this out.

Best,
Derek Thomas

> Dear all,
>
> I was wondering if anyone has experience building LAMMPS for an NEC SX
> supercomputer, because I have been having problems with it. Essentially, I

not for LAMMPS, but for other (fortran) codes. in essence, it was a major PITA.

[...]

> The problem is likely due to overzealous optimizations, but peculiarly, all
> optimization options lead to this behavior except "no-optimization", which

not so much "optimization" but "vectorization". the problem is that the
LAMMPS code isn't written in a way that vectorizes well. for simple
code constructs, the compiler can often safely detect data dependencies,
but particularly the AIREBO code is pretty messy (i've had a close look
when adding multi-threading to it, where i needed to deal with similar issues).
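
e.g. a typical pair-style inner loop looks schematically like this
(not the actual LAMMPS code; pair_force() is only a placeholder for
the real math):

    // schematic pair-style inner loop, y/z components omitted for brevity
    static inline double pair_force(double dx, double dy, double dz)
    {
      return dx + dy + dz;   // stands in for the actual force expression
    }

    static void accumulate(double **x, double **f, const int *jlist,
                           int jnum, int i)
    {
      for (int jj = 0; jj < jnum; jj++) {
        int j = jlist[jj];   // indirect index from the neighbor list
        double fpair = pair_force(x[i][0] - x[j][0],
                                  x[i][1] - x[j][1],
                                  x[i][2] - x[j][2]);
        // indirect store: the compiler cannot prove that f[j] never
        // aliases f[i] or an f[j] from a previous iteration, so it
        // must not vectorize this loop without unsafe assumptions.
        f[i][0] += fpair;  f[j][0] -= fpair;
      }
    }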

> also leaves me with a 150 MB binary executable (it barely fits in my home
> directory). That means even "safe" optimizations cause this strange
> behavior.

yup. not unexpectedly.

> Interestingly, this does not cause LAMMPS to crash: it continues to
> run all time steps, so perhaps it is simply a conversion to print format
> that is going wrong instead. I had expected LAMMPS to crash with NaN forces,
> but perhaps this behavior is expected. The version of LAMMPS I am using is a

yup. this is standard IEEE-754 math behavior. there may be a compiler
flag to turn this into a core dump instead. in most cases that would add
overhead, so it is not done by default (the only exceptions i know of
are the DEC Alpha and Sun UltraSPARC families of CPUs).
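
on linux with glibc you can get the trap explicitly, e.g. like this
(a glibc extension, so it won't help on the SX itself, but it is handy
for catching the first NaN in the front-end build):

    // glibc-specific: turn NaN-producing operations into a fatal SIGFPE,
    // which gives you a core dump to inspect. build with: g++ fptrap.cpp
    #define _GNU_SOURCE
    #include <fenv.h>
    #include <cstdio>

    int main()
    {
      // trap invalid operations (e.g. 0/0), division by zero, and overflow
      feenableexcept(FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW);
      volatile double zero = 0.0;
      double bad = zero / zero;    // now raises SIGFPE instead of yielding NaN
      std::printf("%g\n", bad);    // never reached
      return 0;
    }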

> git checkout from Jan 15, 2013 (0c292ed533831f2b4298c2656f587a65b9b596e5).

> I have also noticed that if I build LAMMPS with gcc on the interactive
> front-end (that binary does not run on the SX system), it runs faster than
> the cross-compiled version on the supercomputer, and of course it doesn't
> produce NaN forces.

yup. that is what i would expect as well, since there is not going to be
efficient vectorization. in the old times, they wouldn't even let you come
near a machine like the earth simulator with non-vector optimized code.

> I just wanted to know if anyone has experience patching LAMMPS to work
> on an unfriendly system like this, or if you have any suggestions on where I
> may not yet have looked. I will gladly contribute any patches if
> I figure this out.

you'll have to rewrite large parts of the LAMMPS kernels to get good
vectorization. not sure how easy it would be for AIREBO in particular.
there are a few people currently pondering options for how to implement
this in LAMMPS in a way that will be flexible enough so that it can
be used on multiple compute infrastructures with somewhat similar
requirements (GPU, Intel Phi, multi-core CPUs with long vector registers
(AVX, AVX2)), but don't hold your breath.
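
to illustrate the kind of restructuring i mean: the classic vector-machine
rewrite of the schematic loop above splits it into two passes. again only
a sketch, with the same placeholder force function:

    // sketch: split the loop into a vectorizable compute pass with only
    // contiguous stores, plus a separate scalar scatter pass
    static inline double pair_force(double dx, double dy, double dz)
    {
      return dx + dy + dz;   // stands in for the actual force expression
    }

    static void accumulate_split(double **x, double **f, const int *jlist,
                                 int jnum, int i, double *ft)
    {
      // pass 1: indirect loads (gathers) are fine on a vector machine;
      // all stores are contiguous, so the compiler can vectorize safely
      for (int jj = 0; jj < jnum; jj++) {
        int j = jlist[jj];
        ft[jj] = pair_force(x[i][0] - x[j][0],
                            x[i][1] - x[j][1],
                            x[i][2] - x[j][2]);
      }
      // pass 2: scalar scatter of the precomputed contributions
      for (int jj = 0; jj < jnum; jj++) {
        int j = jlist[jj];
        f[i][0] += ft[jj];  f[j][0] -= ft[jj];  // x component only, for brevity
      }
    }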

perhaps somebody else has some additional comments.
i would suggest making a deal with somebody doing e.g. climate
modeling and swapping the CPU time on the vector machine for
time on a regular linux cluster. that'll make things easy for everybody.

ciao,
    axel.

>> The problem is likely due to overzealous optimizations, but peculiarly, all
>> optimization options lead to this behavior except "no-optimization", which

> not so much "optimization" but "vectorization". the problem is that the
> LAMMPS code isn't written in a way that vectorizes well. for simple
> code constructs, the compiler can often safely detect data dependencies,
> but particularly the AIREBO code is pretty messy (i've had a close look
> when adding multi-threading to it, where i needed to deal with similar
> issues).

Yes, I had figured out that most of LAMMPS doesn't vectorize well, but when I
set the cross-compiler to optimize without vectorization, the same problem
persists. The AIREBO code is quite a beast to behold. This makes me wonder
whether there is more to the story here. On the other hand, the
cross-compiler may simply be set up to ignore my request for
non-vectorized optimization, but I somehow doubt that.

>> Interestingly, this does not cause LAMMPS to crash: it continues to
>> run all time steps, so perhaps it is simply a conversion to print format
>> that is going wrong instead. I had expected LAMMPS to crash with NaN forces,
>> but perhaps this behavior is expected. The version of LAMMPS I am using is a

> yup. this is standard IEEE-754 math behavior. there may be a compiler
> flag to turn this into a core dump instead. in most cases that would add
> overhead, so it is not done by default (the only exceptions i know of
> are the DEC Alpha and Sun UltraSPARC families of CPUs).

Thanks for the clarification. I should have considered the overhead;
it didn't even occur to me.

> yup. that is what i would expect as well, since there is not going to be
> efficient vectorization. in the old times, they wouldn't even let you come
> near a machine like the earth simulator with non-vector optimized code.

That's what I feared, but it's helpful to know that you expected it; now I
know better.

> you'll have to rewrite large parts of the LAMMPS kernels to get good
> vectorization. not sure how easy it would be for AIREBO in particular.
> there are a few people currently pondering options for how to implement
> this in LAMMPS in a way that will be flexible enough so that it can
> be used on multiple compute infrastructures with somewhat similar
> requirements (GPU, Intel Phi, multi-core CPUs with long vector registers
> (AVX, AVX2)), but don't hold your breath.

I think that restructuring the LAMMPS core to be vectorizable would be a
good goal for the future of LAMMPS. However, this is tricky because it
would also make the code less maintainable, not to mention that different
architectures may require per-architecture dispatch, e.g. `#ifdef SX` (see
the sketch below). I'll certainly look into how this could be achieved. If
anyone is interested in discussing this further, I'd be glad to talk.
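
To make concrete what I mean, a compile-time dispatch could look like this
(purely hypothetical: the SX macro and the two kernel variants are made up,
following Axel's schematic loops above):

    // hypothetical compile-time dispatch between a readable reference
    // kernel and a vector-friendly variant; the SX macro is something
    // we would define ourselves in the cross-compile makefile
    void compute_forces(double **x, double **f, const int *jlist,
                        int jnum, int i, double *ft)
    {
    #ifdef SX
      accumulate_split(x, f, jlist, jnum, i, ft);  // restructured variant
    #else
      accumulate(x, f, jlist, jnum, i);            // reference version
      (void) ft;                                   // unused in scalar path
    #endif
    }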

> perhaps somebody else has some additional comments.

> i would suggest making a deal with somebody doing e.g. climate
> modeling and swapping the CPU time on the vector machine for
> time on a regular linux cluster. that'll make things easy for everybody.

Thanks for the advice; I'll focus on using Linux clusters for my current
work. Luckily, I have several options available. I was hoping to gain access
to a large number of cores, but unfortunately my current setup has 8-core
nodes and seems to have very high communication overhead, which has so far
limited me to one node per job. Hopefully I can find a way to distribute the
work better and gain an extra node, but so far the benchmarks are not very
good. I'm not sure whether USER-OMP might help me out with this or not.
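
If I do give USER-OMP a try, I expect the invocation to look roughly like
this (just a sketch from my reading of the docs; the binary name lmp_sx and
the thread counts are made-up examples, while package omp and the -sf omp
suffix switch are documented LAMMPS features):

    # early in the input script:
    package omp 4

    # 2 MPI ranks x 4 OpenMP threads on one 8-core node:
    env OMP_NUM_THREADS=4 mpirun -np 2 ./lmp_sx -sf omp -in in.airebo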

Cheers,
Derek Thomas