USER-OMP package out of range atoms

Hi lammps-users,

I am trying to run simulations using the USER-OMP package. When I don't make use of OMP (without suffix omp), everything works.
My jobs are running on nodes with 8 cores, so what I did was (I hope) pretty much following the documentation, and submitted a job with 2 MPI tasks and 4 OpenMP threads each:

#!/bin/bash
#$ -N RH095
#$ -cwd
#$ -pe ompi 8
#$ -l h_rt=01:00:00

echo $PE $PE_HOSTFILE $NSLOTS $NHOSTS $HOME
cat $PE_HOSTFILE

mpirun -V -m $PE_HOSTFILE -x $OMP_NUM_THREADS=4 -np 2 $HOME/bin/lmp_linuxOMP -log log.OMP2by4 -in in.calcForcesOMP
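a side note on the mpirun line above: with the leading $, the shell expands $OMP_NUM_THREADS before mpirun ever sees it, so Open MPI's -x flag most likely does not receive the intended name=value pair. a minimal demonstration of the expansion (echo stands in for mpirun here; this is an illustration, not part of the original job script):

```shell
# OMP_NUM_THREADS is set to 4 in the environment for this demonstration
export OMP_NUM_THREADS=4

# with the leading $, the shell substitutes the current value first,
# so mpirun would receive "-x 4=4" instead of the variable name:
echo -x $OMP_NUM_THREADS=4

# without the $, mpirun receives the name=value pair it expects:
echo -x OMP_NUM_THREADS=4
```

the intended form for Open MPI's environment export is `-x OMP_NUM_THREADS=4`, without the `$`.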

The problem is that the job runs for a while and after a few hundred steps it dies due to out of range atoms.
The complete output is here: http://pastebin.com/GSZcWcbg

So my question is what causes this error. Since the job runs smoothly in the beginning, I guess my setup is correct.

Thanks in advance,

Nikita

Hi lammps-users,

I am trying to run simulations using the USER-OMP package. When I don't make
use of OMP (without suffix omp), everything works.
My jobs are running on nodes with 8 cores, so what I did was (I hope) pretty
much following the documentation, and submitted a job with 2 MPI tasks and
4 OpenMP threads each:

#!/bin/bash
#$ -N RH095
#$ -cwd
#$ -pe ompi 8
#$ -l h_rt=01:00:00

echo $PE $PE_HOSTFILE $NSLOTS $NHOSTS $HOME
cat $PE_HOSTFILE

mpirun -V -m $PE_HOSTFILE -x $OMP_NUM_THREADS=4 -np 2 $HOME/bin/lmp_linuxOMP
-log log.OMP2by4 -in in.calcForcesOMP

The problem is that the job runs for a while and after a few hundred steps
it dies due to out of range atoms.
The complete output is here: http://pastebin.com/GSZcWcbg

So my question is what causes this error. Since the job runs smoothly in the
beginning, I guess my setup is correct.

most likely a bug in one of the /omp styles.
there are very many of them now and despite
the best efforts, bugs crawl into them every
once in a while.

are you running the latest USER-OMP code from my github
repository? most likely, pppm/omp has a problem. i ran into
something similar myself last night that indicates that pppm/omp
may have a bug. i have not yet been able to identify the exact
cause of that one. please try with plain pppm instead.

thanks,
    axel.

Hi Axel,

Thanks for the quick reply. I’m running the Nov09 version of LAMMPS (downloaded today or actually yesterday, to be correct). If I don’t use suffix omp with the exact same binary, everything works like a charm. So the problem has something to do with using the omp package I guess.

Cheers,

Nikita

Hi Axel,

hi nikita,

Thanks for the quick reply. I'm running the Nov09 version of LAMMPS
(downloaded today or actually yesterday, to be correct). If I don't use

there is currently a _big_ difference in OpenMP support between the
official LAMMPS version and my development tree that i have on my
github account. the github version is _much_ more efficient in terms
of threading support, so if you are serious about getting the best parallel
efficiency, you should use the github version.

suffix omp with the exact same binary, everything works like a charm. So the
problem has something to do with using the omp package I guess.

yes. and as i said, it is likely in the pppm/omp style.
i am seeing some issues with it on my local machine,
but have not been able to identify them.

so rather than using -sf omp, please just add
/omp to all supported styles except pppm and
see if it runs correctly. or if you want to be
more lazy, just change pppm into pppm/cg
and then try -sf omp again. it should run
correctly, at least with the github version.
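for illustration, the swap axel suggests is a one-line change in the LAMMPS input script; the accuracy value below is a placeholder, not taken from nikita's actual input:

```
# before: plain pppm, the style suspected of carrying the bug
# kspace_style pppm 1.0e-4

# after: the cg variant, which should then be safe to run with -sf omp
kspace_style pppm/cg 1.0e-4
```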

unfortunately, steve has been too busy recently
to check and include the (many) threading related
changes that have accumulated in my tree.

cheers,
    axel.

quick update. the bug in pppm/omp is found and fixed.

git://github.com/akohlmey/lammps-omp.git

axel.

hi axel,

thanks, i’ll get my hands on it as soon as possible!
the reason i want to use omp is that i have a system with a very inhomogeneous distribution of atoms which (i believe) makes the computational time on ‘normal’ lammps unacceptably high. so i am hoping for some performance boost with the omp package.

anyway, thanks again. i’ll report on the results i get

nikita

hi axel,

compiling your version, i have issues with read_data - it seems not to extract information from my data file:

LAMMPS (9 Nov 2011)
using 1 OpenMP thread(s) per MPI task
reset 4 OpenMP thread(s) per MPI task
Scanning data file …
0 = max bonds/atom
0 = max angles/atom
ERROR on proc 0: Needed topology not in data file (read_data.cpp:1298)

i tested the data file with a previous version - it works. apparently something already goes wrong during scanning (0=max bonds/atom…)

i also tried adding

0 dihedrals
0 impropers

as suggested in the documentation; however, it had no effect.
thanks,

nikita

hi axel,

compiling your version, i have issues with read_data - it seems not to
extract information from my data file:

there are no OpenMP related changes around the read_data command
in my version of the code, so it is hard to see where the problem is.
have you tried any of the example inputs that are provided with LAMMPS?
if they don't work, something is wrong with your compilation, e.g. the
compiler may have miscompiled the code. if they do work, please post
a small (and complete!) input example, so that it can be debugged.

thanks,
     axel.

hi axel,

i ran the version i compiled from your code on the micelle example, without suffix omp.

the full error message is here: http://pastebin.com/m1cPyuTv
the LAMMPS error happens at line 28.

my startup script: http://pastebin.com/BGwmLndB
removing “-x $OMP_NUM_THREADS=4” does not change the behavior.

and the makefile i use: http://pastebin.com/ARyBSJ1b
essentially this is the same as what i used for my other compilations with GPU and CUDA, except i have the -openmp flag added.

if i diff the read_data.cpp files in your version and in the Nov 9 version, there are differences between them. however, my experience with c++ is more or less nonexistent, so it's hard for me to judge whether these differences do anything severe. i also tried merging files from your version with Nov9, but this only led to a huge mess during compilation.

regards,

nikita

Hi Axel

just wanted to add that I had similar problems (i.e. reading the rhodopsin example) when I tried out your repository version. But since I am busy preparing my defense I did not have enough time to explore that further, so I was going to wait until either whatever you submitted to Steve is published, or I have more time next week.

Cheers
Christian

-------- Original Message --------

thanks, i cannot reproduce this error. from your make file
it appears that you are using an intel compiler. which
version is it? since the intel "composer" version 12.x has
gained a bit of a reputation of miscompiling code, i would
suspect that this is due to the compiler. whether the compiler
or the code is correct is still to be determined. we had
cases in the past where LAMMPS code was not always
100% in agreement with the standard, but those would only
be exposed when compiler vendors changed their default
settings for which degree of standard conformance vs. legacy
behavior they expect.

can you try to compile with a different compiler?
i currently don't have access to intel 12.x, so it
is difficult for me to track this down.

cheers,
    axel.

hi christian,

Hi Axel

just wanted to add that I had similar problems (i.e. reading the rhodopsin example) when I tried out your repository version. But since I am busy preparing my defense I did not have enough time to explore that further, so I was going to wait until either whatever you submitted to Steve is published, or I have more time next week.

thanks for the confirmation. this is strange, though. the changes in read_data
are only supposed to replace error->all() with error->one() to not lose error
messages on parallel input errors. i remember you stating that you use intel
compilers, is that correct for this one, too? i am now wondering why they would
miscompile one variant of the code and not the other, even though they are
for all practical purposes functionally equivalent. hm...

axel

hi,

i was using icc 11.1
i also tried the portland group compiler, but it only threw an enormous number of errors.

now i compiled axel's source with gcc 4.1.2. read_data works with this.
thanks for the hint. it’s sort of funny…

but still, when i use the omp package, i get errors, though i believe these are due to the simulation setup.
for example, the crack example works without suffix omp, but with suffix omp this happens: http://pastebin.com/Nq7b7XfX
with my other simulations i get the shake determinant<0 error.

is there anything i should do differently when using the omp package compared to ‘normal’?

regards,

nikita

Hi Axel

yes I was using the Intel compiler as backend as well.

Christian

-------- Original Message --------

Hi Axel

yes I was using the Intel compiler as backend as well.

thanks for the info.

i've recompiled with intel, reproduced the issue,
and narrowed it down to one subroutine. this should help to
identify the problem and apply a correction.

i will post a note, when this is done.

thanks,
   axel.

hi,

i was using icc 11.1

ok. i've been able to identify the source of the problem
with the intel compilers and reverted the change until
i find a way to re-introduce it so that it gets compiled
correctly. my github repo code should work again.

i also tried the portland group compiler but this one only threw an
enormous amount of errors.

please find attached a script that will convert
all OpenMP pragmas in a way that PGI compilers
should be able to compile them without barking
at you. please note that this will increase the
overhead for every encountered OpenMP region.
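the attached script itself is not preserved in the archive; a plausible sketch of such a conversion (an assumption about its contents, not the actual 180-byte script) would rewrite the default(none) clauses that PGI's OpenMP front end rejects:

```shell
# hypothetical sketch: replace default(none), which PGI rejects, with
# default(shared) in all C++ sources and headers in the current
# directory; keeps a .orig backup of each modified file
for f in *.cpp *.h; do
  [ -e "$f" ] || continue
  sed -i.orig -e 's/default(none)/default(shared)/g' "$f"
done
```

whether this matches the real script is a guess; the actual workaround may differ in detail.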

now i compiled axel's source with gcc 4.1.2 . read_data works with
this.
thanks for the hint. it's sort of funny...

i don't think it is funny, but it is rather sad.
compilers should compile correct code correctly
and tell you when code is not correct. sadly,
there are different interpretations of standards
and features that are more or less thoroughly
tested. when choosing a feature, one thus always
has to weigh the benefits against the
drawbacks, and sometimes those are not
as obvious as in the case of the PGI compilers.

but still, when i use the omp package, i get errors, though i believe
these are due to the simulation setup.
for example, the crack example works without suffix omp, but with
suffix omp this happens: http://pastebin.com/Nq7b7XfX

this should not happen and i cannot reproduce it.
in general, the crack example is a case where energies
can be somewhat different, because rounding errors
can affect the group definitions. however, this would
apply to a lammps binary consistently across MPI and
OpenMP parallelization, so you should get the same
result when running without or with -sf omp and OMP_NUM_THREADS
set to 1, 2, 4, and so on, respectively. this is how i
test. the explicit specification of suffix and package
omp is no longer required when using my development code.

possibly, the gcc compiler you use has a buggy OpenMP
implementation. i have only been testing in detail
with gcc 4.4.x and gcc-4.5.x and (occasionally) intel 11.1
so far. i did one cross check against PGI, Cray and
PathScale compilers where i found the PGI OpenMP issue
with the "default(none)" clause i mentioned above.

with my other simulations i get the shake determinant<0 error.

that is also an indication of incorrect forces, but
i cannot tell how you would get those. for me the
crack example works, even if - through using MPI -
the group definition is a bit different.

is there anything i should do differently when using the omp package
compared to 'normal'?

no. if you find a difference in any case, please let me know.
there are now almost 250 adapted/modified files; there
is always the possibility of a few glitches.

thanks,
    axel.

hack_openmp_for_pgi.sh (180 Bytes)

hi axel,

thank you very much for your help, i compiled the new source with intel compilers and everything seems to work perfectly now. unfortunately i did not get the expected performance increase; my simulations ran two to three times slower than with common LAMMPS, so i'll stick to the original version.

i attached a small sketch of my system so you can imagine what i am working on. i don't want to send around all my input files as they are a few megabytes in size. so the system consists of a wall and a spherical particle, both covered with some water molecules. and only the waters and OH groups are allowed to move during the simulation. since the atom distribution is rather inhomogeneous, i thought the OMP package would help to distribute the atoms better among the cpu cores.

nevertheless, i think it was worth it. i think soon i will get some similar systems to simulate - i’ll try the OMP package there too.

anyway, thank you very much again for the great support,

nikita

schematic.png

hi axel,

thank you very much for your help, i compiled the new source with
intel compilers and everything seems to work perfectly now.
unfortunately i did not get the expected performance increase; my
simulations ran two to three times slower than with common
LAMMPS, so i'll stick to the original version.

this cannot be. if there is OpenMP support for all the potentials in
your system, the worst i have found was that all-OpenMP speed was half
of what all-MPI would give. are you sure that you are using a suitable
processor distribution via the processors keyword? are you sure you are
not using processor affinity, which would force all OpenMP threads to
run on the same physical core?
have you checked on which part of the
calculation LAMMPS spends the most time?
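a quick way to test the affinity point (a Linux-specific illustration, not something from the original exchange): the kernel exposes the set of CPUs a process may run on, so running this inside the job would show whether all threads are confined to one core:

```shell
# print the CPUs this process is allowed to run on; a full range
# (e.g. 0-7 on an 8-core node) means no restrictive affinity mask,
# while a single core listed here would pin all OpenMP threads together
grep Cpus_allowed_list /proc/self/status
```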

i attached a small sketch of my system so you can imagine what i am
working on. i don't want to send around all my input files as they are
a few megabytes in size. so the system consists of a wall and a
spherical particle, both covered with some water molecules. and only
the waters and OH groups are allowed to move during the simulation.
since the atom distribution is rather inhomogeneous, i thought the OMP
package would help to distribute the atoms better among the cpu
cores.

there are a number of things that need to be done to
get good performance for a system like this, regardless
of whether you are using OpenMP parallelization or not.
if you don't mind, please send a working input to me
personally (and how you are running it). i'd like to see
for myself how this system behaves and what way would be
the best to run this kind of setup.

nevertheless, i think it was worth it. i think soon i will get some
similar systems to simulate - i'll try the OMP package there too.

without having somebody experienced look at your input,
i would not give up that easily. there may be additional
options to speed up the calculation that you have not
thought about. LAMMPS is quite flexible for "weird" systems.

axel.

wow, this is more support than i could ever wish :)
i’ll send you the files in just a moment.

thanks,

nikita