error when running lammps

Hello All,

The help desk of the cluster where I submit my jobs built a LAMMPS executable that includes the packages I need, such as MEAM. The version is the newest one, 17Jun2013. When I submit my job to the cluster, the script starts without any problem and runs until step 500 of the energy minimization; then the job crashes with the following error:

lmp:50265 terminated with signal 11 at PC=e819e3 SP=7fff05c749c0. Backtrace:
lmp(_ZN9LAMMPS_NS12WriteRestart5writeEPc+0xd3)[0xe819e3]
lmp(_ZN9LAMMPS_NS6Output5writeEl+0x33f)[0x88a87f]
lmp(_ZN9LAMMPS_NS8Minimize7commandEiPPc+0x184)[0x819554]
lmp(_ZN9LAMMPS_NS5Input15execute_commandEv+0x28f6)[0x7ec836]
lmp(_ZN9LAMMPS_NS5Input4fileEv+0x202)[0x7e8312]
lmp(main+0x94)[0x801d54]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x384b81ecdd]
lmp(_ZN9LAMMPS_NS6Output5writeEl+0x33f)[0x88a87f]
lmp(main+0x94)[0x801d54]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x384b81ecdd]
lmp(main+0x94)[0x801d54]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x384b81ecdd]
lmp[0x51e7a9]
lmp(_ZN9LAMMPS_NS8Minimize7commandEiPPc+0x184)[0x819554]
lmp(main+0x94)[0x801d54]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x384b81ecdd]
lmp[0x51e7a9]
lmp(_ZN9LAMMPS_NS5Input15execute_commandEv+0x28f6)[0x7ec836]
lmp(main+0x94)[0x801d54]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x384b81ecdd]
lmp[0x51e7a9]

I don't think anybody will be able to help you with just the output
you provided.
You're going to have to give us more.

* Compile a version of LAMMPS with debug symbols (-g)
* Test if running in serial gives you the same problem
* Run with valgrind (can be done in parallel as well) to debug memory issues
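
A minimal sketch of how the serial and valgrind checks listed above might be scripted. The names ./lmp_debug and input.txt are placeholders, the rank count of 4 is arbitrary, and nothing is executed unless RUN=1 is set. The build step is left out because it depends on the cluster's makefile (typically it means adding -g to the compiler flags there).

```shell
#!/bin/sh
# Placeholders (assumptions): adjust to the actual binary and input script.
LMP=${LMP:-./lmp_debug}
INPUT=${INPUT:-input.txt}

# Print a command; execute it only when RUN=1 is set in the environment.
step() {
    echo "+ $*"
    if [ "${RUN:-0}" = 1 ]; then "$@"; fi
}

debug_plan() {
    # 1. serial run
    step "$LMP" -in "$INPUT"
    # 2. same input in parallel
    step mpirun -np 4 "$LMP" -in "$INPUT"
    # 3. parallel under valgrind, one log per process (%p expands to the PID)
    step mpirun -np 4 valgrind --track-origins=yes \
        --log-file=valgrind.%p.log "$LMP" -in "$INPUT"
}

debug_plan
```

Running it as is only prints the three commands; RUN=1 sh script.sh would actually launch them.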

Well maybe someone knows if there is an issue with write_restart in
that version.

without debug info that correlates the stack trace below to the
specific source file, it is impossible to track this down.
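
As an illustration of what correlating a stack trace with the source can look like, here is a small sketch using the binutils tools c++filt and addr2line. The binary path ./lmp in the example call is a placeholder, and addr2line only reports file and line when the executable was built with -g.

```shell
#!/bin/sh
# Turn a mangled C++ symbol from the backtrace into a readable name.
demangle() {
    printf '%s\n' "$1" | c++filt
}

# Map a frame address to function name and file:line.
# $1 = executable built with -g, $2 = address taken from the backtrace.
resolve() {
    addr2line -e "$1" -f -C "$2"
}

# Demangle the top frame of the trace if c++filt is installed.
if command -v c++filt >/dev/null 2>&1; then
    demangle _ZN9LAMMPS_NS12WriteRestart5writeEPc
fi
# Example call, not run here: resolve ./lmp 0xe819e3
```

The demangled top frame reads LAMMPS_NS::WriteRestart::write(char*), which already tells which function crashed, but not which line.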

please check if you have the same issue, if you add "write_restart" to
any of the input examples shipped with LAMMPS.
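
One possible way to set up that test, sketched as a helper that copies an example input and appends a write_restart line. The restart file name test.restart is an arbitrary placeholder, and the example path in the comment is an assumption about the layout of the LAMMPS source tree.

```shell
#!/bin/sh
# $1 = an example input shipped with LAMMPS, $2 = modified copy to create.
# The leading newline guards against inputs that lack a final newline.
with_restart() {
    { cat "$1"; printf '\n%s\n' 'write_restart test.restart'; } > "$2"
}
# Example call, not run here (path is an assumption):
#   with_restart examples/melt/in.melt in.melt.restart
```

The modified copy can then be run in serial and in parallel with the same executable that crashes.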

lmp:50265 terminated with signal 11 at PC=e819e3 SP=7fff05c749c0.
Backtrace:
lmp(_ZN9LAMMPS_NS12WriteRestart5writeEPc+0xd3)[0xe819e3]
lmp(_ZN9LAMMPS_NS6Output5writeEl+0x33f)[0x88a87f]
lmp(_ZN9LAMMPS_NS8Minimize7commandEiPPc+0x184)[0x819554]
lmp(_ZN9LAMMPS_NS5Input15execute_commandEv+0x28f6)[0x7ec836]
lmp(_ZN9LAMMPS_NS5Input4fileEv+0x202)[0x7e8312]
lmp(main+0x94)[0x801d54]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x384b81ecdd]
lmp(_ZN9LAMMPS_NS6Output5writeEl+0x33f)[0x88a87f]
lmp(main+0x94)[0x801d54]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x384b81ecdd]
lmp(main+0x94)[0x801d54]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x384b81ecdd]
lmp[0x51e7a9]
lmp(_ZN9LAMMPS_NS8Minimize7commandEiPPc+0x184)[0x819554]
lmp(main+0x94)[0x801d54]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x384b81ecdd]
lmp[0x51e7a9]
lmp(_ZN9LAMMPS_NS5Input15execute_commandEv+0x28f6)[0x7ec836]
lmp(main+0x94)[0x801d54]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x384b81ecdd]
lmp[0x51e7a9]

As we discussed this problem, the error is the segmentation fault error
related to the memory management. Is this error solvable?

it should be, but there are many possible sources. the first step is
always to try and narrow down under which conditions this is
reproducible.

axel.

Dear Axel,

Here is the error file with the debug option:

lmp_debug:52149 terminated with signal 11 at PC=3064924988 SP=7fff60fdb5a8. Backtrace:

lmp_debug:52151 terminated with signal 11 at PC=3064924988 SP=7fffbc001158. Backtrace:
/lib64/libc.so.6[0x3064924988]
/lib64/libc.so.6[0x3064924988]
lmp_debug(_ZN9LAMMPS_NS12WriteRestart5writeEPc+0x172)[0x17a778a]
lmp_debug(_ZN9LAMMPS_NS12WriteRestart5writeEPc+0x172)[0x17a778a]
lmp_debug(_ZN9LAMMPS_NS6Output5writeEl+0x786)[0xa66cf2]
lmp_debug(_ZN9LAMMPS_NS6Output5writeEl+0x786)[0xa66cf2]
lmp_debug(_ZN9LAMMPS_NS3Min3runEi+0x281)[0x9c475d]
lmp_debug(_ZN9LAMMPS_NS3Min3runEi+0x281)[0x9c475d]
lmp_debug(_ZN9LAMMPS_NS8Minimize7commandEiPPc+0x3b1)[0x9d8d9d]
lmp_debug(_ZN9LAMMPS_NS8Minimize7commandEiPPc+0x3b1)[0x9d8d9d]
lmp_debug(_ZN9LAMMPS_NS5Input15execute_commandEv+0x1a24)[0x9a36b4]
lmp_debug(_ZN9LAMMPS_NS5Input15execute_commandEv+0x1a24)[0x9a36b4]
lmp_debug(_ZN9LAMMPS_NS5Input4fileEv+0x57a)[0x9a0a14]
lmp_debug(_ZN9LAMMPS_NS5Input4fileEv+0x57a)[0x9a0a14]
lmp_debug(main+0xed)[0x9bd919]
lmp_debug(main+0xed)[0x9bd919]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x306481ecdd]
lmp_debug[0x541de9]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x306481ecdd]
lmp_debug[0x541de9]

this is useless, since it doesn't correlate the stack trace with the source code.

The problem is write_restart, however when I use it with some other codes
which don't include MEAM potential, lammps doesn't crash at all. I guess the
problem is with write_restart files when MEAM potential is used. Any idea?

it is impossible to say anything without suitable information. either
you debug it yourself, i.e. generate a stack trace.

or you put together the smallest possible (complete) input that
reproduces this issue and post it here. also you have to explain
exactly how to reproduce it, whether it only happens in parallel runs,
or also when running in serial etc.
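
A small sketch of how the serial versus parallel comparison could be recorded so that it can be reported precisely. The helper name report is made up for this example, and the binary, input file and rank counts in the comments are placeholders.

```shell
#!/bin/sh
# Run a command, keep its output in <label>.log, and print its exit status.
report() {
    label=$1; shift
    status=0
    "$@" > "$label.log" 2>&1 || status=$?
    echo "$label exit=$status"
}
# Example calls, not run here:
#   report serial ./lmp_debug -in input.txt
#   report np4    mpirun -np 4 ./lmp_debug -in input.txt
#   report np16   mpirun -np 16 ./lmp_debug -in input.txt
```

Posting the resulting one-line summaries together with the input makes it clear under which conditions the crash shows up.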

without sufficient information it requires a crystal ball or a psychic
(or both) to find out what is going on.

axel.

Hi,

I used gdb and could run the script with no problem in serial. I used:

gdb PATH/lmp_debug
run -in input.txt

I believe the problem pops up when I execute the script in parallel. I couldn’t find out how to debug the script in parallel. Could anyone help me do that?

Thanks,
Iman.

please step back for a moment and look at what you are asking for:

- you don't provide any tangible info that allows to reproduce what
you are seeing, in fact, you are more unspecific about it than most.
the politics of supporting users of open source software are simple:
if you want help, you have to make it easy to help you. nobody here
gets paid to service you, so how can you expect somebody to volunteer
his time, if you make it difficult to help you?

- you are asking for help in debugging in general and not specifically
LAMMPS and this is not the forum for it. there are tutorials and
courses that you can attend to learn this (i have taught quite a few
of them during my career); there are user support people in computing
centers that (sometimes) can teach and/or help you (and they *do* get
paid to help their users).

- you ask that somebody will do something for you that is effectively
your problem and for which you have shown to be extremely
uncooperative in helping to track it down. what kind of a person would
engage in such a matter? most certainly not a very smart one. which
begs the question, why would you even want help from such a person.

so rather than shooting another e-mail from the hip, i ask you to think
about what you are asking for (and what kind of value this represents;
and to you in person specifically) and then reconsider how you are
going to go about this. this policy of "lets just ask for it anyway,
perhaps i get lucky" is rather irritating and becoming a waste of
everybody's time.

a mailing list has to be a give and take. it looks like you only want
to take and not give anything.

axel.

Dear Axel,

I don’t know why you are so judgmental! I have encountered a problem related to LAMMPS and I want to check if this is something like a bug in the code.

I honestly don’t want anybody to work on my problem. The only thing I am asking is the help, a tip or some guidance! At the end of the day, this is my problem. There is always a chance that somebody else had this problem before and he/she could help me with that!

For sure, if I can help anybody in the forum, I would do.

I have searched the mailing list for the possible way to solve my problem but I didn’t succeed. In addition, I, as many others in forum, am not as experienced as some of you are in these types of stuff. As a result, I want to get help from those which are more experienced in some areas than I am.

Iman.

Dear Axel,

I don't know why you are so judgmental! I have encountered a problem related

because this is how i am. i call out stuff how i see it.

to LAMMPS and I want to check if this is something like a bug in the code.

that is fine. i am always happy to track down a bug. yet you are
denying the very information that allows to do this.

I honestly don't want anybody to work on my problem. The only thing I am
asking is the help, a tip or some guidance! At the end of the day, this is

you *have* been given tips. as has been pointed out several times to
you. it is _not_ possible to provide additional help without you
providing sufficient information, which you have conveniently ignored.

my problem. There is always a chance that somebody else had this problem
before and he/she could help me with that!

a gazillion people have had segmentation faults. those are *very*
unspecific. it is pointless to go on that alone.

For sure, if I can help anybody in the forum, I would do.

I have searched the mailing list for the possible way to solve my problem
but I didn't succeed. In addition, I, as many others in forum, am not as
experienced as some of you are in these types of stuff. As a result, I want

yet that doesn't entitle you to act as if people that do have that
experience are obliged to solve your problem without you following
their requests for more details. it is this attitude that is
particularly upsetting.

to get help from those which are more experienced in some areas than I am.

as i explained before, if you *do* want help, you have to *provide*
the means to reproduce the error elsewhere. you have been told to do
so now several times, yet you failed to comply.

*this* is why i am yelling at you. is this so hard to understand?
so "put up or shut up"

axel.

Again this is how you see it! it doesn’t mean this is a right way!

I debugged the code, and no problem occurred in serial. I can’t give you any stack trace because there was no problem running it!!! I think giving my code to somebody to work on, is the last resort! isn’t it!!!

Now I am looking for a way to debug it in parallel! BTW, I am not yelling at you and not insulting you! this is a scientific forum! please be polite.

BTW, I am feeling not good writing these emails to you! you might answer this one in a worse style but this is the last email I post in your reply in the forum.

Iman.

Again this is how you see it! it doesn't mean this is a right way!

no. it is just my opinion.

I debugged the code, and no problem occurred in serial. I can't give you any
stack trace because there was no problem running it!!!!! I think giving my
code to somebody to work on, is the last resort! isn't it?!!!

no. on the contrary, this is how it is usually done. please look
through the archives and read through previous discussions where we
have tracked down bugs. in the best case, people narrow it down to the
bare essentials.

Now I am looking for a way to debug it in parallel! BTW, I am not yelling at
you and not insulting you! this is a scientific forum! please be polite.

BTW, I am feeling not good writing these emails to you! you might answer
this one in a worse style but this is the last email I post in your reply in
the forum.

no worries. i have said all i have to say.

axel.

The problem is write_restart, however when I use it with some other codes which don’t include MEAM potential, lammps doesn’t crash at all. I guess the problem is with write_restart files when MEAM potential is used. Any idea?

It’s unlikely that MEAM has anything to do with the issue,
b/c the MEAM potential doesn’t write anything to the restart file.

I suggest you post a simple-as-possible script (with a data
file if needed) that crashes on some small number of procs
for you. This is what Axel is asking for.

If the developers can reproduce the problem, it is often
easy to fix.

Steve

Dear Steve,

Thanks for the reply,

I have attached the code with the necessary files for the MEAM potential. I am also working on the problem using totalview debugger.

Regards,
Iman.

meam,test,restart,parallel.txt (958 Bytes)

NbC.meam (725 Bytes)

library2,fromWSK.meam (557 Bytes)

Dear Steve,

Thanks for the reply,

I have attached the code with the necessary files for the MEAM potential. I

i can run this on my test machine using the current version of LAMMPS
in serial or parallel without a problem. also running with valgrind
doesn't say a peep (outside of the nasty things that OpenMP does, that
don't count). do you need to run with a specific number processors to
trigger the segfault. are you sure it is not a problem due to running
out of disk space? this job writes out an awful lot of restart files
at 375kB each.
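
A quick way to check the disk space question, sketched with standard tools. The helper names are made up for this example, and the glob passed to count_files has to match whatever naming the restart command in the input actually uses. A per-user quota can be exhausted even when df reports free space; how to query it is site specific.

```shell
#!/bin/sh
# Free space, in KiB, on the filesystem holding directory $1 (default: .).
free_kb() {
    df -Pk "${1:-.}" | awk 'NR==2 {print $4}'
}

# Number of files directly inside directory $1 whose names match glob $2.
count_files() {
    find "$1" -maxdepth 1 -type f -name "$2" | wc -l
}

echo "free KiB here: $(free_kb .)"
# Example call, not run here: count_files . '*.restart'
```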

am also working on the problem using totalview debugger.

waste of money:

echo run > run.gdb
mpirun -np 4 xterm -e gdb -x run.gdb --args lmp_openmpi -echo screen -in meam,test,restart,parallel.txt

and you are in business. ;-)
works for valgrind, too.
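
On a cluster where no X display is available for xterm, a non-interactive variant may be easier. This is only a sketch: the helper name par_bt is made up, and it relies on gdb's -batch and -ex options to run the program and print a backtrace when it stops on a signal. Output from the ranks will be interleaved.

```shell
#!/bin/sh
# Start NP ranks, each under its own gdb in batch mode.
# $1 = number of ranks, remaining arguments = program and its options.
par_bt() {
    np=$1; shift
    mpirun -np "$np" gdb -batch -ex run -ex bt --args "$@"
}
# Example call, not run here (names taken from the thread):
#   par_bt 4 lmp_openmpi -echo screen -in meam,test,restart,parallel.txt
```

If a rank exits normally, gdb simply reports that there is no stack for it.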

axel.

Thanks,

There is enough disk space. I ran it with 4 processors and 16 processors, both resulted in the same error.

If it works on your machine, there might be some problem with the machine I am using. The strange thing is that I don’t have any problem when I use write_restart with the EAM potential in my other scripts.

Iman.