LAMMPS runs but no output

david_f · August 9, 2013, 11:52am

Hi guys,

I’ve been having this weird problem for quite a long time now.
I first noticed it after downloading the 4Aug2013 version: I started a job, and after some time it just got “stuck”. The processors are working (or at least seem to be, judging by ps -ef), however no further output is generated - no dump, no thermo, nothing at all.

Then I tried my good old 21Feb2013 version, and it still shows this problem.
No error message is generated, no warning, just this freezing of the output.

Help!

akohlmey · August 9, 2013, 2:48pm


the fact that it also happens with the old executable indicates that it
is not likely a specific LAMMPS problem, but something very specific
to the setup of the machine you are running on. unfortunately, you
are not providing any usable information in that regard.

does this happen only with a specific input or with any input (e.g. the
lammps examples/benchmarks)?

possible explanations are a full or broken hard drive or an exhausted
disk quota. when running on a cluster, there could also be a problem
with the network.
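A quick way to rule out the disk-space explanation is to check the file system that receives the dump and log files. Below is a minimal Python sketch, assuming the output goes to the current working directory; the helper name `check_output_dir` and the 1 GiB threshold are hypothetical choices for illustration. It does not inspect per-user quotas (the system's `quota` command, where available, reports those), but a quota that is already exhausted should make the small test write fail.

```python
import shutil
import tempfile

def check_output_dir(path=".", min_free_bytes=1 << 30):
    """Report free space and writability of the directory LAMMPS writes to.

    min_free_bytes is an arbitrary illustrative threshold (1 GiB by default).
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    # Create, write and delete a tiny file to confirm we can still write here.
    try:
        with tempfile.NamedTemporaryFile(dir=path) as tmp:
            tmp.write(b"x")
            tmp.flush()
        writable = True
    except OSError:
        writable = False
    return {
        "free_gib": usage.free / 2**30,
        "low_space": usage.free < min_free_bytes,
        "writable": writable,
    }

if __name__ == "__main__":
    print(check_output_dir("."))
```

Running it from the job's working directory while the run appears frozen shows immediately whether the disk is full or no longer writable.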

it would be helpful to determine *where* in the code it gets stuck.
you can do that by looking up the process id of rank 0 (typically the
lowest of the multiple LAMMPS process ids) and then using the gdb
debugger with the -p flag on this process id. this will attach the
debugger to the running process, and you can press ctrl-c to get to the
debugger prompt and then use the "where" command to get a stack trace.

axel.
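The procedure above can be partly scripted. Here is a minimal Python sketch for Linux, assuming the LAMMPS executable is named `lmp_mkl` (as later in this thread) and relying on the rule of thumb that rank 0 typically has the lowest PID among the matching processes. The helper names are hypothetical. Matching on the basename of `argv[0]` means the `mpiexec` launcher is not picked up, even when `lmp_mkl` appears among its arguments. The script only prints a `gdb` command built from gdb's standard `-p`, `-batch` and `-ex` options; it does not run it.

```python
import os

def list_processes():
    """Return (pid, argv) pairs for all processes visible in /proc (Linux only)."""
    procs = []
    if not os.path.isdir("/proc"):
        return procs
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/cmdline", "rb") as f:
                raw = f.read()
        except OSError:  # process exited meanwhile or is not accessible
            continue
        argv = [a.decode(errors="replace") for a in raw.split(b"\0") if a]
        if argv:
            procs.append((int(entry), argv))
    return procs

def rank0_candidate(procs, exe_name="lmp_mkl"):
    """Lowest PID whose argv[0] basename equals exe_name (heuristic for rank 0)."""
    pids = [pid for pid, argv in procs
            if os.path.basename(argv[0]) == exe_name]
    return min(pids) if pids else None

def gdb_command(pid):
    """gdb invocation that attaches to pid and prints a backtrace non-interactively."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "bt"]

if __name__ == "__main__":
    pid = rank0_candidate(list_processes())
    if pid is None:
        print("no lmp_mkl process found")
    else:
        print(" ".join(gdb_command(pid)))
```

Taking a couple of backtraces a few seconds apart while the output is frozen shows whether the process is stuck in one place or still looping.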

athomps · August 12, 2013, 9:31pm

In my experience, behavior like this usually indicates an issue with
the runtime environment (job scheduler, file system, etc.), not with LAMMPS.

david_f · August 13, 2013, 3:39am

I've tried to use the gdb debugger, although I have no prior knowledge of
debugging under Linux.
I submitted the job with "nohup mpiexec -np 80 $lammpsdir/lmp_mkl < in.input &"
and then started gdb with "gdb -p 10214", where 10214 is the process id of
rank 0.
When typing "where" at the gdb prompt, I get the following output:
#0 0x00002b369487db73 in __select_nocancel () from /lib64/libc.so.6
#1 0x00002b3696a1a534 in ?? () from /usr/lib64/python2.6/lib-dynload/select.so
#2 0x00002b3693c6ebf9 in PyEval_EvalFrameEx () from /usr/lib64/libpython2.6.so.1.0
#3 0x00002b3693c73735 in PyEval_EvalCodeEx () from /usr/lib64/libpython2.6.so.1.0
#4 0x00002b3693c6db75 in PyEval_EvalFrameEx () from /usr/lib64/libpython2.6.so.1.0
#5 0x00002b3693c73735 in PyEval_EvalCodeEx () from /usr/lib64/libpython2.6.so.1.0
#6 0x00002b3693c6db75 in PyEval_EvalFrameEx () from /usr/lib64/libpython2.6.so.1.0
#7 0x00002b3693c73735 in PyEval_EvalCodeEx () from /usr/lib64/libpython2.6.so.1.0
#8 0x00002b3693c6c642 in PyEval_EvalCode () from /usr/lib64/libpython2.6.so.1.0
#9 0x00002b3693c8b971 in ?? () from /usr/lib64/libpython2.6.so.1.0
#10 0x00002b3693c8ba24 in PyRun_FileExFlags () from /usr/lib64/libpython2.6.so.1.0
#11 0x00002b3693c8c388 in PyRun_SimpleFileExFlags () from /usr/lib64/libpython2.6.so.1.0
#12 0x00002b3693c96e6e in Py_Main () from /usr/lib64/libpython2.6.so.1.0
#13 0x00002b36947cebc6 in __libc_start_main () from /lib64/libc.so.6
#14 0x00000000004006e9 in _start ()

However, this output shows up both at the beginning (the first 16,000 steps),
when everything works fine, and after the freezing occurs.

Am I doing this right? If so, what can I learn from this output?

Thank you for your kind help

akohlmey · August 13, 2013, 3:42am


this looks like you are debugging a python script, not a LAMMPS
executable. are you sure this is not the mpiexec command?

axel.
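One way to confirm which program a PID belongs to before attaching gdb is to look at its executable link under `/proc`. Below is a minimal Linux-only sketch with hypothetical helper names. Reading another user's `/proc/<pid>/exe` may require the same permissions that gdb needs. A Python-based launcher shows up here as a `python` interpreter, which would be consistent with the `libpython2.6` frames in the trace above.

```python
import os

def describe_pid(pid):
    """Return (executable path, argv) for a PID, read from /proc (Linux only)."""
    exe = os.readlink(f"/proc/{pid}/exe")  # resolved path of the running binary
    with open(f"/proc/{pid}/cmdline", "rb") as f:
        argv = [a.decode(errors="replace") for a in f.read().split(b"\0") if a]
    return exe, argv

def looks_like_python(exe):
    """True if the executable's file name looks like a Python interpreter."""
    return os.path.basename(exe).startswith("python")

if __name__ == "__main__":
    import sys
    exe, argv = describe_pid(int(sys.argv[1]))
    print(exe, argv)
    if looks_like_python(exe):
        print("this is a Python interpreter (e.g. a launcher), not LAMMPS")
```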

akohlmey · August 13, 2013, 3:44am

if you have no prior knowledge of debugging under linux, then why didn't
you first try running the other example scripts distributed with LAMMPS?

axel.

david_f · August 19, 2013, 4:18pm

Hello again,

I tried running different kinds of simulations, and each of them shows this
behavior.
However, I have now run another one (with the 16Aug2013 LAMMPS version)
without the 'nohup' command, and after some time it showed this message:
"WARNING: H matrix size has been exceeded: m_fill=7466 H.m=7250 (../fix_qeq_reax.cpp:574)"

From this point on, no new output is generated, even though the
processors are running at full CPU load (99%), i.e. the infamous problem
I've been talking about in this thread.

I think this message, which I couldn't see before because of the 'nohup'
command, is related to the problem.

Also, I asked my sysadmin to check the stack trace with gdb; however,
she didn't find any problem with the process ID.

So, what is this "H matrix size" about?
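Running under `nohup` does not discard output by itself. When stdout is a terminal, `nohup` appends it to `nohup.out`, and LAMMPS normally also writes warnings to its log file (`log.lammps` by default), so a warning like the H matrix message can be found after the fact. Below is a minimal Python sketch that scans such a file. The regular expression is written for the message quoted in this post, and the default file name `nohup.out` is an assumption about how the job was launched.

```python
import re

# Pattern for the fix qeq/reax warning quoted in this thread.
H_MATRIX_RE = re.compile(
    r"H matrix size has been exceeded:\s+m_fill=(\d+)\s+H\.m=(\d+)")

def find_warnings(text):
    """Return all lines of LAMMPS output that contain 'WARNING'."""
    return [line.strip() for line in text.splitlines() if "WARNING" in line]

def h_matrix_overflows(text):
    """Return (m_fill, H.m) integer pairs from every H-matrix warning in text."""
    return [(int(a), int(b)) for a, b in H_MATRIX_RE.findall(text)]

if __name__ == "__main__":
    import sys
    path = sys.argv[1] if len(sys.argv) > 1 else "nohup.out"
    with open(path) as f:
        text = f.read()
    for w in find_warnings(text):
        print(w)
    for m_fill, hm in h_matrix_overflows(text):
        print(f"m_fill={m_fill} exceeds H.m={hm} by {m_fill - hm} entries "
              f"({100 * (m_fill - hm) / hm:.1f}%)")
```

For the values quoted above (m_fill=7466, H.m=7250), it reports an overshoot of 216 entries, about 3.0%.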

Ray_Shan · August 19, 2013, 4:22pm

This warning means that the H.m array in fix qeq/reax is running out of
memory. Use the safezone and mincap keywords of pair_style reax/c to
increase the array size, but please note that this usually occurs when
your system has bad dynamics.

Ray

sjplimp · August 20, 2013, 3:37pm

And you will probably be able to see the error more
obviously if you don’t run LAMMPS thru Python.
I.e. only run thru Python once you are certain
you have a working script.

Steve