LAMMPS crashes with exit code 134

I’m running LAMMPS in parallel (an EAM simulation of mechanical loading) on a NERSC machine and find that the code crashes after about 1-2 hours of wall time (out of a projected 3 hours). I’ve run the same script twice now with the same outcome both times, albeit after a slightly different number of LAMMPS time steps in each run.
Investigating the output does not indicate any clear problem with the simulation (the model still looks reasonable, the physical output up to the point of the crash looks fine, no quantities are exploding…). The FAIL message I get from NERSC gives exit code 134. I could not find any LAMMPS documentation that explains the meaning of this code. I am also asking the staff at NERSC for help.

Any thoughts?

the exit code must be coming from the MPI library and thus won't help us much.
LAMMPS usually prints something to stdout before it commits suicide,
so you'll have to check the various job output files (not just the
log files, but also those from the batch system) to see if there is
anything useful in them.

if possible, try running in an interactive session (i don't know the
details at NERSC, but centers like that usually have a "debug" queue for this).

outside of that, try to reduce the problem set size, and see if you
can trigger the issue under circumstances where it takes only a few
minutes on a desktop with a handful (or two) of CPU cores. then you
could post it here and we can take a closer look.

axel.

Those exit codes are provided by Unix; usually I get exit code 1 for things like typos in an input script. I’ve only gotten 134 for a segfault in a piece of code I edited myself when testing.
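
As a general note on those codes: by the usual shell convention, an exit status above 128 means the process was killed by a signal, and the signal number is the status minus 128. Here 134 - 128 = 6 is SIGABRT, i.e. something called abort(); typical sources are an assert failure, a glibc heap-corruption check, or an MPI error handler, while a plain segfault more commonly shows up as 139 (128 + 11, SIGSEGV). A minimal Python sketch of that decoding, assuming the batch system passes the raw shell status through (the helper name is purely illustrative):

    import signal

    def describe_exit_code(code: int) -> str:
        """Rough interpretation of a shell exit status (illustrative helper)."""
        if code > 128:
            signum = code - 128
            try:
                name = signal.Signals(signum).name
            except ValueError:
                name = f"signal {signum}"
            return f"killed by {name} (exit code {code} = 128 + {signum})"
        return f"exited with status {code} (not signal-related)"

    print(describe_exit_code(134))  # killed by SIGABRT (exit code 134 = 128 + 6)
    print(describe_exit_code(139))  # killed by SIGSEGV (exit code 139 = 128 + 11)
    print(describe_exit_code(1))    # exited with status 1 (not signal-related)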

Thanks, Axel. I did not encounter any problems with the code when I ran it in serial, but will try again. My job is too big to run on the debug partition at NERSC. I’ll see what else they might suggest.

I did not see any other output files besides log.lammps (and the captured output to terminal, which just repeats the same info as log.lammps). Do you know any other place I could look?

I’m using the default build of LAMMPS on NERSC—no custom code.

BTW, they use the 06/28/2014 release of LAMMPS as the default.

batch systems typically have _two_ screen captures, regular output and
error output. sometimes they get combined. since log.lammps is usually
block buffered, it is still worth looking at the screen captures, as
they may have a few more lines of output at the end before the crash
happens. seeing the error channel capture would be even more helpful
(as that is typically unbuffered).

axel.
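
To illustrate the buffering point outside of LAMMPS (a generic sketch, not LAMMPS code, and the file name is illustrative): it assumes Python 3.9 or newer, where stderr stays line buffered even when redirected, which mirrors the C stdio default of an unbuffered stderr. Run it with stdout and stderr redirected to separate files and the stderr capture keeps the last lines, while the stdout capture usually loses them.

    import os
    import sys

    # Run as:  python3 buffer_demo.py > out.log 2> err.log
    for step in range(5):
        # stdout is block buffered when redirected to a file, so these lines
        # can sit in memory and never reach out.log if the process aborts.
        print(f"step {step}: thermo-style output")
        # stderr is line buffered, so these lines normally reach err.log
        # even right before a crash.
        print(f"step {step}: diagnostic note", file=sys.stderr)

    # Die via SIGABRT (shell exit status 134) without flushing stdio buffers:
    # err.log ends up with all five lines, out.log is usually empty.
    os.abort()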

if there is a way to access a newer version, it would be worth giving it a try.
since last summer, there have been quite extensive efforts to improve
overall code quality and consistency by using various analysis
and testing tools.

one issue that sometimes shows up on supercomputers is that fixes and
computes may segfault when there are no atoms in one or more of the
subdomains. this is most likely with a very large number of processors
and (somewhat) sparse geometries. we've done multiple passes to
eliminate these issues over time.

axel.

Actually you’re using a 2-year-old version, so I definitely second Axel’s
advice to try the current version. If that still fails, post the output
that would be coming to the screen if you were running interactively.
This should be in one of the batch queue output files. That is where
any LAMMPS message should be.

Steve

I did look at the screen capture output, but there is nothing interesting there: just the regular thermo output, interrupted at the point where the run crashed.

Good idea, I’ll work with NERSC to get a newer version.

My simulation doesn’t have much in the way of empty space, so load balancing shouldn’t be an issue.

Will do.

Does NERSC let you run your own executables? If so, you can just build the newest version of LAMMPS yourself and use it.

I don’t know, but I will check and report back.

Steve, Axel,

The issue has been resolved. I was accidentally using the wrong file system, one with a low storage quota. After switching to a different one, everything works fine.

The collateral benefit of this exchange is that NERSC will update to the newest version of LAMMPS.

Thanks for your help!
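
As a closing aside, since the root cause turned out to be output landing on a quota-limited file system: a cheap pre-flight check on the directory that dump and restart files will go to can catch this before hours of wall time are spent. A minimal Python sketch; note that shutil.disk_usage reports file-system-level free space, not the per-user quota, so it complements rather than replaces the site's own quota tools, and the 50 GiB figure is purely illustrative.

    import shutil
    import sys

    def check_free_space(out_dir: str, needed_gib: float) -> None:
        """Abort early if the file system holding out_dir looks too full."""
        free_gib = shutil.disk_usage(out_dir).free / 1024**3
        if free_gib < needed_gib:
            sys.exit(f"only {free_gib:.1f} GiB free under {out_dir}, "
                     f"but the run is expected to write ~{needed_gib:.1f} GiB")
        print(f"{free_gib:.1f} GiB free under {out_dir}, ok to submit")

    # Illustrative numbers: point this at the directory the job writes to.
    check_free_space(".", needed_gib=50.0)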