lammps errors and python interface

Hi all - as I’ve been doing for a while, I’m running LAMMPS on the bare edge of sensible trajectories, with the python interface. This means that occasionally I get trajectories that blow up and lead to lost atoms, and then LAMMPS quits. I know how to handle this from the point of view of the rest of my code (throw out the trajectory I was working on), but I don’t know how to deal with it from the point of view of the python interface. As far as I can tell my code just hangs.

Can anyone suggest any way to detect and restart a lammps process that was started from the python interface and eventually aborted?

thanks,
Noam

i assume this is using LAMMPS in parallel via MPI, right?
when LAMMPS hangs, it is already too late. there is little you can do
at this point.
the biggest issue in this case is MPI. there is no way to recover MPI
from within a running process. the MPI specs define that MPI_Init()
and MPI_Finalize() can only be called once, and if you are stuck in a
mismatched MPI communication or would need to call MPI_Abort() to
clear out the state of running processes, it is "game over". you can
only use MPI again after stopping the current mpirun/mpiexec session
and launching a new one.

so, you'd need to have some kind of "detector" that can determine such
a problematic situation, *before* it happens.

we've had a discussion on how to handle errors in the LAMMPS shared
library more gracefully and settled on adding support for C++
exceptions. thus if you have the very latest development version of
LAMMPS, you can compile the code with -DLAMMPS_EXCEPTIONS, and then
for "normal" problems, e.g. the case of lost atoms, you should be
able to recover.

keep in mind that this is very new functionality, so it may change,
and there may be some improvements in the python module for it.
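
for example, something along these lines should work from python once the library is built with exceptions enabled (just a sketch: the exact exception class raised by the wrapper may differ, and "in.melt" is a placeholder for whatever input script sets up your system):

from lammps import lammps

lmp = lammps()                 # wrapper around a -DLAMMPS_EXCEPTIONS build
lmp.file("in.melt")            # placeholder input script that sets up the system

try:
    lmp.command("run 10000")
except Exception as err:
    # with exceptions enabled, an error->all() such as "Lost atoms" surfaces
    # here as a python exception instead of aborting the whole process
    print("LAMMPS run failed: %s" % err)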

axel.

To amplify a bit, if the error you are getting via Python is that LAMMPS generates an error message and Python quits (e.g. due to LAMMPS calling MPI_Abort()), then the new LAMMPS_EXCEPTIONS option should allow you to continue. I.e. LAMMPS will generate the error message, but you come back to a Python prompt, so you could run LAMMPS again, i.e. a new script. This might also work if LAMMPS crashes (due to bad dynamics), i.e. control will be returned to Python.

I would be surprised if LAMMPS actually “hangs” due to bad dynamics.

Steve

Thanks. I’m not doing it interactively, so I’m not sure exactly where the flow goes, but it is generating an error. Where is this LAMMPS_EXCEPTIONS option documented? I don’t really understand what it might be an option to. The python class? How recent is it?

I have actually seen it do that (remap taking an absurd amount of time because it loops over how far to remap the atoms, rather than using int() on the lattice-coordinate positions), too, but not in this case.

Noam

To amplify a bit, if the error you are getting via Python is that LAMMPS generates an error message and Python quits (e.g. due to LAMMPS calling MPI_Abort()), then the new LAMMPS_EXCEPTIONS option should allow you to continue. I.e. LAMMPS will generate the error message, but you come back to a Python prompt, so you could run LAMMPS again, i.e. a new script. This might also work if LAMMPS crashes (due to bad dynamics), i.e. control will be returned to Python.
i disagree. LAMMPS will call MPI_Abort() only in the case of some part
of the code calling error->one().
that usually means that only one process is having a problem, and
that would most likely be a situation that you can't recover from,
unless you are only running with one MPI task.
what *can* be handled is the case of error->all(), and lost atoms
falls into that one.

axel.

Thanks. I’m not doing it interactively, so I’m not sure exactly where the
flow goes, but it is generating an error. Where is this LAMMPS_EXCEPTIONS
option documented? I don’t really understand what it might be an option to.
The python class? How recent is it?

it is an option that changes code inside of LAMMPS and the library interface.
http://lammps.sandia.gov/doc/Section_start.html#start-2 look for "Step 4".

it is *very* recent.

commit 639ab0fd3e37c2e463987d5c361967ac8946cbeb
Merge: 6c65af7 f5a50c3
Author: Steve Plimpton <[email protected]>

    Merge branch 'core/cpp_exceptions' of https://github.com/rbberger/lammps into error

I would be surprised if LAMMPS actually "hangs" due
to bad dynamics.

I have actually seen it do that (remap taking an absurd amount of time
because it loops over how far to remap the atoms, rather than using int()
on the lattice-coordinate positions), too, but not in this case.

that is a case where you would need to implement a "fix sanity/check"
that won't let LAMMPS continue when atoms move too far, too fast (or
some other detectable "bad" situation).
as explained above, with the exception handling enabled, you just set a
flag to 0/1 for processors without/with a problematic situation, do an
MPI_Allreduce() with MPI_SUM on them, and then call error->all() in
case the sum of all flags is > 0. that will generate a clean
exception on all processors without messing up MPI.
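
until such a fix exists, a rough approximation of that detector can be done from the python side, by splitting a long run into short chunks and checking the gathered velocities in between (only a sketch: the cap VLIMIT is an arbitrary placeholder, "in.melt" is a placeholder setup script, and gather_atoms() needs an atom map, e.g. "atom_modify map array" in the input):

from lammps import lammps

VLIMIT = 100.0                 # placeholder cap on |v| in the units of the input

lmp = lammps()
lmp.file("in.melt")            # placeholder setup script (must define an atom map)

for chunk in range(100):       # 100 x 100 steps instead of a single "run 10000"
    lmp.command("run 100")
    n = lmp.get_natoms()
    v = lmp.gather_atoms("v", 1, 3)    # per-atom velocities as doubles
    if max(abs(v[i]) for i in range(3 * n)) > VLIMIT:
        raise RuntimeError("runaway velocities, abandoning this trajectory")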

axel.

Thanks to all of you for the help. In this case it is a clean exit with error->all(), so exceptions could probably work. Do the C++ exceptions get magically turned into python exceptions? Or do I need to get python to call the lammps_has_error() function?

Noam

you should get an exception in python automagically. check this out:

https://github.com/lammps/lammps/blob/lammps-icms/python/lammps.py#L153

axel.

Great. One, hopefully final, question. What’s the state of the lammps object once the lost atoms exception is thrown, i.e. what do I need to do to continue computation? Clear the exception somehow? Recreate the system by deleting all the atoms and creating a new system?

thanks,
Noam

obviously, all system/simulation data is in an undefined state, so at
the very least, you would have to issue a "clear" command. you barely
escaped a full crash.
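
in practice that could look roughly like this from python (only a sketch; "in.setup" is a placeholder for an input file that rebuilds the box, atoms, potential and fixes from scratch):

from lammps import lammps

lmp = lammps()
for attempt in range(10):              # arbitrary retry limit
    try:
        lmp.file("in.setup")           # recreate the whole system
        lmp.command("run 10000")
        break                          # trajectory finished cleanly
    except Exception as err:
        print("trajectory blew up: %s" % err)
        lmp.command("clear")           # wipe the now-undefined simulation state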

axel.

Good point - I forgot the distinction between error->one() vs all().

If one proc out of many calls error->one(), I think you are hosed. No way to notify the other procs to exit back to Python.

Axel - any problem with having error->one() check if you are running on just one proc (world and universe), and if so, do not invoke MPI_Abort()? Then it would operate as if it were a call to all(), i.e. you could recover in Python and launch again?

Steve

So “hang” doesn’t mean forever, but just that it runs (much) more slowly?

Can you post a simple/small script that does this on one proc? Or is it only in parallel?

It sounds like what you are describing is that in one timestep you blow N atoms out of the box ~10^8 box lengths away, and then it takes 10^8 iterations to remap each atom back into the periodic box?

I’d like to understand what call sequence is doing that. Maybe it’s possible to just throw an error, since you are undoubtedly hosed if that happens.

Steve

if i read the code correctly, then it already handles the error->one()
situation differently from error->all() by throwing a different kind
of exception.
the "abort-like" exception contains a handle to the world
communicator, so it should be straightforward to catch this in
library.cpp as well, and then
run MPI_Comm_size() and report the error similarly to error->all() for
communicator sizes == 1 (and 0?), and call MPI_Abort() for everything
else.

library.cpp currently doesn't seem to catch this exception (but it should, IMO).

axel.

So “hang” doesn’t mean forever, but just that it runs (much) more slowly?

Can you post a simple/small script that does this on one proc? Or is it only in parallel?

It sounds like what you are describing is that in one timestep you blow N atoms out of the box ~10^8 box lengths away, and then it takes 10^8 iterations to remap each atom back into the periodic box?

Yes, that was what I decided was happening.

I’d like to understand what call sequence is doing that. Maybe it’s possible to just throw an error, since you are undoubtedly hosed if that happens.

My recollection is that it happened when I ran with rather large timesteps, so an atom could end up basically on top of another, and in the next step gain an absurd amount of kinetic energy. I’ll try to get an example that reproduces the problem. I’ll need to go back and figure out how I dealt with it, because I did come up with some sort of workaround, so it hasn’t happened in a while.

Noam