[lammps-users] patch to prevent the library from killing the code linking it

Hi Steve, Dear All

While using the library interface of LAMMPS, I have noticed that the application linking it can be killed by the library under certain circumstances (e.g. an unrecognized command is passed to the lammps_command function); although the library interface is already quite useful as it is, this behavior is highly undesirable for a library, so I have created a small patch that tries to circumvent the issue without breaking any existing working code.

The patch essentially adds a flag to the Error class that changes the behavior of the universe_all and all methods, allowing them to throw an exception instead of killing the whole program. As said, this flag is unset by default, and by overloading the LAMMPS and Error constructors I kept the modification compatible with older code. The exception class that gets thrown lives in exception.cpp and exception.h, although I probably could have just thrown a (char *).
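Roughly, the idea looks like this (just an illustrative sketch of the approach, not the actual patch; the class layout and member names below are stand-ins for what is in the attached error.* and exception.* files):

    // sketch of the idea: Error::all() either aborts (current behavior, default)
    // or throws, depending on a flag set when LAMMPS is used as a library
    #include <cstdio>
    #include <cstdlib>
    #include <stdexcept>
    #include <string>

    class LammpsException : public std::runtime_error {      // stand-in for exception.h
    public:
      explicit LammpsException(const std::string &m) : std::runtime_error(m) {}
    };

    class Error {
      bool throw_on_error;       // unset by default -> old behavior preserved
      std::string last_error;    // kept so a C caller can query it afterwards
    public:
      explicit Error(bool use_exceptions = false) : throw_on_error(use_exceptions) {}
      void all(const char *msg) {
        if (throw_on_error) {
          last_error = msg;
          throw LammpsException(msg);   // library mode: let the caller decide
        }
        fprintf(stderr, "ERROR: %s\n", msg);
        exit(1);                        // executable mode: exactly as before
      }
      const char *last() const { return last_error.c_str(); }
    };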

In order to keep the library.h interface C-compatible (which is, btw, quite important for me...), I changed the code so that the exception is trapped before it reaches the user; the user can tell whether something went wrong by retrieving the last error (stored in the Error class as a private member) through an appropriate function.
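On the library side the trapping would look more or less like the sketch below (again only an illustration, not the attached library.cpp; the getter name lammps_last_error and the exact signatures are just placeholders):

    // library.cpp side, sketched: catch the exception so it never crosses the C boundary
    void lammps_command(void *ptr, char *str)
    {
      LAMMPS *lmp = (LAMMPS *) ptr;
      try {
        lmp->input->one(str);          // may now throw instead of calling exit()
      } catch (LammpsException &) {
        // the error text is already stored in the Error class; nothing leaks to C
      }
    }

    // illustrative getter so a C caller can check what went wrong
    const char *lammps_last_error(void *ptr)
    {
      LAMMPS *lmp = (LAMMPS *) ptr;
      return lmp->error->last();
    }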

The patch seems to work fine, doesn't affect normal LAMMPS operation (as an executable, LAMMPS behaves exactly the same) and doesn't break code compatibility. If you could find some time to give it a look and maybe consider it for inclusion in the distribution, I would be grateful.

Thanks for your attention,
Riccardo

PS: attached are the patch and the modified files.

The patch was created against the 24 January package (LC_ALL=C TZ=UTC0 diff -Naur lammps-24Jan10 lammps-24Jan10-patched) and can be applied, from the src directory, with the command: "patch -i /tmp/patch.txt -p2" (sorry if I'm a bit too verbose, but I'm quite new to the whole patch/diff thing...)

patch.txt (8.22 KB)

error.cpp (4.31 KB)

error.h (1.13 KB)

exception.cpp (213 Bytes)

exception.h (280 Bytes)

lammps.cpp (10.5 KB)

lammps.h (2.1 KB)

library.cpp (5.83 KB)

library.h (1.23 KB)

I have noticed that the
application linking it can be killed by the library under certain
circumstances (e.g. an unrecognized command is passed to the lammps_command
function); although the library interface is already quite useful as it is,
this behavior is highly undesirable for a library

Well, many useful libraries do this - such as MPI and Numerical Recipes.

I looked at your code additions. It's an interesting idea. But I don't see
how it would be useful. What is a scenario where the calling program generates
a LAMMPS error and can do something useful after that has occurred?
Also, there are many, many LAMMPS errors which occur deep in the program
and will leave LAMMPS in a bad state, where nothing further can be done, memory
is allocated in an odd way, etc. So I don't see how you would recover
from that.
Also, my understanding is that C++ exception handling can add considerable
overhead to a program, so that would be undesirable.

Steve

Steve Plimpton wrote:

I have noticed that the
application linking it can be killed by the library under certain
circumstances (e.g. an unrecognized command is passed to the lammps_command
function); although the library interface is already quite useful as it is,
this behavior is highly undesirable for a library
    
Hi Steve,

and first of all: thank you for your time and for taking a look at my code.

Well, many useful libraries do this - such as MPI and Numerical Recipes.
  
Maybe I'm wrong, but as far as I know, MPI doesn't terminate the user program (it has never happened to me, at least): calls to MPI_Finalize and MPI_Abort just finalize the MPI environment (I'm referring to the C interface, but the differences in the C++ interface are probably purely cosmetic) and let the user do the rest, which doesn't necessarily have to be as simple as calling "exit"; as for the Numerical Recipes routines... well, I'm confident they are top notch from the "scientific computing" perspective, but they hardly make a case for good coding/planning style :wink:

I looked at your code additions. It's an interesting idea. But I don't see
how it would be useful. What is a scenario where the calling program generates
a LAMMPS error and can do something useful after that has occurred?
  
Any attempt at automated execution would probably run into this; below are some scenarios that come to mind right now.

The program might have some finalization to do: it's a bit of a dumb example, but every code involving network connections usually requires some sort of termination to make the peer happy (at least to let it know what happened); wrapping the code with a C or C++ interface to provide some kind of graphical interface would become easier if the code didn't kill the interface itself on error, giving the GUI the chance to show one of those (annoying) error windows M$ users seem to be so fond of... :slight_smile: (I'm the first not to like the idea of a GUI much; a web portal could perhaps make for a better case).

As for the reason I came across the problem in the first place, I have created a (quite raw and far from elegant...) Python interface to LAMMPS and I noticed that the interpreter was killed in case of error. Now, I can work around the problem (spawn another process, communicate with it through a pipe and so on...), but at that point the tight Python-LAMMPS integration through C becomes pointless, and falling back to invoking LAMMPS from the command line through os.system("lammps < something") becomes both more elegant and more effective.

I think there could also be many other cases, some regarding possible interfaces (scripting languages, portals, GUIs, etc...) and others which could be of some scientific significance; I'm not a chemist, but people might, for example, need a program that moves an atom in some way, and if a "Cannot compute PPPM" or some other error appears, the code might take some countermeasure, like moving the same atom a little less or taking some other course of action altogether. I'm sorry if this "example" (quotes intended) looks so dumb, but my point is: if a user has the opportunity to intercept errors and recover "in some way" from them, eventually there will be a use for it.

Also, there are many, many LAMMPS errors which occur deep in the program
and will leave LAMMPS in a bad state, where nothing further can be done, memory
is allocated in an odd way, etc. So I don't see how you would recover
from that.
  
I wasn't actually thinking of recovering the LAMMPS instance at all: instead, in many cases a user could create another LAMMPS instance, pick up the last restart and continue from where he/she left off.

Of course even the above scenario wouldn't always be feasible, because LAMMPS could be left in too bad a shape to be properly finalized (some fix may, for example, mess with memory pointers in a way that a free on them could crash the code with a segfault, and no exception throwing would help in that case...); however, it would still be a step forward toward better integration at the library level, and in many cases the "run the simulation - detect an error - close LAMMPS - restart" scenario would still work.

Some errors, however, don't seem to leave LAMMPS in a bad state at all; I'm not sure, but although passing a badly formatted command to the parser kills LAMMPS, the problem itself doesn't actually corrupt the state of the program, so in principle this is an error from which recovery could be possible. Since some of the LAMMPS code is provided by others, there's no way to know in advance whether an error is recoverable or not; however, a user might still have a chance to judge for himself for many of them.

Also, my understanding is that C++ exception handling can add considerable
overhead to a program, so that would be undesirable.
  
You are right, and that's the reason why I doubted my patch could be accepted in the first place: in defense of my solution, I would point out that, by the way it's designed, it will never interfere with normal LAMMPS operation, and _if_ it caused trouble of any kind, it would only be at the library interface level, where catching all exceptions before they reach the user is fairly easy anyway (I'm thinking about exceptions leaking to C users, which is my case).

I grepped the code before sending the patch, and although I didn't find a single exception in it, I thought my modification could still be a "one-time exception" to the "no exceptions" rule (sorry for the stupid word joke... :slight_smile: ). I thought quite a bit about it, and I don't think there are other solutions that could provide the same feature without turning LAMMPS completely upside down, which is unlikely to happen and would condemn the library interface to remain under-developed, which is a bit of a pity.

Anyway, I understand that the library interface is very much secondary compared with direct binary execution, and so I don't find the idea of rejecting exceptions as a cautionary measure unreasonable.

thank you again for your time and patience
Riccardo

ciao riccardo,

> I looked at your code additions. It's an interesting idea. But I don't see
> how it would be useful. What is a scenario where the calling program generates
> a LAMMPS error and can do something useful after that has occurred?
>

any attempt of automated execution would probably go through that, and
below are some scenarios that comes to my mind now.

The program might have some finalization to do: It's a bit of a dumb
example, but every code involving network connections usually requires
some termination of sort to make the peer happy (at least to make it
know what happened); wrapping the code with a C or C++ interface to
provide some kind of graphical interface would become easier if the code
wouldn't kill the interface itself on error, giving the GUI the chance
to show one of those (annoying) error windows M$ users seem to be so
fond of... :slight_smile: (i'm the first not to like much the idea of a GUI, a web
portal perhaps could make for a better case).

there are a number of problems with your suggestion. if you hook up
to a GUI, how do you plan to handle the case of parallel execution?
it would be _much_ better and more portable to use a socket interface
and remote-control lammps through that. this way you might not even
need to link to the library, but just launch lammps from a fork'd
off process, open a pipe and then feed it script commands with
write(2)/read(2). this way it is very easy to track the status
of your lammps execution.
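something along these lines, as an untested sketch (popen(3) does the fork/pipe plumbing for you; the executable name and the input lines below are made up):

    // minimal sketch: drive a lammps executable through a pipe (serial case)
    #include <cstdio>

    int main()
    {
      FILE *lmp = popen("lmp_serial", "w");   // executable name is an assumption
      if (!lmp) return 1;
      fprintf(lmp, "units lj\n");             // feed script commands line by line
      fprintf(lmp, "read_data data.lj\n");    // hypothetical data file
      fprintf(lmp, "run 100\n");
      pclose(lmp);                            // waits for lammps to exit, returns its status
      return 0;
    }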

As for the reason I came across the problem in first place, I have
created (a quite raw and all but elegant...) python interface to LAMMPS
and I noticed that the interpreter was killed in case of error. Now, I
can work around the problem (spawn another process, communicate with it
through a pipe and so on...) but at this point the tight python-LAMMPS
integration through C becomes pointless and falling back to invocation
of LAMMPS from the command line through os.system("lammps < something")
becomes both more elegant and effective.

your elegant solution has a serious flaw. as far as i understand the MPI
standard, you must not call MPI_Init()/MPI_Finalize() multiple times.
so any attempt to re-launch a failed call to lammps will be a problem.

I think there also could be many other cases, some regarding possible
interfaces (scripting languags, portals, GUI etc...) and other which can
be of some scientific significance; I'm not a chemist, but people might,
for example, have the need for a program that moves an atom in some way
and but if a "Cannot compute PPM" or some other error appear the code
might take some countermeasure, like moving the same atom a little less
or taking some other course of action altogether. I'm sorry if this
"example" (quotes intended) looks so dumb, but my point is: if an user
has the opportunity to intercept the errors and recover "in some way"
from them, there will be eventually a use of it.

what you'd need here is proper checkpointing/restarting. if the
internal state of the library is messed up by an error, there is usually
no hope to recover from it _and_ keep running. codes like lammps have to
_assume_ that the user knows what he/she is doing, or else people would
need to spend even more time programming for the unusual cases, and less
on programming the important case. keep in mind that you don't plan to
write software that does something vital. it is totally sufficient to
get a "bummer! you screwed up. better luck next time." type of error and
then have the calculation stop so that you can analyze what went wrong.
any attempt to recover without launching a completely new process will
always carry the risk of some "residual badness". first and foremost
people want results they can trust. writing restarts/checkpoints in a
timely fashion is the smartest way to recover from problems.

[...]

Some errors however doesn't seem to leave LAMMPS in a bad state at all,
I'm not sure but although passing a badly formatted command to the
parser kills LAMMPS, the problem itself doesn't actually corrupt the
state of the program so in principle this is an error from which recover

i don't agree with that assessment. you will for certain produce
memory leaks. most of the code simply was not written with that kind
of scenario in mind, and i think you are wasting your time if you
want to fix that. keep in mind that LAMMPS is running MPI in a
synchronous way, so it is not easy to communicate problems from one
node to others without adding lots of additional communication that
will impact performance. as i wrote before, the main concern is to
run correctly and efficiently. do you think you can implement
more graceful error handling without interfering with those goals?
i suspect not.

could be possible. Since some of the LAMMPS code is provided by others
there's no way to know in advance if an error is recoverable or not,
however an user might still have a chance to judge by itself for many of
them.

[...]

word joke... :slight_smile: ), I thought quite a bit about it, and I don't think
there are other solutions that could provide the same feature without
turning LAMMPS up side down completely, which is unlikely and condemning
the library interface to remain under-developed, which is a bit of a pity.

well, the kind of scenario that you are outlining really seems to
require a complete redesign of the code. what you are effectively
after is a more "mainframe-like" approach, i.e. each operation
is not considered complete until it is committed (which would be
done in an atomic operation). i think there is a lot of merit in that
kind of approach, since this would be a strategy to build an MD code
for a future generation of extremely large scale hardware, where MTBFs
are becoming so short that one has to _assume_ failures (of
communication or computation or other components). this would open
routes for all kinds of features that lammps currently does not have
(nor any other MD code i know of): growing or shrinking parallel jobs
to more or fewer nodes, depending on the requirements of the calculation
and the availability of resources, load balancing, fully transparent
checkpointing, "speculative" or redundant execution. these are all
things that make less sense with currently available hardware, but if you
add GPUs and the increasing discrepancy between I/O bandwidth and
compute capability to the equation, lots of things become meaningful.

while LAMMPS is a great example of a well designed, modular
scientific software package, i don't think it is fit for that kind
of usage, and you would be better off first developing a suitable
infrastructure and then adding the science.

sorry for the lengthy comments. this is a topic that has been
floating through my mind for quite a few years now, and it is
starting to bug me that we are still getting away with using
technology that is effectively from the stone age. there are
too many factors causing this to discuss them here
on the list. i'd be happy to discuss fault-tolerant
schemes for doing MD with people off-list.

cheers,
   axel.

Axel Kohlmeyer wrote:

ciao riccardo,

I looked at your code additions. It's an interesting idea. But I don't see
how it would be useful. What is a scenario where the calling program generates
a LAMMPS error and can do something useful after that has occurred?
  

any attempt of automated execution would probably go through that, and below are some scenarios that comes to my mind now.

The program might have some finalization to do: It's a bit of a dumb example, but every code involving network connections usually requires some termination of sort to make the peer happy (at least to make it know what happened); wrapping the code with a C or C++ interface to provide some kind of graphical interface would become easier if the code wouldn't kill the interface itself on error, giving the GUI the chance to show one of those (annoying) error windows M$ users seem to be so fond of... :slight_smile: (i'm the first not to like much the idea of a GUI, a web portal perhaps could make for a better case).
    
there are a number of problems with your suggestion. if you hook up
to a GUI, how do you plan to handle the case of parallel execution?
  
a GUI might be of some use for desktop users; however, mine was just an example.

it would be _much_ better and more portable to use a socket interface
and remote control lammps through that. this way you might not even need to link to the library, but just launch lammps from a fork'd
of process, open a pipe and then feed it script command with
write(2)/read(2). this way it is very easy to track the status
of your lammps execution.

I agree that what you are describing is a fairly easy (and general) way to handle the execution of a code, and that it would work with LAMMPS too (in fact that is how I'm planning to interface it right now), but it's not the most efficient one: the devil is in the details, and this kind of approach requires a lot of string parsing/formatting, which usually yields a fragile interface (text output is not guaranteed to remain consistent, even between minor versions of a program) and requires a lot of tedious work, whereas both problems are avoided by a proper library interface.

As for the reason I came across the problem in first place, I have created (a quite raw and all but elegant...) python interface to LAMMPS and I noticed that the interpreter was killed in case of error. Now, I can work around the problem (spawn another process, communicate with it through a pipe and so on...) but at this point the tight python-LAMMPS integration through C becomes pointless and falling back to invocation of LAMMPS from the command line through os.system("lammps < something") becomes both more elegant and effective.
    
your elegant solution has a serious flaw. as far as i understand the MPI
standard, you must not call MPI_Init()/MPI_Finalize() multiple times.
so any attempt to re-launch a failed call to lammps will be a problem.
  
well... I didn't know that (although I know you understand the MPI standard way better than I do, this still looks so strange I almost can't believe it! is it so with every implementation??). I guess it would only work in the serial case, then, which would probably interest only a very tiny minority of users.

I think there also could be many other cases, some regarding possible interfaces (scripting languags, portals, GUI etc...) and other which can be of some scientific significance; I'm not a chemist, but people might, for example, have the need for a program that moves an atom in some way and but if a "Cannot compute PPM" or some other error appear the code might take some countermeasure, like moving the same atom a little less or taking some other course of action altogether. I'm sorry if this "example" (quotes intended) looks so dumb, but my point is: if an user has the opportunity to intercept the errors and recover "in some way" from them, there will be eventually a use of it.
    
what you'd need here is using proper checkpointing/restarting. if the internal state of the library is messed up by an error, there is usually
no hope to recover from it _and_ keep running. codes like lammps have to
_assume_ that the user knows what he/she is doing, or else people would
need to spend even more time programming for the unusual cases, and less
on programming the important case. keep in mind, that you don't plan to
write software that does something vital. it is totally sufficient to
get a "bummer! you screwed up. better luck next time." type of error and
then have the calculation stop when you can analyze what went wrong.
any attempt to recover without launching a complete new process will
always have the risk of some "residual badness". first and foremost
people want results they can trust. writing restarts/checkpoints in a
timely fashion is the smartest way to recover from problems.

[...]
  

I hadn't seen things from this perspective; I think you are right.

Some errors however doesn't seem to leave LAMMPS in a bad state at all, I'm not sure but although passing a badly formatted command to the parser kills LAMMPS, the problem itself doesn't actually corrupt the state of the program so in principle this is an error from which recover
    
i don't agree with that assessment. you will for certain produce
memory leaks. most of the code simply was not written with that kind of scenario in mind, and i think you are wasting your time when you
want to fix that. keep in mind that LAMMPS is running MPI in a
synchronous way, so it is not easy to communicate problems from one node to others without adding lots of additional communication that
will impact performance. as i wrote before, the main concern is to run correctly and efficiently. do you think, you can implement a more graceful error handling without interfering with those goals?
it suspect not.
  
you are right again. It seems clear that my own use case (which doesn't involve MPI at all...) led me to underestimate how much parallel execution would complicate the whole problem...

could be possible. Since some of the LAMMPS code is provided by others there's no way to know in advance if an error is recoverable or not, however an user might still have a chance to judge by itself for many of them.
    
[...]

word joke... :slight_smile: ), I thought quite a bit about it, and I don't think there are other solutions that could provide the same feature without turning LAMMPS up side down completely, which is unlikely and condemning the library interface to remain under-developed, which is a bit of a pity.
    
well, the kind of scenario that you are outlining, really seems to require a complete redesign of the code. what you are effectively after is some more "mainframe-like" approach, i.e. each operation is not considered complete until it is committed (which would be
done in an atomic operation). i think there is a lot of merit in that
kind of approach, since this would be a strategy to build an MD code
for a future generation of extremely large scale hardware, where MTBFs
are becoming so short that one has to _assume_ failures (of
communication or computation or other components). this would open
routes for all kinds of features that lammps currently does not have
(or any other MD code i know): growing or shrinking of parallel jobs to more or less nodes, depending on requirements of the calculation and availability of resources, load balancing, fully transparent
checkpointing, "speculative" or redundant execution. these are all
things that make less sense with available hardware, but if you add GPUs and the increasing discrepancy between I/O bandwidth and compute capability to the equation lots of things will get meaningful.

while LAMMPS is a great example for a well designed, modular scientific software package, i don't think it is fit for that kind of usage and you would be better off to first develop a suitable
infrastructure and then add the science.
  sorry for the lengthy comments. this is a topic that has been
floating through my mind for quite a few years now and it is
starting to bug me, that we are still getting away with using
technology that is effectively from the stone age. there are too many factors that are the causing this, to discuss it here
on the list. i'd be happy to discuss about fault tolerant
schemes for doing MD with people off-list.
  
"thank you for the lengthy comment", I would say: it was informative and you made some interesting points.

As for the topic of the thread, I think you stated your case pretty well and convinced me: I guess there's not much point in adding another level of complexity to LAMMPS just for the sake of the community of people "running it serially wrapped in another scripting language" (which probably counts only me, and I do have a plan B already). Regardless of how small the impact on the code/execution would be, it would still be a waste of time anyway.

Btw, I hope that my (somewhat over)zealous defense of my patch hasn't been taken as criticism of LAMMPS: it definitely is an impressive piece of code (kudos to Steve and the other developers for their work).

cheers,
   axel.
  
Ciao,
Riccardo

a few more comments... (sorry, can't help it).

it would be _much_ better and more portable to use a socket interface
and remote control lammps through that. this way you might not even need
to link to the library, but just launch lammps from a fork'd
of process, open a pipe and then feed it script command with
write(2)/read(2). this way it is very easy to track the status
of your lammps execution.

I agree that what you are describing it's a fairly easy (and general) way to
handle the execution of a code, and that it would work with LAMMPS too (in
fact it is how I'm planning to interface it right now), but it's not the
more efficient one: the devil is in the details, and this kind of approach
requires a lot of string parsing/formatting which usually provides a frail
interface (text output is not guaranteed to remain consistent, even between
minor versions of a program) and requires a lot of tedious work, where both
problems are avoided by a proper library interface.

actually, no. the library interface does nothing more than pass text
lines to the lammps parser (either reading from a file, or line by line).
this can just as easily be relayed with a simple protocol. outside of that
you have the passing of coordinates back and forth, and reading information
from computes/fixes/etc. that can be relayed as well. just prefix it with
the proper tags. check out the fix IMD code, which uses the IMD protocol
of VMD to exchange coordinate data over a socket. btw: in some sense,
i've also written a hack a while ago that reads coordinates from a pipe
(this was used to feed lammps coordinates generated in VMD, run a few
steps of minimization/MD and then return the energy. the VMD part of
that was written entirely in tcl, BTW).

in any case, the more elegant way to do what you want (if it is still
the same thing that you told me in december) would be to do this as a "fix"
and then define a communication protocol. the biggest problem of the
library interface as far as i read it (never used it, though) is that
you cannot interact with a running simulation, but that is exactly what
you need. otherwise you always have the overhead of initializing the
system with each "run" and then have to wait until it is fully processed.
how did you plan to exchange other information? adding to the library
interface via more (ugly) pointer magic?

[...]

Btw, I hope that my (somewhat over)zelous defense for my patch hasn't been
taken as a criticism of LAMMPS: it definitely is an impressive piece of code
(kudos to Steve and the other developers for their work)

don't worry about that. this is just the normal process of a discussion
about implementing a feature with rather deep impact on program
design. in those cases, concerns _have_ to be raised, and it is usually
a productive experience for both sides. i have been in the same position
that you are in now quite often and for various projects, and i do appreciate
people's concerns. the larger a package gets, the more difficult it becomes
to keep it maintainable, and caution and paranoia from one side are just
as important as passion and zeal to change things for the better. for as long
as the discussion continues with respect and mutual appreciation, that is.

in that sense, good luck and don't get discouraged.

ciao,
   axel.

Axel Kohlmeyer wrote:

a few more comments... (sorry, can't help it).
  
I tend to suffer from the "last word syndrome", so I'll try to keep a firm hold on my passions, wait for your next reply (if any) and call it a day :wink:

it would be _much_ better and more portable to use a socket interface
and remote control lammps through that. this way you might not even need
to link to the library, but just launch lammps from a fork'd
of process, open a pipe and then feed it script command with
write(2)/read(2). this way it is very easy to track the status
of your lammps execution.

I agree that what you are describing it's a fairly easy (and general) way to
handle the execution of a code, and that it would work with LAMMPS too (in
fact it is how I'm planning to interface it right now), but it's not the
more efficient one: the devil is in the details, and this kind of approach
requires a lot of string parsing/formatting which usually provides a frail
interface (text output is not guaranteed to remain consistent, even between
minor versions of a program) and requires a lot of tedious work, where both
problems are avoided by a proper library interface.
    
actually, no. the library interface does nothing more than passing text
lines to the lammps parser (either reading from a file, or line by line).
  
In the case of LAMMPS this is true right now, and it's a good thing, since the input format and the way in which LAMMPS handles it allow for some interactivity during execution (you can pass commands through a pipe one by one) even when the program is launched as a binary (in the end, I don't need more than that myself); however, this isn't true in general (think of PWscf... you have to read the whole input file and only then does the execution start).

When I wrote about "a proper library interface", I was essentially thinking about a future development where, in order to get the temperature of your system, you call lmp->get_temperature() instead of grepping the logs.

I have more experience grepping output than I would have liked (GRID*cough*...), and even if you can carry on doing that, regardless of how well your regexps balance flexible against rigorous matching of the output, the resulting software is not the kind you would like to maintain (in fact, after as few as 14 months, I have no idea whether any of the programs I wrote are still compatible with the infrastructure they were designed for).

this can be as easily relayed with a simple protocol. outside of that
you have the passing of coordinates back and forth, and reading information
form computes/fixes/etc. that can be relayed as well. just prefix it with
the proper tags. check out the fix IMD code that uses the IMD protocol
of VMD to exchange coordinate data over a socket. btw: in some sense,
i've also written a while ago a hack, that reads coordinates from a pipe
(this was used to feed lammps coordinates generated in VMD, run a few
steps fo minimization/MD and then return the energy. the VMD part of
that was written entirely in tcl, BTW).
  

very nice! I'll give it a look: thank you for the pointer

in any case, the more elegant way of what you want to do (if it is still
the same that you told me in december), would be to do this as a "fix"
and then define a communication protocol. the biggest problem of the
library interface as far as i read it (never used it, though) is that
you cannot
  
I did that already: fix_ms2 has been up and running for about 2 months; it uses shared memory to transfer forces and positions back and forth between 2 LAMMPS instances and 1 PWscf code (I was concerned about the performance of the exchanges at the start; now that I know there's no reason to be concerned, I'll probably switch to some network transport...), and I'm still impressed by how easy it was to integrate the whole thing into LAMMPS: I almost still can't believe I put the fix together in one afternoon.

interact with a running simulation, but that is exactly what you need. otherwise
you always have the overhead of initializing the system with each "run" and
then have to wait until it is full processed. how did you plan to exchange
other information? adding to the library interface via more (ugly)
pointer magic?.

[...]

That was the 2nd approach, the first one being to patch the code directly (yep, a stupid but educational experience), which I did before taking the time to read the examples and learn about the library interface first (the aforementioned 2nd approach) and the documentation about fixes later. Please, don't say anything.... :wink:

The only reason I needed the library was to steer the execution slightly (essentially I need to fit the code into the running queue without the queue manager coming down and killing everything): nothing I couldn't have done with a pipe from the start, but since the director script would have been in Python, the more elegant way to handle that would have been to provide a "LAMMPS module" where I could have issued a LAMMPS.run(steps) command. When I created one and saw my interpreter die together with the code, well... it wasn't pretty.

I'm already working on plan B anyway, which is similar to what you suggested, except that I will not use fork, which IMO opens the door to a number of subtle bugs, especially when sockets and descriptors in general are involved, and adds an extra layer to the code (not required in my specific case); instead I will use spawn with I/O stream redirection.

In the end I didn't really need the patch I sent (plan B was already in my mind): I just thought it could have been a nice addition to the code, that's all.

Btw, I hope that my (somewhat over)zelous defense for my patch hasn't been
taken as a criticism of LAMMPS: it definitely is an impressive piece of code
(kudos to Steve and the other developers for their work)
    
don't worry about that. this is just the normal process of a discussion
about implementing some feature with rather deep impact on program
design. in those cases, concerns _have_ to be raised and it is usually
a productive experience for both sides. i have been in the same position
that you are in now, quite often and for various projects and do appreciate
people's concerns. the large a package gets, the more difficult is becomes
to keep it maintainable and caution and paranoia from one side is just
as important as passion and zeal to change for the better. for as long
as the discussion continues with respect and mutual appreciation, that is.

in that sense, good luck and don't get discouraged.

ciao,
   axel.
  
thank you for the encouragement, Axel, and for sharing your points of view.

Trieste is possibly even colder than Philly now; I don't know how life is there (not as exciting as here, for sure), but I hope you are enjoying it nonetheless.

Ciao,
Riccardo

PS: I'm thinking about replacing my "old" ATI GPU with a brand new Nvidia for gam...*cough* coding purposes: please, send Ben my regards

ciao riccardo,

Axel Kohlmeyer wrote:
> a few more comments... (sorry, can't help it).
>

I tend to suffer from the "last word syndrome", so I'll try to keep a
firm hold on my passions, wait for your next reply (if any) and call it
a day :wink:

i am worse than you are in that respect (it seems to be a property of
the job description that you always have to prove yourself right).

but there is one more - hopefully useful - suggestion,
so i'm granting you one more reply, too. ;))

[...]

When I was wrote about "a proper library interface", I was essentially
thinking about a future development where, in order to get the
temperature of your system, you call lmp->get_temperature() instead of
grepping the logs.

actually, *that* should already exist. have you looked at this?

void *lammps_extract(void *ptr, int category, char *id, char *name)

with category set to "3" and name set to a unique identifier,
you just set up a compute with the same identifier to
get the temperature:
http://lammps.sandia.gov/doc/compute.html
http://lammps.sandia.gov/doc/compute_temp.html
and - bingo! - no more grep needed.

...and this is not limited to temperature. if you look through
the scripts that have been posted here over the years, people
have done some crazy stuff with computes.
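as a rough, untested sketch of that pattern, using the extract signature quoted above ("mytemp" is just an example identifier, and how the id/name arguments map onto the compute, as well as the returned type, should be double-checked against library.cpp):

    #include <cstdio>
    #include "library.h"     /* the LAMMPS C interface */

    /* lmp is the opaque LAMMPS handle obtained earlier (e.g. via lammps_open) */
    void print_temperature(void *lmp)
    {
      /* set up a named compute from the script side, run, then pull its value */
      lammps_command(lmp, (char *) "compute mytemp all temp");
      lammps_command(lmp, (char *) "run 100");
      double *t = (double *) lammps_extract(lmp, 3, (char *) "mytemp", (char *) "mytemp");
      if (t) printf("temperature = %g\n", t[0]);  /* assuming the scalar comes back as a double* */
    }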

[...]

Trieste is possibly even colder than Philly now, I don't know how's life

no. please compare:
http://www.wunderground.com/cgi-bin/findweather/getForecast?query=TRS&wuSelect=WEATHER
with
http://www.wunderground.com/cgi-bin/findweather/getForecast?query=19122&wuSelect=WEATHER
it just feels that way to you, because you are not used to it. :wink:

there (not as exciting as here for sure), but I hope you are enjoying it
nonetheless

Ciao,
Riccardo

PS: I'm thinking about replacing my "old" ATI GPU with a brand new
Nvidia for gam...*cough* coding purposes: please, send Ben my regards

will do. i'll have to walk over to pay a visit to our trusty
Saeco Magic Comfort+ and have my morning cappuccino. somehow
it seems difficult to procure proper coffee (from trieste!),
but we have the next best thing ( http://www.oldcitycoffee.com/ )
from the famous(?) Reading Terminal Market.
http://en.wikipedia.org/wiki/Reading_Terminal_Market

a presto,
   axel.

There has been lots of back-and-forth between Riccardo and Axel
on this thread - I just have a few general comments.

maybe I'm wrong, but as far as I know, MPI doesn't terminate the user
program (and never happened to me at least)

The default MPI behavior is to abort on an error. Try
   int me; MPI_Comm_rank(-2,&me);
You can change this to giving an error return via
MPI_Errhandler_set(). However, some errors leave
MPI in a bad state where all future calls just return
an error. So you are effectively dead. Moreover, there
are errors that can occur inside MPI which cannot be
recovered from even to give an error return; they just abort,
e.g. dropping a message due to memory issues.
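For example, something like this (a rough sketch; whether a particular error is actually catchable this way depends on the MPI implementation):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
      MPI_Init(&argc, &argv);
      /* switch MPI_COMM_WORLD from abort-on-error to error-return */
      MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

      int me;
      int rc = MPI_Comm_rank(MPI_COMM_WORLD, &me);  /* now returns a code instead of aborting */
      if (rc != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING]; int len;
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "MPI error: %s\n", msg);
      }

      MPI_Finalize();
      return 0;
    }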

I agree with Axel that there are many LAMMPS errors which are
detected too deep in the code to do anything useful to recover. Even syntax
errors in an input command may not be detected until new classes
and memory have been allocated, and the code is not written
to allow those conditions to be recovered from, e.g. to avoid losing
memory, or to unset things that were partially set up before the
error was detected.

I do agree, however, that it might still be nice to let the
driver program trap these errors via the exception mechanism
you propose. You should just realize the only safe thing to
do is destruct LAMMPS and reinstantiate it. Even that can't
be guaranteed to be "safe", I don't think. At a minimum memory
could be lost - possibly the destruct() could crash.

If you want to interact with a running simulation, then a fix
is the way to do it - not the library interface. While the
library interface is barebones, it is really just meant to illustrate
what you can do. You can add any function you want to
library.cpp. All of the LAMMPS classes and data structures
are essentially exposed there, so you can poke and peek at
whatever you wish. The extract() function Axel mentioned is
one example of something that gives you that ability, as are
get_coords() and put_coords(). But your imagination is the only
limit.

I could add your exception wrapper to error.cpp and the lib interface
if you assure me of one thing I am unclear on. Since there
is no other exception/throw/catch logic in LAMMPS, will adding
this degrade the performance of normal LAMMPS at all, e.g. by
adding compiled code/logic to other parts of LAMMPS? Or does
it just affect the performance of the Error class and the library
interface itself (whichever routines catch the error)? If the latter
is the case, then I don't mind adding it, so the caller can
receive an error return instead of an abort.

Also, if you have a Python wrapper for LAMMPS, that
lets you issue commands to LAMMPS from Python, I'd like
to see it, and possibly add it to the distribution. I've known
that's possible thru the LAMMPS lib interface, but haven't done
it myself.

Steve

One more thing. If you want to interact
with a running simulation and not write a fix,
you can use the run command, in small increments, e.g.

run 1
driver program does something
run 1
driver program does something
run 1
driver program does something
...

If you use the run command options "pre no" and "post no",
then this is actually a fairly low-overhead way to continue a run,
i.e. it won't re-neighbor every time it starts up, etc. Although
I might recommend run 10 instead of run 1.
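Through the library interface this pattern would look roughly like the sketch below (check library.h for the exact signatures in your version; the setup script name is made up):

    /* sketch of the run-in-small-increments pattern through the C interface */
    #include <mpi.h>
    #include "library.h"

    int main(int argc, char **argv)
    {
      MPI_Init(&argc, &argv);
      void *lmp;
      lammps_open(0, NULL, MPI_COMM_WORLD, &lmp);
      lammps_file(lmp, (char *) "in.setup");      /* hypothetical setup script */
      lammps_command(lmp, (char *) "run 10");     /* first run does the full setup */
      for (int i = 0; i < 100; i++) {
        /* driver program does something here, e.g. via extract() or get_coords() */
        lammps_command(lmp, (char *) "run 10 pre no post no");
      }
      lammps_close(lmp);
      MPI_Finalize();
      return 0;
    }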

Steve

Steve Plimpton wrote:

There has been lots of back-and-forth between Ricardo and Axel
on this thread - I juts have a few general comments.

maybe I'm wrong, but as far as I know, MPI doesn't terminate the user
program (and never happened to me at least)
    
The default MPI behavior is to abort on an error. Try
   int me; MPI_Comm_rank(-2,&me);
  

Not as far as I know: the code you suggested doesn't abort, it just segfaults (mpich-shmem), which is what I expected it to do, and that's a different matter from my argument.

There's nothing strange about a coding error (like the example you pointed out above) resulting in a segmentation fault: the point being discussed (or at least I think so) is that a library should never cause the termination of a code on purpose.

The problem with the LAMMPS interface (and please keep in mind that I'm not pushing for the inclusion of my patch again; this is, from my point of view, more of an academic discussion than anything else) is that on errors it causes the termination of the whole program that links it.

If someone includes a fix in LAMMPS that dereferences NULL and causes a segmentation fault, no one will ever complain that the code crashed, at least not to you (in fact, you can't avoid it); but if a runtime error happens, LAMMPS, as a library, should make the error transparent to the user and guarantee that at least the LAMMPS instance can be correctly finalized (no segfaults or leaks of any sort), at least in principle.

Now, I will not repeat the arguments already made by Axel (if nothing else, because they prove me wrong :wink: ), and I agree that LAMMPS should probably keep going down the track it is already on, especially since a change of direction would bring very dubious benefits (as Axel correctly pointed out).

I would never trade the intuitive and clean "plug-in" interface LAMMPS features right now for a better library interface, but as a matter of principle, a library, to properly be called one, should have some well-defined characteristics, one of them being that it does not change the state of the calling program in ways other than those exposed to the caller itself (usually through argument passing), which includes killing the program altogether.

You can change this to giving an error return via
MPI_Errhandler_set(). However, some errors leave
MPI in a bad state, where all future calls just return
an error. So you are effectively dead. Moreover, there
are errors that can occur inside MPI which can not be
recovered from even to give an error return, they just abort,
e.g. dropping a message, due to memory issues.
  

Yes, but again those are examples that come as a result of misuse of the MPI interface itself (e.g. passing a wrongly shaped memory array, or a NULL pointer) or of some catastrophic system-wide error. For normal run-time errors, the error value can be retrieved and, although it gives the user no choice other than to finalize the MPI interface, it still doesn't mess up the user's memory in the process, at least not intentionally.

I agree with Axel that there are many LAMMPS errors which are
detected too deep to do anything useful to recover. Even syntax
errors in an input command may not be detected until new classes
and memory have been allocated, and the code is not written
to allow those conditions to be recovered from, e.g. to avoid losing
memory, or to unset things that were partially setup before the
error was detected.

I do agree, however, it might still be nice to let the
driver program trap these errors, via the exception mechanism
you propose. You should just realize the only safe thing to
do is destruct LAMMPS, and reinstantiate it. Even that can't
be guaranteed to be "safe" I don't think. At a minimum memory
could be lost - possibly the destruct() could crash.
  
You are right, and I have to admit that I thought about the "scenarios" I pointed out to you in the last mail, and I concluded that some of them were quite bogus.

Neither keeping an unstable LAMMPS instance that could be deeply corrupted by an error, nor risking a memory leak (if not an outright segfault) by finalizing the corrupted instance, is the kind of risk I would take in one of my own codes, so why should someone else?

If you want to interact with a running simulation, then a fix
is the way to do it - not the library interface. While the
  
I did 2 of them already; btw, the interface for including them is really *great*.

library interface is barebones, it is really just meant to illustrate
what you can do. You can add any function you want to
library.cpp. All of the LAMMPS classes and data structures
are essentially exposed there, so you can poke and peek at
whatever you wish. The extract() function Axel mentioned is
one example of something that gives you that ability, as are
get_coords() and put_coords(). But your imagination is the only
limit.

I could add your exception wrapper to error.cpp and the lib interface
if you assure me of one thing I am unclear on. Since there
is no other exception/throw/catch logic in LAMMPS, will adding
this degrade the performace of normal LAMMPS at all, e.g. by
adding compiled code/logic to other parts of LAMMPS? Or does
  it just affect the performance of the Error class and the library
interface itself (whichever routines catch the error). If the latter
is the case, then I don't mind adding it, so the caller can
receive an error return instead of an abort.
  
The latter should be the case, but I admit my ignorance on the topic. I have searched the internet a bit and it seems to be compiler dependent: it is possible to produce zero-cost exception handling code (and g++ does that), but some compilers might add a very slight cost, at least in the routines catching the exceptions.

Anyway, as strange as it might seem, at this point I would advise you to drop my patch anyway: after the discussion with Axel it seems it would be of very dubious benefit, and as sure as I am that it would not degrade anything, it just isn't worth the trouble...

I just hope that the discussion was worth something by itself (I do think so, at least) and that I didn't waste your time.

Also, if you have a Python wrapper for LAMMPS, that
lets you issue commands to LAMMPS from Python, I'd like
to see it, and possibly add it to the distribution. I've known
that's possible thru the LAMMPS lib interface, but haven't done
it myself.
  

Ok: I'll clean it up a bit in the next weeks and send it as soon as it is no longer embarrassing (and the MPI support will be put back in: I didn't need it, so I removed it...).

For now, although it works, it's a mix of SWIG (which just takes care of all the module initialization and type creation... I'm quite lazy) and pure C-Python code. A poor thing.

Riccardo

Steve Plimpton wrote:

One more things. If you want to interact
with a running simulation and not write a fix,
you can use the run command, in small increments, e.g.

run 1
driver program does something
run 1
driver program does something
run 1
driver program does something
...

If you use the run command options for pre no and post no,
then this is actually a fairly lo-overhead way to continue a run.
I.e. it won't reneighbor every time it starts up, etc. Although
I might recommend run 10, instead of run 1.

Steve
  

Thank you, I think I'll do something like that!

Just a quick question: would a patch to include a command to stop LAMMPS cleanly after a set wall time be of some use?

I'm not seeking approval in advance, of course, but if the feature might be of some interest, I might find some time to put some code together...

Cheers,
Riccardo

Also, if you have a Python wrapper for LAMMPS, that
lets you issue commands to LAMMPS from Python, I'd like
to see it, and possibly add it to the distribution. I've known
that's possible thru the LAMMPS lib interface, but haven't done
it myself.

Ok: I'll clean it a bit in the next weeks and send it as soon as it will be
not embarrassing (and the MPI support will be inserted again: I didn't need
it so I removed it...)

For now, although it works, it's a mix of SWIG (which just takes care of all
the module initialization and types creations... I'm quite lazy) and pure
C-Python code. A poor thing.

You should look at the Python ctypes module. You can do this with no SWIG,
no extra C code. Just a bit of Python. Ctypes is amazing in my
opinion. I've wrapped stuff much more complex than the LAMMPS interface
with it, and it just works.
E.g. see our mapreduce library (www.sandia.gov/~sjplimp/mapreduce.html)

Steve

Steve Plimpton wrote:

Also, if you have a Python wrapper for LAMMPS, that
lets you issue commands to LAMMPS from Python, I'd like
to see it, and possibly add it to the distribution. I've known
that's possible thru the LAMMPS lib interface, but haven't done
it myself.

Ok: I'll clean it a bit in the next weeks and send it as soon as it will be
not embarrassing (and the MPI support will be inserted again: I didn't need
it so I removed it...)

For now, although it works, it's a mix of SWIG (which just takes care of all
the module initialization and types creations... I'm quite lazy) and pure
C-Python code. A poor thing.
    
You should look at the Python ctypes module. You can do this with no SWIG,
no extra C code. Just a bit of Python, Ctyes is amazing in my
opinion. I've wrapped stuff much more complex than the LAMMPS interface
with it, and it just works.
E.g. see our mapreduce library (www.sandia.gov/~sjplimp/mapreduce.html)

Steve
  
Thank you for the suggestion; I actually know the ctypes module already, but I usually prefer not to use it: my arguments against it are quite weak (I think they fall under "personal preference"), but I think the module is a bit too dangerous for most uses.

The nice thing about hand-crafted modules, in my opinion, is that if you do things well enough, you aren't going to expose to the user anything that could potentially crash his/her code; SWIG, too, requires some proxy classes to work in a safe way.

I agree ctypes is amazing (it's the first thing I thought when I saw it), but it is also a very big gun to potentially shoot yourself in the foot with... :slight_smile:

Although I'm going to change this for the Python interface to LAMMPS (thanks for the interest shown in it, btw!), my usual approach is to use SWIG to handle the "boring parts", like module tables and initialization, and to code the most interesting parts in C: it's not very clean, but it's fast and allows me to handle some structures better than SWIG would.

Cheers,
Riccardo