Dealing with separate instances of LAMMPS

_Allen_Thomas_Carlto · November 5, 2013, 11:36pm

Hi all,

I’m working on a program that couples LAMMPS to some other code, and as it currently stands, I split up my set of processors in MPI and run separate instances of LAMMPS on small groups of them (say 2-4 processors). A problem I’ve discovered with this is that if one instance fails, it won’t necessarily kill the program as a whole, and this can lead to hangs without a clear cause. I was wondering if there is a way to check the status of a given instance and kill the program if some group has failed, since I couldn’t find return values that would indicate that sort of thing in the source files I checked.

Alternatively, should I consider trying to find a way to partition the set of processors within a single LAMMPS instance? Would that be equivalent to the way I’m doing things (I don’t really want the separate partitions talking to each other)? I haven’t tried it yet, but this might at least address the hanging issues.

Thanks,
Thomas Allen

akohlmey · November 5, 2013, 11:43pm

> Hi all,
>
> I'm working on a program that couples LAMMPS to some other code, and as it
> currently stands, I split up my set of processors in MPI and run separate
> instances of LAMMPS on small groups of them (say 2-4 processors). A problem
> I've discovered with this is that if one instance fails, it won't
> necessarily kill the program in general, and this can lead to hangups
> without a clear cause. I was wondering if there was a way to check the
> status of a given instance and kill the program if some group has failed,
> since I couldn't seem to find return values which would indicate that sort
> of thing in the source files I checked.

i see two ways of addressing this issue.

a) you can change the default MPI error handling to make MPI errors
non-fatal and also modify the error class in LAMMPS to be more
forgiving and store a global variable somewhere that says "i am ok" or
"there was a problem". then you can check that variable from your top
level code.

b) you never call MPI_Abort on a partitioned communicator but always on
MPI_COMM_WORLD. that will then terminate the entire calculation, or at
least the MPI library will make an effort to kill everything as well as
it can.

i personally favor the second option and i've implemented that in one
of the codes that i am currently working on.
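
to make the two options a bit more concrete, here is a minimal sketch and not the code i mentioned above. all names in it (InstanceStatus, RecordingError, check_instances) are made up for illustration and are not part of LAMMPS. the abort call is passed in as a callback so that the sketch compiles without an MPI library; in a real MPI program that callback would presumably just call MPI_Abort(MPI_COMM_WORLD, errorcode), and the per-instance flags would first have to be collected across ranks, e.g. with MPI_Allreduce.

```cpp
#include <algorithm>
#include <functional>
#include <string>
#include <vector>

// Hypothetical status record kept by the top-level coupling code,
// one per LAMMPS instance. This is not a LAMMPS type.
struct InstanceStatus {
  bool ok = true;
  std::string message;
};

// Hypothetical "forgiving" error handler for option (a): it records the
// failure in the status object instead of terminating the process.
class RecordingError {
 public:
  explicit RecordingError(InstanceStatus &status) : status_(status) {}

  void report(const std::string &msg) {
    status_.ok = false;
    status_.message = msg;
  }

 private:
  InstanceStatus &status_;
};

// Top-level check. `flags` holds one entry per instance (nonzero = ok).
// Assumption: in a real MPI run the flags live on different ranks and were
// already combined over MPI_COMM_WORLD (e.g. MPI_Allreduce with MPI_MIN).
// `abort_world` is injected so the sketch builds without MPI; for
// option (b) it would call MPI_Abort(MPI_COMM_WORLD, errorcode).
bool check_instances(const std::vector<int> &flags,
                     const std::function<void(int)> &abort_world) {
  const bool all_ok = std::all_of(flags.begin(), flags.end(),
                                  [](int f) { return f != 0; });
  if (!all_ok) abort_world(1);  // error code 1 is an arbitrary choice
  return all_ok;
}
```

the point of the design is that the decision to stop is always taken for the whole job and never for a single sub-communicator, so no group is left waiting on ranks that have already died.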

> Alternatively, should I consider trying to find a way to partition the set
> of processors within a single LAMMPS instance? Would that be equivalent to
> the way I'm doing things (I don't really want the separate partitions
> talking to each other)? I haven't tried it yet, but this might at least
> address the hanging issues.

MPI libraries like OpenMPI allow the use of an "appfile" but i don't
think that will really help you.

axel.