[lammps-users] problems with multiple partitions

Hello,
I'm running simulations with -partition 32x4 using a `universe' variable,
and two kinds of problems happen from time to time.

The first is a large number in tmp.lammps.variable. The output is:

LAMMPS (9 Jan 2009)
Running on 32 partitions of processors
Initial {d} setting: value 2 on partition 1
Initial {d} setting: value 6 on partition 5
[...]
Increment via next: value 138 on partition 26
Increment via next: value 139 on partition 3
Increment via next: value 140 on partition 23
Increment via next: value 11299649 on partition 16
Increment via next: value 141 on partition 9
Increment via next: value 11299650 on partition 5
Increment via next: value 11299651 on partition 21
Increment via next: value 11299652 on partition 1

It looks like the tmp.lammps.variable file was somehow corrupted by a
race condition, but that's only my guess.

The other problem is
ERROR on proc 79: Failed to reallocate 262848 bytes for array fix_minimize:x0
The system is very small, so I can't have run out of memory, and this error
never occurs when -partition is not used (even when simulating systems
several times larger).

Has anyone else had these problems?
Can you think of any possible solutions?

Thanks,
Marcin

Is the tmp.lammps problem reproducible? Can you post an
input script with the runs themselves being as tiny as possible?

Re: the memory issue - I have seen these kinds of errors
when MPI grabs memory and doesn't give it up to the
application, but it has nothing to do with -partition.

Steve

Is the tmp.lammps problem reproducible? Can you post an
input script with the runs themselves being as tiny as possible?

With a script like this:

variable d universe C157-4-x0y8 C157-4-x2y0 C157-4-x2y10 C157-4-x2y12 &
C157-4-x2y14 C157-4-x2y16 C157-4-x2y18 C157-4-x2y20 C157-4-x2y22 &
C157-4-x2y24 C157-4-x2y26 C157-4-x2y28 C157-4-x2y2 C157-4-x2y30 &
C157-4-x2y32 C157-4-x2y34 C157-4-x2y36 C157-4-x2y38 C157-4-x2y4 &
C157-4-x2y6 C157-4-x2y8 C157-5-x0y0 C157-5-x0y10 C157-5-x0y12 &
C157-5-x0y14 C157-5-x0y16 C157-5-x0y18 C157-5-x0y20 C157-5-x0y22 &
C157-5-x0y24 C157-5-x0y26 C157-5-x0y28 C157-5-x0y2
echo both
print "VALUE: $d"
clear
next d
jump input

it happens in roughly one out of three runs on my clusters (I tried this
with partitions 4x4, 8x2, and 8x1).
The problem usually is that fscanf returns EOF when reading
tmp.lammps.variable.lock.
I added checks for the return values of fopen and fscanf, and if one of
them fails I retry a few times (I close the file, wait 10 ms, and try to
open and read it again). This almost always works. From time to time it
still happens that two worlds somehow get the same variable, or that
several variables are skipped, but I have no idea what to do about that.
The filesystem is NFS-mounted (rw,bg,tcp,intr) on Linux 2.6.22.
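
Roughly, the workaround looks like this (a sketch of the retry logic only,
not the exact code from my patch; the function name and retry count are
just for illustration):

#include <cstdio>
#include <unistd.h>   // usleep()

// Try to read the counter from the lock file, retrying instead of
// trusting the first fopen()/fscanf() attempt.
int read_counter(const char *lockfile, int &value)
{
  for (int attempt = 0; attempt < 5; attempt++) {
    FILE *fp = fopen(lockfile, "r");
    if (fp) {
      if (fscanf(fp, "%d", &value) == 1) {   // got a valid integer
        fclose(fp);
        return 0;
      }
      fclose(fp);                            // fscanf hit EOF or garbage
    }
    usleep(10000);                           // wait 10 ms and try again
  }
  return -1;                                 // give up after a few attempts
}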

Re: the memory issue - I have seen these kinds of errors
when MPI grabs memory and doesn't give it up to the
application, but it has nothing to do with -partition.

I didn't think the problem was inside LAMMPS, but in my case using
-partition makes it more likely. I've only seen the problem with
realloc(), so I increased LB_FACTOR in read_data.cpp to avoid reallocs.
I don't know yet whether it helps, though.

Marcin

This example fails for me once in a while also. I think it's
b/c LAMMPS is relying on the idea that if multiple procs
try to simultaneously rename() a file,
only one will be successful.
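
Schematically, the scheme is something like this (a simplified sketch of
the idea, not the actual variable.cpp code):

#include <cstdio>

// Each world tries to claim the shared counter by renaming the file.
// The assumption is that if several procs call rename() at the same
// moment, exactly one succeeds and the rest fail -- which apparently
// does not hold reliably on NFS.
bool claim_next_index(int &index)
{
  if (rename("tmp.lammps.variable","tmp.lammps.variable.lock") != 0)
    return false;                          // someone else holds the "lock"

  FILE *fp = fopen("tmp.lammps.variable.lock","r+");
  if (!fp) return false;
  fscanf(fp,"%d",&index);                  // read the current index
  rewind(fp);
  fprintf(fp,"%d\n",index+1);              // increment it for the next world
  fclose(fp);

  rename("tmp.lammps.variable.lock","tmp.lammps.variable");   // release
  return true;
}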

I thought that was an atomic operation, but now I don't think it is.
Looks like fcntl() is the right mechanism to do file locking
on NFS-mounted disks. I've got a query in to an expert here to see
if he can tell me how to use it correctly.

If your memory issue is one where a malloc is failing when there
should clearly be plenty of memory, then I don't think it is a LAMMPS
problem, but an MPI or system problem. I've seen/heard of it happening
occasionally elsewhere, but I don't know how to diagnose it further. It
would be great if you could instrument the code and find out more about
the root cause on your machine.

Steve

Here's what I've learned about file locking. NFS file
systems have problems with this. There is something
more low-level I can use to attempt it - the C-library
fcntl(). I don't believe there is any way to do what I want
in MPI itself. Fcntl() will hopefully work more reliably,
but the expert I talked to said it can screw up (i.e. not truly
lock the file in an atomic fashion), and it can't tell you
it had a problem. If your NFS (or other file system) is
running "lockd", then it is guaranteed to work.
That is a daemon which would be running on your disk server -
not something controllable by LAMMPS.

All of this should be rather moot if the jobs you are trying
to run via the universe variable, as you described them, are
reasonably big jobs of (slightly) varying lengths, so that you're
not unlucky enough to have 2 or more procs banging on the
same file at precisely the same moment. The example we've
been using to expose the problem is bad in this sense b/c it
is a trivial "job" that doesn't do anything, so it runs lickety-split.
I would have thought a real set of LAMMPS jobs wouldn't have
this problem. But obviously you proved me wrong.

If I can figure out the opaque fcntl() man page, I'll try it in LAMMPS
and see if the perverse example does any better. But you might
check out your NFS file system and lockd. If anyone can send
a few lines of fcntl() code that does a simple file lock, I'd appreciate
it. Search for rename() in variable.cpp and you'll see what it needs
to replace.

Steve

Steve,

There is a lockd process running on the cluster I use. If you haven't
had time to work out how fcntl() works yet, I can try to read up on it,
and if I understand it I'll implement and test it over the weekend. But
let me know if you have already implemented it.

I'm also wondering why two jobs end up accessing the same file at
exactly the same moment. It often happens even after several hours
of running. The systems have the same number of atoms and run the same
number of steps, but the run times should still be slightly different.
Perhaps the jobs get synchronized by NFS while waiting for I/O
operations.

Thank you for your help,
Marcin

I haven't done anything further than my email, besides
stare blankly at the fcntl() man page. So have a go
if you like.

Steve

I'm attaching a patch, but it's not well-tested yet.

First, about the old approach: I added checks for the return values of
fopen() and fscanf() and increased the sleep time. This prevents at least
90% of the collisions on my cluster.

fcntl() locking is implemented according to the man page on Linux, but
I have no idea how portable it is. It also failed twice (the job ran
out of time) when I was testing it with the trivial script I sent
before, but I can't reproduce that any longer. Now it seems to work
fine.
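
The locking itself follows the usual pattern from the fcntl() man page,
roughly like this (a simplified sketch, not the patch verbatim):

#include <cstdio>
#include <fcntl.h>
#include <unistd.h>

// Acquire (F_WRLCK) or release (F_UNLCK) an exclusive advisory lock on
// the whole file; F_SETLKW blocks until the lock is available.
int lock_file(int fd, short type)
{
  struct flock fl;
  fl.l_type = type;
  fl.l_whence = SEEK_SET;
  fl.l_start = 0;        // from the beginning of the file ...
  fl.l_len = 0;          // ... to its end (0 = whole file)
  return fcntl(fd, F_SETLKW, &fl);
}

// usage:
//   int fd = open("tmp.lammps.variable", O_RDWR);
//   lock_file(fd, F_WRLCK);    // lock
//   ... read and rewrite the counter ...
//   lock_file(fd, F_UNLCK);    // unlock
//   close(fd);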

Marcin

locks.diff (2.41 KB)

I'll take a look in the next couple weeks. But this
doesn't sound promising. Since the simple test
should never take long, it sounds like the fcntl() is
making it hang. Which is worse than skipping entries
I think. Does the former solution with altered sleep
always complete, just mess up less often?

Steve

I'll take a look in the next couple weeks. But this
doesn't sound promising. Since the simple test
should never take long, it sounds like the fcntl() is
making it hang. Which is worse than skipping entries

During the next few days or weeks I'll test how it works with normal jobs.

I think. Does the former solution with altered sleep
always complete, just mess up less often?

Yes, it rarely skips any entries now.

Marcin

guys,

am i right that the whole issue is a "bag of tasks" kind of
parallelization? and that the scheduling protocol currently works by
each "world" in the "universe" having the whole list of strings, and
the decision on which task to do next is made by reading the
tmp.lammps.variable file, taking the index from there, incrementing it
and writing it back? and the rename to tmp.lammps.variable.lock
is there to keep other tasks from interfering?

instead of making the locking work on NFS (which is a thankless job, even with
NFSv4, due to too many badly run clusters with unsynchronized clocks), how about
setting up a communicator between all (me == 0) tasks in the universe and then
just making a regular call to a do_schedule() function on MPI rank 0. the various
me == 0 tasks then just ask the "grandmaster" for the next index and get either
that or a terminate message (index == -1) back at the next do_schedule().
do_schedule() would poll the message queue, process any pending requests, and
then go back to normal execution. if you do it with MPI_Isend() /
MPI_Irecv() / MPI_Wait() it should not collide with anything else.
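
a rough sketch of what i mean (names like do_schedule() are made up, the
polling here uses MPI_Iprobe() for brevity instead of pre-posted
MPI_Irecv(), and it assumes a communicator 'roots' over the me == 0
procs of each world has already been created):

#include <mpi.h>

// called periodically on universe rank 0: hand out the next task index
// to any world that asks, or -1 once the list is exhausted
void do_schedule(MPI_Comm roots, int ntasks, int &next_index)
{
  int flag;
  MPI_Status status;
  MPI_Iprobe(MPI_ANY_SOURCE, 0, roots, &flag, &status);    // poll, don't block
  while (flag) {
    int dummy;
    MPI_Recv(&dummy, 1, MPI_INT, status.MPI_SOURCE, 0, roots, &status);
    int reply = (next_index < ntasks) ? next_index++ : -1; // -1 = terminate
    MPI_Send(&reply, 1, MPI_INT, status.MPI_SOURCE, 1, roots);
    MPI_Iprobe(MPI_ANY_SOURCE, 0, roots, &flag, &status);
  }
}

// on the me == 0 proc of every other world: ask the "grandmaster"
int request_next_index(MPI_Comm roots)
{
  int dummy = 0, reply;
  MPI_Send(&dummy, 1, MPI_INT, 0, 0, roots);
  MPI_Recv(&reply, 1, MPI_INT, 0, 1, roots, MPI_STATUS_IGNORE);
  return reply;
}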

just my two units of minor currency,
    axel.

the various
me == 0 tasks then just ask the "grandmaster" for the next index and get either
that or a terminate message (index == -1) back at the next do_schedule().

The problem is that there is no "grandmaster" task looking for these messages.
Every proc is doing tasks, and it makes little sense to add code in
the timestepper to look for these messages when this is a rarely used mode.

There is also no way for an MPI sender to interrupt the receiver.
MPI 2 does have one-sided comm calls that implement a shared memory
location, but they won't solve this problem
either, b/c there is no atomic get-and-accumulate. Besides, I don't want
to require MPI 2.

So I'm pretty confident there is no good way to do this in MPI.

Steve

Just an update: I've tested the fcntl() locking on real samples since I
sent the patch, and it seems to work fine. There were >1000 attempts to
access the locked file, and it never happened that the same variable was
assigned to two worlds or that one was skipped.

Marcin