I am running a simulation on a local machine, and the trajectory files are written to a directory on an NFS server.
I found that when the NFS server goes down, lmp keeps running. Obviously, lmp is then running for nothing
because it can't write anything to the NFS server.
So I'm wondering whether lmp should check the validity of its output files every several steps,
or do something else to prevent this from happening.
Thank you for your great work.
I am running a simulation on a local machine, and the trajectory files are
written to a directory on an NFS server.
I found that when the NFS server goes down, lmp keeps running. Obviously, lmp
is then running for nothing because it can't write anything to the NFS server.
you have to read up a little about NFS and file i/o semantics.
lammps cannot even tell whether it is writing to an NFS server
or not, since the file i/o semantics in the stdio library do not
provide that information for as long as the NFS client code
does not signal a failure. if your NFS mount is a "hard" mount,
it will take a long time until such a failure is signaled.
depending on the other settings of the NFS mount, lammps
will first continue to run and write to a buffer until it runs out
of buffer space (determined by the OS) and will then stall
(which may happen as "busy waiting" depending on the
MPI library that you are using) until the file system becomes
writable again. and unless your system administrator did
something to the NFS server that messed up the inode
hashes, this is exactly what is supposed to happen. this
is defined in the NFS protocol.
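
to illustrate the point about buffered i/o, here is a minimal Python sketch (not LAMMPS code; `checked_write` is a hypothetical helper invented for this example). a plain write usually only fills a buffer, so an error can only surface once the data is flushed and synced. note the limitation described above: on a "hard" NFS mount the sync call would be expected to block while the server is down rather than return an error, so a check like this cannot detect an outage there.

```python
import os

def checked_write(path, text):
    """Append text and force it to storage; return True on success.

    Hypothetical helper for illustration only.
    """
    try:
        with open(path, "a") as f:
            f.write(text)          # usually only fills a user-space buffer
            f.flush()              # hand the buffer over to the OS
            os.fsync(f.fileno())   # ask the OS to push the data to storage
        return True
    except OSError:
        # Raised when the OS (or the NFS client) actually reports a failure,
        # e.g. a missing directory or a timed-out "soft" mount.
        return False
```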
So I'm wondering whether lmp should check the validity of its output files
every several steps, or do something else to prevent this from happening.
other people would be very upset if their calculation were
needlessly terminated (with a full i/o buffer still pending to be
written to disk) instead of continuing as if nothing had happened
at the moment the file server re-appears.
at least this is how things happen with the NFS servers that i
manage. people often don't even notice that i had taken down
an NFS server for a brief maintenance.
overall, this is more an issue of well-organized system administration
and proper workflow and data management by the user, rather than a
problem of lammps. if you don't want to run into the same situation,
you can just write your output to a local disk and then copy over
the results at the end of the job.
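
the "write locally, copy at the end" workflow could look roughly like the following Python sketch. everything here is an assumption for illustration: `run_then_copy` is a hypothetical wrapper, and it assumes that the default temporary directory (as chosen by `tempfile`, e.g. via `TMPDIR`) lives on a local disk and that the job writes its output into its working directory.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

def run_then_copy(command, dest_dir):
    """Run `command` in a local scratch directory, then copy its files to dest_dir.

    Hypothetical wrapper for illustration. Returns the copied file names.
    """
    scratch = Path(tempfile.mkdtemp(prefix="run_"))
    # The job writes into the scratch directory, not onto the file server.
    subprocess.run(command, cwd=scratch, check=True)

    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(scratch.iterdir()):
        if src.is_file():
            shutil.copy2(src, dest / src.name)
            copied.append(src.name)

    # Reached only if every copy succeeded; on failure the scratch
    # directory is left in place so that no results are lost.
    shutil.rmtree(scratch)
    return copied
```

a call might look like `run_then_copy(["lmp", "-in", "/home/user/in.run"], "/nfs/project/results")`, where both paths are made-up placeholders. the input file needs an absolute path because the job runs inside the scratch directory.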
Axel, thank you for your detailed reply!
I really appreciate you sharing so much knowledge with me.