[lammps-users] NFS output difficulties

I wanted to pass along a potential problem I’ve come across in doing output via the dump command. When running simulations my dump files sometimes stop being filled (i.e. the files are created but are totally empty), then after many thousands of timesteps the dump files return to normal. The problem is relatively rare so it is not a critical bug.

Despite running a series of tests, I have not been able to find the source of this problem. When I talked it through with the system manager of the computing cluster I am using, he thought it might be a problem with the way the LAMMPS C++ code interacts with the Network File System (NFS).

The ‘NFS’ filesharing protocol that is used to share disk space across the cluster (at least the one I’m using) has a few apparent quirks…

  • File attributes (including size) are cached locally and will not always
    be up-to-date with the version on disk.

  • Attribute updates can be delayed until the machine writing the file
    explicitly asks. This happens on close.

In addition,

  • Linux only pretends to write directly to files. The data is really held
    in memory until there is enough to flush to disk or 30 seconds has
    passed.
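
Since the attribute updates listed above happen on close, one workaround is to close the output file after every snapshot and reopen it in append mode for the next one. A minimal sketch, assuming a POSIX system; `append_snapshot` and `file_size` are hypothetical helpers written for illustration and are not LAMMPS functions:

```cpp
#include <cstdio>
#include <string>
#include <sys/stat.h>

// Hypothetical helper (not part of LAMMPS): append one snapshot and close the
// file immediately, so buffered data and attributes are handed over on close.
bool append_snapshot(const std::string &path, const std::string &text) {
    std::FILE *fp = std::fopen(path.c_str(), "a");
    if (!fp) return false;
    bool ok = std::fputs(text.c_str(), fp) >= 0;
    // fclose flushes the stdio buffer; fail if either the write or the close failed.
    ok = (std::fclose(fp) == 0) && ok;
    return ok;
}

// File size as reported by stat(), or -1 on error. On an NFS client this value
// may come from the local attribute cache rather than from the server.
long long file_size(const std::string &path) {
    struct stat st;
    if (stat(path.c_str(), &st) != 0) return -1;
    return static_cast<long long>(st.st_size);
}
```

Calling `append_snapshot("dump.txt", snapshot_text)` once per output step trades some speed for fresher file contents; whether that trade is worthwhile depends on how often snapshots are written.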

That said, this is only one idea of what might be causing the problem I’m seeing, so if anyone else has had this kind of difficulty with LAMMPS and was able to identify or solve it, please let me know.

Thanks,

Dan

> I wanted to pass along a potential problem I've come across in doing output
> via the dump command. When running simulations my dump files sometimes stop
> being filled (i.e. the files are created but are totally empty), then after
> many thousands of timesteps the dump files return to normal. The problem is
> relatively rare so it is not a critical bug.

dan,

there are two issues that play into this behavior.

1) general i/o buffering. the stdio library (which is part of
the basic c-library and which LAMMPS uses) has an integrated
buffering scheme for increased performance. data is written in
blocks the size of which is determined by the operating system.
typical sizes are 4-8kbyte.

2) time synchronization of the NFS server. if the time source
of the NFS server is running ahead of the client system, then
files that are newly written appear to be _very_ old (due to
integer wraparounds) and thus throw all attempts at client-side
i/o caching totally off balance.

an NFS mount can be made synchronous (the "sync" mount option), but
that comes with a huge performance penalty.
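
the stdio buffering from point 1 can be observed even on a local disk. a minimal sketch, assuming a POSIX system: `demo_stdio_buffering`, `size_on_disk` and the 8 KiB buffer size are choices made for illustration, not anything taken from the LAMMPS sources.

```cpp
#include <cstdio>
#include <sys/stat.h>

// Size of the file as the filesystem currently reports it, or -1 on error.
long long size_on_disk(const char *path) {
    struct stat st;
    return stat(path, &st) == 0 ? static_cast<long long>(st.st_size) : -1;
}

struct FlushDemo {
    long long before_flush;  // size seen while data still sits in the stdio buffer
    long long after_flush;   // size seen after fflush() handed the data to the OS
};

// Write one short record through stdio and measure the file size before and
// after fflush(). `path` is a scratch file chosen by the caller.
FlushDemo demo_stdio_buffering(const char *path) {
    FlushDemo r{-1, -1};
    std::FILE *fp = std::fopen(path, "w");
    if (!fp) return r;
    // Request full buffering with an explicit 8 KiB buffer (assumed size), so
    // the outcome does not depend on the platform default.
    std::setvbuf(fp, nullptr, _IOFBF, 8192);
    std::fputs("ITEM: TIMESTEP\n0\n", fp);  // far smaller than the buffer
    r.before_flush = size_on_disk(path);
    std::fflush(fp);
    r.after_flush = size_on_disk(path);
    std::fclose(fp);
    return r;
}
```

until the buffer fills or is flushed, another process looking at the file sees it as empty, which matches the "created but totally empty" symptom for short outputs.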

> Despite running through a series of test runs I have not been able to figure
> out the source of this problem. In talking through the problem with the
> system manager for the computing cluster I am using he thought it might be a
> problem with the way the LAMMPS C++ code interacts with the network
> filesystems (NFS).
>
> The 'NFS' filesharing protocol that is used to share disk space across the
> cluster (at least the one I'm using) has a few apparent quirks...
>
> * File attributes (including size) are cached locally and will not always
> be up-to-date with the version on disk.
>
> * Attribute updates can be delayed until the machine writing the file
> explicitly asks. This happens on close.
>
> In addition,
>
> * Linux only pretends to write directly to files. The data is really held
> in memory until there is enough to flush to disk *or* 30 seconds has
> passed.

files are also synchronized when calling "fflush" and particularly
when files are closed. this is all normal behavior and is explained
in any book on the unix i/o subsystem. i recommend having a look at
the classic book "advanced programming in the unix environment", which
goes over the i/o buffering issues in great detail.
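
there are two layers involved here: fflush() only moves data from the stdio buffer into the kernel, while the kernel's own write-back (the "30 seconds" quoted above) is forced with fsync(). a minimal sketch, assuming a POSIX system; `flush_to_disk` is a hypothetical helper name, not a LAMMPS or libc function:

```cpp
#include <stdio.h>    // FILE, fflush, fileno (POSIX)
#include <unistd.h>   // fsync (POSIX)

// Push buffered data for `fp` as far toward stable storage as the OS allows.
// Returns true only if both steps succeed.
bool flush_to_disk(FILE *fp) {
    // Step 1: stdio buffer (user space) -> kernel page cache.
    if (fflush(fp) != 0) return false;
    // Step 2: kernel page cache -> disk, or the server for a network mount.
    return fsync(fileno(fp)) == 0;
}
```

calling fsync() after every write is slow for the same reason a synchronous mount is slow, so it is best reserved for debugging or for rare, important output.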

cheers,
   axel.

Most file systems (including NFS) buffer output, meaning
it will only be written in chunks periodically. The chunk size is
a system attribute. You can use the dump_modify flush yes
command to ensure the file is fully written after every snapshot.
Ditto for thermo output to a log file via thermo_modify flush yes.

For dumps "flush yes" is the default. So this shouldn't be happening.
All the things you list as cached should be updated by a flush.

Steve