Working with Disk Quotas

In place of a feature request issue where I’m not even too sure what the request would be, let’s try this. I’ve recently run into a problem I suspect others here know: our HPC system uses BeeGFS with enforced disk quotas for storage. New users start with low default values; quotas are increased on request.

In any case, this led to an interesting situation with a LAMMPS run: once the hard quota is exhausted, disk writes fail with EDQUOT (Disk quota exceeded), but since LAMMPS usually doesn’t check the result of fputs, fwrite & co, it just keeps going without producing any output. This is mostly an annoyance: all monitoring looked like the simulation was still running, but it wasn’t useful anymore.

I have since found out about issue #934 “Checking restart file when saving / before loading” and the implementation in “Add support to detect incomplete restart files and insufficient diskspace conditions”, which added the diskfree option to fix halt. Downside: that doesn’t take quotas into account, and as far as I can tell, BeeGFS doesn’t even offer an interface like the standard quotactl(2), so that wouldn’t help anyway.

Other programs (or even their runtime, e.g. Fortran) just die on write errors, but it seems that would be difficult to do now, as the original remark from 2018 still stands:

But who’s going to do this “dirty deed”? there are loads of writes and reads scattered across many styles

The best I could come up with so far is utils::logmesg, which is a nice single place to catch at the very least the case where lmp->logfile is on a disk that is full. But that is so incomplete that I wouldn’t even consider it for a PR.

Any other ideas? How do people generally handle that? I mean, I ran into the issue once, have asked and received a higher quota and now it’s not going to occur again so I’m probably overthinking it, but still…

Best,
Sebastian

I think one important issue is that by the time you hit the hard quota limit, it is already too late to do any kind of “rescue operation”. The only item you can “rescue” is your CPU time budget.
This is made worse by buffered I/O: the real write failure only shows up once a buffer is full and needs to be flushed, so you also lose the buffer content (typically 4k or 8k of data).
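To illustrate the buffering point, here is a minimal sketch, not LAMMPS code. It assumes Linux, where the special file /dev/full rejects every write with ENOSPC, and uses that as a stand-in for a quota failure (a real quota failure would report EDQUOT instead). The names WriteResult and try_buffered_write are made up for this example.

```cpp
#include <cerrno>
#include <cstdio>

// Outcome of one buffered write attempt (hypothetical helper type).
struct WriteResult {
  bool opened;       // fopen() succeeded
  bool write_ok;     // fputs() reported success
  bool flush_ok;     // fflush() reported success
  int flush_errno;   // errno captured after a failed fflush(), else 0
};

// Write a short line through stdio and record where the failure shows up.
WriteResult try_buffered_write(const char *path)
{
  WriteResult r{false, false, false, 0};
  FILE *fp = fopen(path, "w");
  if (!fp) return r;
  r.opened = true;

  // The text only lands in the stdio buffer here, so this normally
  // "succeeds" even if the underlying file cannot accept any data.
  r.write_ok = (fputs("step 100\n", fp) >= 0);

  // The buffer is handed to the kernel on flush; this is where the
  // error from the file system becomes visible.
  errno = 0;
  r.flush_ok = (fflush(fp) == 0);
  if (!r.flush_ok) r.flush_errno = errno;

  fclose(fp);
  return r;
}
```

On a typical Linux system, try_buffered_write("/dev/full") should report write_ok as true and flush_ok as false, which is the situation described above: the call that produced the data looks fine and the failure only appears later.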

What I would try to do is to set up some detection of running out of quota before you get to this point. There has to be some command or library call that can be used to retrieve the quota status. That could be wrapped into a python style variable, which could then be used in combination with fix halt to have a clean stop ahead of any problems. It may be a good idea to add a little howto for that to the manual. We have a system with GPFS (or whatever it is called now) using quotas, so I could contribute an example for that.

In order to “catch” write failures, I would suggest to do the same thing as we do for reading, i.e. have some wrapper functions like utils::sfgets().
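A write-side wrapper along those lines could look roughly like the sketch below. The name checked_fwrite and its signature are hypothetical, not an existing LAMMPS function, and it throws a std::runtime_error instead of calling into the LAMMPS Error class so that the example stands on its own.

```cpp
#include <cerrno>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical checked replacement for fwrite(): same arguments plus the
// file name for the error message. Throws if fewer items than requested
// were written or the stream's error flag is set.
void checked_fwrite(const void *buf, std::size_t size, std::size_t count, FILE *fp,
                    const std::string &filename)
{
  errno = 0;
  const std::size_t written = fwrite(buf, size, count, fp);
  if (written != count || ferror(fp)) {
    std::string msg = "Error writing to file " + filename;
    // errno is only meaningful if the failing call set it
    if (errno != 0) msg += std::string(": ") + std::strerror(errno);
    throw std::runtime_error(msg);
  }
}
```

Because of buffering, such a wrapper only sees errors at the moment the stdio buffer is actually written out, so a failed write may be reported by a later call than the one that produced the data. A flush or ferror() check before closing the file would still be needed to catch the last buffer.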

Absolutely. For me “disk write error” is a fatal failure, it would make sense for LAMMPS to also “fatally fail” instead of silently carrying on.

Something like sfputs could be done, but the dump styles use a mix of fprintf, fwrite, fputs etc., often as many small calls. It seems better to have one errno check at the end of each higher-level write operation… but then where is that? I can see why nobody has done that so far…

The python variable is a possible idea, although I don’t really like having to compile with Python just for that. But it would be fairly straightforward: beegfs-ctl has a (more or less) machine-readable output mode. Except I would also have to deal with MPI; polling this from hundreds of processes is probably not the best idea…

You could probably do something similar with a shell command, appending its output to a file and using a file style variable on that. The shell command would only execute on MPI rank 0, if I remember correctly.

For dump styles, a good place for a check would be at the end of the Dump::write() function.
Dumps are complicated because they may open a file per frame and write from one, multiple, or all MPI ranks. But something like this could work to trigger a hard abort (I personally favor the “soft” stop before the quota is exhausted):

  diff --git a/src/dump.cpp b/src/dump.cpp
  index 3569d32165..a7377809ef 100644
  --- a/src/dump.cpp
  +++ b/src/dump.cpp
  @@ -517,6 +517,8 @@ void Dump::write()
   
     if (refreshflag) modify->compute[irefresh]->refresh();
   
  +  if (fp && ferror(fp)) error->one(FLERR,"Error writing dump {}: {}", id, utils::getsyserror());
  +
     // if file per timestep, close file if I am filewriter
   
     if (multifile) {

Sounds to me like this is best solved with a local patch specific to our local environment and common uses.

I could see a new option, similar to fix halt diskfree, that calls out to beegfs-ctl on rank 0, parses the output and then broadcasts the result to the other ranks. Does anything look obviously wrong with this idea?