Memory consumption of UnstructuredTextFileParser

If I parse a file with UnstructuredTextFileParser, the memory consumption increases by at least 5 times the file size. I don’t quite understand this, considering that most of the saved values are floats and the parser appears to already do the proper conversion (i.e., the floating-point representation should in fact be more memory efficient).
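For reference, this is roughly how one can observe that kind of growth by watching the resident set size with psutil; the file name is a placeholder, and the readlines call is just a stand-in for whatever the parser caches:

```python
import os
import psutil

def rss_mb():
    """Resident set size of the current process in MB."""
    return psutil.Process(os.getpid()).memory_info().rss / 1024**2

before = rss_mb()
# parser.parse() would go here; even just caching the raw lines shows the baseline cost:
with open('large_output_file.txt') as f:  # placeholder file name
    lines = f.readlines()
after = rss_mb()
size_mb = os.path.getsize('large_output_file.txt') / 1024**2
print(f'memory increase: {after - before:.1f} MB for a {size_mb:.1f} MB file')
```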

I’m working on a new parser that can encounter big output files, over 1 GB (long MD trajectories for large systems with forces, velocities, etc.), and I’m looking for options to reduce the memory consumption (besides writing a custom readline-based parser).

This is also a problem for the XML parser, see: Heavy memory usage for large vasprun.xml files · Issue #12 · nomad-coe/nomad-parser-vasp · GitHub

Never mind, this is likely not a problem of UnstructuredTextFileParser but a general Python feature; I never realized a simple integer takes 28 bytes in Python. I’ll try to make heavier use of numpy, let’s see if it helps…
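To illustrate the per-object overhead (numbers are for 64-bit CPython; the numpy comparison is what I’m hoping to exploit):

```python
import sys
import numpy as np

# A small Python int carries full object overhead:
print(sys.getsizeof(1))    # 28 bytes on 64-bit CPython
print(sys.getsizeof(1.0))  # 24 bytes for a float object

# A list of a million floats pays that overhead per element,
# plus 8 bytes per pointer in the list itself:
values = [float(i) for i in range(1_000_000)]
print(sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values))  # ~32 MB

# The same data as a numpy array is just 8 bytes per element:
arr = np.array(values)
print(arr.nbytes)  # 8_000_000 bytes
```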

Actually, it looks like UnstructuredTextFileParser caches the file in memory for parsing, and as far as I can see this in-memory representation is not freed/released after parsing.

This might be intentional for some use case I don’t understand right now (like running parse twice?), but when one uses UnstructuredTextFileParser the same way the example parser does, i.e., calling .parse() once at the beginning and later only calling .get() for the parsed quantities, the file stays in memory the whole time.
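In code, the pattern I mean looks like this (constructor arguments roughly as in the example parser; quantities stands for whatever list of Quantity definitions the parser uses):

```python
parser = UnstructuredTextFileParser(mainfile='output.out', quantities=quantities)
parser.parse()                  # parse everything up front
energy = parser.get('energy')   # later calls only read the cached results...
forces = parser.get('forces')   # ...but the raw file also stays in memory
```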

I tried adding del self._file_handler at the end of the TextParser(FileParser) parse() method. Nothing seems to break right away and the memory is freed properly, so maybe it could be as simple as that?
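Concretely, the change I tried was along these lines (a sketch from my reading of the code, not a tested patch; I’m assuming _file_handler is where the cached file contents live):

```python
class TextParser(FileParser):  # FileParser as in nomad's file_parser module
    def parse(self, key=None):
        ...  # existing parsing logic that fills the parsed quantities
        # new: drop the cached file contents once parsing is done
        del self._file_handler
        return self
```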

Thanks for your investigation. @ladinesa has to decide if there is any downside/intention to this.

There is probably more room for improvement. Out of fear of premature optimisation, we haven’t delved much into this potential yet. For example, the Metainfo serialisation creates a JSON representation of the Archive in memory. That’s basically another potential doubling of data in memory and should be replaced with direct writing to an output stream. Maybe there is also room to tune the file memory mapping; its original goal was to avoid having the whole file in memory.
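To make the serialisation point concrete: with the standard json module, the difference is between materialising the full string and streaming straight into the file object (the dict here is just a stand-in for a serialised Archive):

```python
import json

archive_dict = {'run': {'calculation': [{'energy': -1.0}] * 1000}}  # stand-in data

# current approach: the complete JSON string exists in memory next to the dict
with open('archive.json', 'w') as f:
    f.write(json.dumps(archive_dict))

# streaming alternative: json.dump encodes incrementally into the file object
with open('archive.json', 'w') as f:
    json.dump(archive_dict, f)
```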

On the other hand, UnstructuredTextFileParser is intentionally not scalable. It splits parsing and transforming the parsed information into an Archive into two separate steps, which requires an in-memory intermediate representation of all information. In most cases this is fine, and this design choice was made to simplify the code structure. If you expect large file sizes, a different approach might be necessary.

Currently, we see two extremes for NOMAD parsers. One is “small” ASCII files, which is what we wrote UnstructuredTextFileParser and the Metainfo for. The other is “large” (binary) files (e.g. HDF5). For the latter case, we argue that no Metainfo/Archive representation should be created for the actual data (e.g. large MD system trajectories, 4D STEM images). If your data goes into the upper GB or lower TB region, you’ll probably need an optimised and specialised data format (and tools) to handle it. Here, parsing and Metainfo should only be used to cover metadata, not the actual data itself. We plan to have reference support between Archives and HDF5 files in the future to make this happen.
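A sketch of what that split could look like with h5py (the Archive-to-HDF5 referencing is not implemented yet, so the “reference” below is just a stored path):

```python
import h5py
import numpy as np

n_steps, n_atoms = 1_000, 100
positions = np.random.rand(n_steps, n_atoms, 3)  # stand-in for parsed MD positions

# the large numerical data goes into an HDF5 file on disk ...
with h5py.File('trajectory.h5', 'w') as f:
    f.create_dataset('positions', data=positions, compression='gzip')

# ... while the Archive side would only carry metadata plus a reference,
# e.g. (hypothetical layout, since the reference support does not exist yet):
archive_metadata = {
    'n_steps': n_steps,
    'n_atoms': n_atoms,
    'positions_ref': 'trajectory.h5#/positions',
}
```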

We are currently not planning any solutions in between those extremes. It is conceivable that you could write a “streaming” parser, maybe with an HDF5-backed Metainfo (not implemented yet) or something similar that would allow writing data parsed from an input stream directly into an Archive on disk. But currently, we have no direct use case for this.
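Such a streaming parser could, for example, read the input line by line and append directly into a chunked, resizable HDF5 dataset, so no full intermediate representation is ever held in memory (line format and file names are made up):

```python
import h5py

with h5py.File('archive_data.h5', 'w') as out:
    # resizable dataset, so it can grow while we stream through the input
    forces = out.create_dataset('forces', shape=(0, 3), maxshape=(None, 3),
                                chunks=True, dtype='f8')
    with open('md_output.txt') as inp:       # hypothetical MD output file
        for line in inp:
            if line.startswith('force:'):    # made-up line format
                row = [float(x) for x in line.split(':', 1)[1].split()]
                forces.resize(forces.shape[0] + 1, axis=0)
                forces[-1] = row             # in practice one would buffer rows
```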

Yeah, it is tricky. Many of the DFT codes go back many years, so the output format itself was created at a time when a code could run a few tens or hundreds of atoms for a single self-consistent loop. Now you can run hundreds or sometimes even thousands of atoms for thousands of MD steps with standard DFT, and the output volume scales accordingly. So most of the calculations for a given DFT code may be a few MB, and then you can encounter output of a few GB.

In my specific case I was hoping to have the full MD trajectories and forces in NOMAD for later use with machine-learned force fields. I don’t care about the parsing speed too much; I just need to stay clear of the OOM killer :slight_smile: while keeping reasonable celery concurrency so as not to hurt parsing speed for everything else. And I think reasonable memory consumption would be something like 2 times the largest file size. Hopefully that should not be too difficult.

Regarding UnstructuredTextFileParser, after reading the code for a second time (but I don’t claim any real understanding), it seems it allows calling parse with just a specific Quantity key. In that case it makes sense to keep the file in memory. However, if one calls parse without any arguments, then all the quantities are parsed, i.e., we have extracted all we can from the file, and then it does not make sense to keep the file in memory, IMO. Anyway, this doesn’t need a solution on the UnstructuredTextFileParser level; I can easily free the memory from the parser itself.
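Something like this on the calling side, i.e., dropping the cached contents after a full parse (assuming the _file_handler attribute, name taken from my reading of the code, and the same hypothetical constructor arguments as above):

```python
out_parser = UnstructuredTextFileParser(mainfile='output.out', quantities=quantities)
out_parser.parse()                 # no key: everything is extracted in one go
forces = out_parser.get('forces')  # results remain available from the cache
out_parser._file_handler = None    # free the in-memory copy of the file
```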