Thanks for your investigation. @ladinesa has to decide if there is any downside/intention to this.
There is probably more room for improvement. For fear of premature optimisation, we haven’t delved much into this potential yet. For example, the Metainfo serialisation creates a JSON representation of the Archive in memory. That is basically another potential doubling of the data in memory and should be replaced with writing directly to an output stream. There may also be more room to tune the file memory mapping, whose original goal is to avoid having the whole file in memory.
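To illustrate the serialisation point, here is a minimal sketch using plain `json` and a hypothetical archive dict (this is not the actual Metainfo API, just the general pattern):

```python
import io
import json

# Hypothetical archive data; stands in for an Archive's section tree.
archive = {"run": {"program": "demo", "energies": [1.0, 2.0, 3.0]}}

# Current approach: build the full JSON string in memory first, then
# write it out -- this holds a complete copy alongside the Archive.
text = json.dumps(archive)

# Streaming approach: serialise directly into the output stream
# (a file in practice; StringIO here), avoiding the full string copy.
buffer = io.StringIO()
json.dump(archive, buffer)

assert buffer.getvalue() == text
```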
On the other side, UnstructuredTextFileParser is intentionally not scalable. It separates parsing and transforming the parsed information into an Archive into two distinct steps, which requires an in-memory intermediate representation of all the information. In most cases this is fine, and this design choice was made to simplify the code structure. If you expect large file sizes, a different approach might be necessary.
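A rough sketch of that two-step design (the input format, keys, and dict-based “Archive” are all hypothetical; the real parser uses quantity definitions and Metainfo sections):

```python
import re

# Hypothetical plain-text output from some simulation code.
raw = """\
natoms = 3
energy = -1.5
energy = -1.2
"""

# Step 1: parse everything into an in-memory intermediate representation.
# This is the part that does not scale: all matched values live in memory
# before any Archive section is built.
parsed = {}
for key, value in re.findall(r"(\w+) = (\S+)", raw):
    parsed.setdefault(key, []).append(float(value))

# Step 2: transform the intermediate representation into an Archive-like
# structure (a nested dict standing in for Metainfo sections).
archive = {
    "system": {"n_atoms": int(parsed["natoms"][0])},
    "calculations": [{"energy": e} for e in parsed["energy"]],
}
```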
Currently, we see two extremes for NOMAD parsers. One is “small” ASCII files, for which we wrote UnstructuredTextFileParser and the Metainfo. The other is “large” (binary) files (e.g. HDF5). For the latter case, we argue that no Metainfo/Archive representation should be created for the actual data (e.g. large MD system trajectories, 4D STEM images). If your data goes into the upper GB or lower TB region, you’ll probably need an optimised and specialised data format (and tools) to handle it. Here, parsing and Metainfo should only cover the metadata, not the actual data itself. We plan to support references between Archives and HDF5 files in the future to make this happen.
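The metadata-only idea looks roughly like this. The file name, reference syntax, and keys are hypothetical, since the Archive-to-HDF5 reference support is not implemented yet:

```python
import json

# The Archive (a plain dict here) records metadata and a *reference* to
# the data file, but never loads the trajectory itself.
archive = {
    "method": {"md_ensemble": "NVT", "n_steps": 1_000_000},
    "system": {"n_atoms": 250_000},
    # Reference into an external HDF5 file instead of embedded arrays;
    # resolving it is left to specialised tools (e.g. h5py) at read time.
    "trajectory_ref": "trajectory.h5#/positions",
}

# The serialised Archive stays tiny regardless of the trajectory size.
serialised = json.dumps(archive)
```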
We are currently not planning any solutions between those extremes. It is conceivable that you could write a “streaming” parser, maybe with an HDF5-backed Metainfo (not implemented yet) or something similar that would allow writing data parsed from an input stream directly into an Archive on disk. But currently, we have no direct use case for this.