Lately I have been running a large number of calculations on very large simulation boxes, 2 million atoms and more (every output file takes between 100 and 250 MB). The problem I am facing now is how to store the output of these calculations (data.dat files in atomic style) efficiently, possibly achieving some degree of compression of my data.
The calculations are tagged with appropriate metadata, so I was thinking about a structured database solution of some sort.
Do you have any suggestions? Any input is welcome!
If you have your data already organized the way you say, a database doesn't buy you anything, since a file system is a structured database of sorts. Using a full-fledged database just hides your data behind its query language and adds additional storage needs for indexing.
The first thing you need to worry about is creating additional copies and recording checksums, so you can detect whether the data is damaged. If you want to save space, you can compress the individual files or store the data on a file system with built-in compression like btrfs or ZFS.
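To make that concrete, here is a minimal sketch of the compress-then-checksum idea using standard Unix tools. The `*.dat` pattern and the file layout are assumptions; adapt them to your own naming scheme.

```shell
# Compress each data file in place (gzip produces data.dat.gz and
# removes the original):
find . -name '*.dat' -type f -exec gzip {} +

# Record checksums of the compressed files:
find . -name '*.dat.gz' -type f -exec sha256sum {} + > checksums.sha256

# Later (or on a backup copy), verify nothing has been silently damaged:
sha256sum -c checksums.sha256
```

Checksumming after compression means the verification step covers exactly the files you will actually keep and copy around.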
Hello @akohlmey, thanks a lot for your suggestions.
Do you have specific recommendations of libraries that would help me with that?
I still think that putting the calculation outputs into a database would be helpful as my data are arranged as calculation folders (each with a metadata file), and navigating them is not very easy.
Thanks,
Lorenzo
How would that be any different with a database?
Also, databases are inherently less reliable: if the database is corrupted, everything is lost.
If you have difficulties locating specific files, you can just use standard Unix/Linux procedures.
The simplest would be to run the command ls -R > ls-R (watch out for the spaces) in the top folder and then you can easily look up specific files or folders by searching the ls-R file. If you also want to record the file sizes, use ls -lR > ls-lR.
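For example, once the listing file exists you can search it with grep instead of walking the directory tree again (the pattern "data" here is just an illustration):

```shell
# Build the listing once, in the top folder:
ls -lR > ls-lR

# Then look up files by name fragment, e.g. everything containing "data":
grep data ls-lR
```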
An alternative would be to install and configure the locate program (or mlocate or plocate), if it is not already installed and configured. This will regularly update an index of all files and then you can use locate <string> to locate all files that contain <string>.
You should be able to compress the data files seamlessly by simply adding a .gz suffix to the filename in the LAMMPS input script (provided LAMMPS was compiled with the zlib library, which it usually is). Moreover, you may want to round the atoms' coordinates and velocities to 4-5 decimal places, which should decrease the file size further. If your structures are saved at 0 K, you can also delete the entire Velocities section.
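If you want to round coordinates after the fact rather than at write time, a quick awk pass works. This is a sketch only: it assumes an "id type x y z" column layout in the body of the file, and header or section lines with fewer than five fields pass through unchanged. Check it against your actual file format before trusting it.

```shell
# Round columns 3-5 (x y z, assuming "id type x y z" lines) to 4 decimals,
# then compress the result. File names are placeholders.
awk '{ if (NF >= 5) { $3 = sprintf("%.4f", $3);
                      $4 = sprintf("%.4f", $4);
                      $5 = sprintf("%.4f", $5) }
       print }' data.dat > data_rounded.dat
gzip data_rounded.dat
```

Truncated coordinates compress noticeably better because gzip finds more repeated byte patterns in the shorter, more regular numbers.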
If you only need positions, LAMMPS can natively output .xtc and .dcd dumps which are optimised for (somewhat lossy) position-only trajectory storage.
Note that LAMMPS can write to .xtc and .dcd without further modifications, but for LAMMPS to read .xtc and .dcd you need to compile with the MOLFILE package and point the program at runtime to a suitable Molfile plugin file (it is not difficult to do). Most modern Python analysis libraries, like MDAnalysis, will read both file formats natively.
If you are having problems organising your data, as in knowing what goes where, then you have two problems. The first is not having a standardised workflow – if you had one, you could easily organise your simulations simply by folder.
Then you would have to set up a file structure to organise those results. I often set up projects with the following structure:
where each scripts folder uses a LAMMPS script with the setup:
variable pname index 42 # default value
shell mkdir ../param${pname}
shell cd ../param${pname}
log log.lammps
This fits the very common use case of launching “one-parameter families” of simulations, automatically organising all outputs by parameter value using:
cd scripts
for value in 1 2 5 8; do lmp -i generic.lmp -var pname $value; done
If you need more complicated workflow management I recommend signac as a simple Python library for such things – you can hook it up to cluster job queueing systems reasonably well.
My strategy (which is not an original one at all, and not good in every situation) is to avoid storing dump files as much as possible. So far I have probably produced something like 100 TB of data. It is simply a waste of disk space, expensive, and hard to handle such an amount.
Once I perform a run, I immediately do post-processing, extracting the data I need and storing it in a much reduced volume. However, I always keep the log files, so in some rare cases I can repeat the computation later.
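A minimal sketch of that workflow, assuming a run directory containing a large dump file: the awk extraction here just keeps atom id and z coordinate (assuming "id type x y z" lines), as a stand-in for whatever reduced quantity you actually need. All file names are placeholders.

```shell
# After a run finishes: reduce the dump, keep only the reduced data and
# the log, drop the original.
# Extract atom id and z from body lines with at least 5 fields:
awk 'NF >= 5 { print $1, $5 }' dump.atom | gzip > z_coords.dat.gz

# The full dump is no longer needed; log.lammps is kept so the run can be
# repeated if the reduced data ever turns out to be insufficient:
rm dump.atom
```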