Lately I have been running a large number of calculations on very large simulation boxes with 2 million atoms or more (each output file takes between 100 and 250 MB). The problem I am facing now is how to store the output of these calculations (data.dat files in atomic style) efficiently, possibly with some degree of compression.
The calculations are tagged with appropriate metadata, so I was thinking about some kind of structured database solution.
Do you have any suggestions? Any input is welcome!
If you have your data already organized the way you say, a database doesn’t buy you anything, since a file system is a structured database of sorts. Using a full-fledged database just hides your data behind its query language and adds storage overhead for indexing.
The first thing you need to worry about is creating additional copies and obtaining checksums, so you can detect if the data is damaged. If you want to save space, you can compress the individual files or store the data on a file system with built-in compression like Btrfs or ZFS.
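As an illustration of the compress-and-checksum idea, here is a minimal Python sketch using only the standard library. The function names `sha256_of` and `compress_with_checksum` are made up for this example, and the choice of gzip and SHA-256 is an assumption; any other compressor or hash would work the same way.

```python
import gzip
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compress_with_checksum(path):
    """Write <name>.gz next to the file and a <name>.gz.sha256 sidecar.

    Hypothetical helper for this sketch: the original file is left in place,
    so delete it yourself once you have verified the compressed copy.
    """
    path = Path(path)
    gz_path = path.with_name(path.name + ".gz")
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    checksum = sha256_of(gz_path)
    # "<hash>  <file name>" is the line format that `sha256sum -c` reads.
    sum_path = gz_path.with_name(gz_path.name + ".sha256")
    sum_path.write_text(f"{checksum}  {gz_path.name}\n")
    return gz_path, checksum
```

Calling `compress_with_checksum("data.dat")` would leave `data.dat.gz` and `data.dat.gz.sha256` in the same folder. Recomputing the digest later and comparing it with the stored one tells you whether the archived copy has been damaged.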
Hello @akohlmey, thanks a lot for your suggestions.
Do you have specific recommendations for libraries that would help me with that?
I still think that putting the calculation outputs into a database would be helpful as my data are arranged as calculation folders (each with a metadata file), and navigating them is not very easy.
Thanks,
Lorenzo
How would that be any different with a database?
Also, databases are inherently less reliable: if the database is corrupted, everything stored in it is lost, whereas a damaged file in a folder tree affects only that one file.
If you have difficulties locating specific files, you can just use standard Unix/Linux procedures.
The simplest would be to run the command ls -R > ls-R (watch out for the spaces) in the top folder and then you can easily look up specific files or folders by searching the ls-R file. If you also want to record the file sizes, use ls -lR > ls-lR.
An alternative would be to install and configure the locate program (or mlocate or plocate), if it is not already installed and configured. This will regularly update an index of all files and then you can use locate <string> to list all files whose path contains <string>.
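If you would rather build and search such an index from a script, the same idea can be sketched in Python. This is only an illustration: `build_index`, `search_index` and the tab-separated index layout are assumptions made up for this example, not an existing tool.

```python
import os


def build_index(top, index_path):
    """Write one 'size<TAB>relative/path' line per file below `top`.

    Keep `index_path` outside `top`, otherwise the index lists itself.
    """
    count = 0
    with open(index_path, "w") as out:
        for folder, _subdirs, files in os.walk(top):
            for name in sorted(files):
                full = os.path.join(folder, name)
                rel = os.path.relpath(full, top)
                out.write(f"{os.path.getsize(full)}\t{rel}\n")
                count += 1
    return count


def search_index(index_path, text):
    """Return the indexed relative paths that contain `text`."""
    matches = []
    with open(index_path) as fh:
        for line in fh:
            _size, rel = line.rstrip("\n").split("\t", 1)
            if text in rel:
                matches.append(rel)
    return matches
```

Like the ls-R file, the index is a snapshot and has to be regenerated after new calculations are added.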
You should be able to compress the data files seamlessly by simply adding a .gz suffix to the filename in the LAMMPS input script (if LAMMPS was compiled with the zlib library, which it usually is). Moreover, you may want to round the atoms’ coordinates and velocities to 4-5 decimal places. This should decrease the size of the files further. If your structures are saved at 0 K, you can also delete the entire Velocities section.
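For files that already exist, the rounding and the removal of the Velocities section can be done in a post-processing step. The following Python sketch assumes atom style atomic (each Atoms line is atom-ID, atom-type, x, y, z, optionally followed by image flags). The names `shrink_data_lines` and `shrink_data_file` and the list of section keywords are assumptions for this example, so check on a small file that LAMMPS still reads the result.

```python
import gzip

# Section keywords expected in an atom_style atomic data file.
# This set is an assumption of the sketch; extend it for other atom styles.
SECTION_NAMES = {"Masses", "Atoms", "Velocities", "Pair Coeffs", "PairIJ Coeffs"}


def shrink_data_lines(lines, decimals=5, drop_velocities=True):
    """Yield data file lines with rounded x, y, z and, optionally, no Velocities."""
    section = None
    for line in lines:
        body = line.split("#", 1)[0].strip()  # ignore trailing comments
        if body in SECTION_NAMES:
            section = body
        if drop_velocities and section == "Velocities":
            continue  # skip the keyword line and everything in that section
        fields = line.split()
        if section == "Atoms" and body != "Atoms" and len(fields) >= 5:
            # atomic style: atom-ID atom-type x y z [image flags]
            xyz = [f"{float(value):.{decimals}f}" for value in fields[2:5]]
            line = " ".join(fields[:2] + xyz + fields[5:]) + "\n"
        yield line


def shrink_data_file(src, dst, **options):
    """Stream `src` through shrink_data_lines into the gzip file `dst`."""
    with open(src) as fin, gzip.open(dst, "wt") as fout:
        fout.writelines(shrink_data_lines(fin, **options))
```

For example, `shrink_data_file("data.dat", "data.dat.gz", decimals=4)` would write a compressed copy with coordinates rounded to 4 decimal places. Rounding can move an atom that sits very close to a box boundary slightly across it, which is one more reason to test the output before deleting the original.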