Lately I have been running a large number of calculations on very large simulation boxes, 2 million atoms and more (every output file takes between 100 and 250 MB). The problem I am facing now is how to store the output of these calculations (data.dat files in atomic style) efficiently, possibly achieving some degree of compression of my data.
The calculations are tagged with appropriate metadata, so I was thinking about a structured database solution of some sort.
Do you have any suggestions? Any input is welcome!
If you have your data already organized the way you say, a database doesn't buy you anything, since a file system is a structured database of sorts. Using a full-fledged database just hides your data behind its query language and adds additional storage needs for indexing.
The first thing you need to worry about is creating additional copies and recording checksums, so you can detect whether the data is damaged. If you want to save space, you can compress the individual files or store the data on a file system with built-in compression like btrfs or ZFS.
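To make that concrete, here is a minimal sketch of the compress-then-checksum idea using standard Unix tools. The `*.dat` pattern and the file layout are assumptions; adapt them to your own naming scheme.

```shell
# Compress each data file in place (gzip produces data.dat.gz and
# removes the original):
find . -name '*.dat' -type f -exec gzip {} +

# Record checksums of the compressed files:
find . -name '*.dat.gz' -type f -exec sha256sum {} + > checksums.sha256

# Later (or on a backup copy), verify nothing has been silently damaged:
sha256sum -c checksums.sha256
```

Checksumming after compression means the verification step covers exactly the files you will actually keep and copy around.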
Hello @akohlmey, thanks a lot for your suggestions.
Do you have specific recommendations of libraries that would help me with that?
I still think that putting the calculation outputs into a database would be helpful as my data are arranged as calculation folders (each with a metadata file), and navigating them is not very easy.
Thanks,
Lorenzo
How would that be any different with a database?
Also, databases are inherently less reliable: if the database is corrupted, everything is lost.
If you have difficulties locating specific files, you can just use standard Unix/Linux procedures.
The simplest would be to run the command ls -R > ls-R (watch out for the spaces) in the top folder and then you can easily look up specific files or folders by searching the ls-R file. If you also want to record the file sizes, use ls -lR > ls-lR.
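For example, once the listing file exists you can search it with grep instead of walking the directory tree again (the pattern "data" here is just an illustration):

```shell
# Build the listing once, in the top folder:
ls -lR > ls-lR

# Then look up files by name fragment, e.g. everything containing "data":
grep data ls-lR
```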
An alternative would be to install and configure the locate program (or mlocate or plocate), if it is not already installed and configured. This will regularly update an index of all files and then you can use locate <string> to locate all files that contain <string>.
You should be able to compress the data files seamlessly by simply adding a .gz suffix to the filename in the LAMMPS input script (provided LAMMPS was compiled with the zlib library, which it usually is). Moreover, you may want to round the atoms' coordinates and velocities to 4-5 decimal places, which should decrease the file size further. If your structures are saved at 0 K, you can also delete the entire Velocities section.
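If you want to round coordinates after the fact rather than at write time, a quick awk pass works. This is a sketch only: it assumes an "id type x y z" column layout in the body of the file, and header or section lines with fewer than five fields pass through unchanged. Check it against your actual file format before trusting it.

```shell
# Round columns 3-5 (x y z, assuming "id type x y z" lines) to 4 decimals,
# then compress the result. File names are placeholders.
awk '{ if (NF >= 5) { $3 = sprintf("%.4f", $3);
                      $4 = sprintf("%.4f", $4);
                      $5 = sprintf("%.4f", $5) }
       print }' data.dat > data_rounded.dat
gzip data_rounded.dat
```

Truncated coordinates compress noticeably better because gzip finds more repeated byte patterns in the shorter, more regular numbers.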
If you only need positions, LAMMPS can natively output .xtc and .dcd dumps which are optimised for (somewhat lossy) position-only trajectory storage.
Note that LAMMPS can write to .xtc and .dcd without further modifications, but for LAMMPS to read .xtc and .dcd you need to compile with the MOLFILE package and point the program at runtime to a suitable Molfile plugin file (it is not difficult to do). Most modern Python analysis libraries, like MDAnalysis, will read both file formats natively.
If you are having problems organising your data, as in knowing what goes where, then you have two problems. The first is not having a standardised workflow – if you had one, you could easily organise your simulations simply by folder.
Then you would have to set up a file structure to organise those results. I often set up projects with the following structure:
where each scripts folder uses a LAMMPS script with the setup:
variable pname index 42 # default value
shell mkdir ../param${pname}
shell cd ../param${pname}
log log.lammps
This fits the very common use case of launching “one-parameter families” of simulations, automatically organising all outputs by parameter value using:
cd scripts
for value in 1 2 5 8; do lmp -i generic.lmp -var pname $value; done
If you need more complicated workflow management I recommend signac as a simple Python library for such things – you can hook it up to cluster job queueing systems reasonably well.
My strategy (which is not an original one at all, and not good in every situation) is to avoid storing dump files as much as possible. So far I have probably produced something like 100 TB of data. It is simply a waste of disk space, expensive, and hard to handle such an amount.
Once I perform a run, I immediately do post-processing, extracting the data I need and storing it in a much reduced volume. However, I always keep the log files, so in some rare cases I can repeat the computation later.
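A minimal sketch of that workflow, assuming a run directory containing a large dump file: the awk extraction here just keeps atom id and z coordinate (assuming "id type x y z" lines), as a stand-in for whatever reduced quantity you actually need. All file names are placeholders.

```shell
# After a run finishes: reduce the dump, keep only the reduced data and
# the log, drop the original.
# Extract atom id and z from body lines with at least 5 fields:
awk 'NF >= 5 { print $1, $5 }' dump.atom | gzip > z_coords.dat.gz

# The full dump is no longer needed; log.lammps is kept so the run can be
# repeated if the reduced data ever turns out to be insufficient:
rm dump.atom
```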