Topic of the Month: YAML support in LAMMPS

akohlmey · April 29, 2022, 6:20pm

Dear LAMMPS users and developers,

It is high time to start another open-ended discussion. This time I would like to discuss a recently added feature in LAMMPS: support for creating output in YAML format for easy reading and post-processing with Python or other scripting languages. This originally grew out of discussions on adapting the example files in the LAMMPS distribution for regression testing.

One of the challenges is to reliably extract only the thermodynamic data for creating a reference, and to store and analyze it efficiently. There is a logfile analyzer tool written in Python as part of Pizza.py that can do the extraction part, but it is rather fragile and not fully compatible with the needs of testing. Based on what is written in the first part of "8.3.8. Output structured data from LAMMPS" in the LAMMPS documentation, the idea arose to add a new thermodynamic output style that produces the YAML format automatically. The structure of YAML files also makes it easier to extract only the content in YAML syntax and skip over content that is not thermodynamic output.

Another contributing factor was the request for an option to rename the column headers, especially for computes and fixes, so that they are more descriptive of the data they contain. Combined with some necessary refactoring to modernize the code and get rid of some complexity, those changes were all implemented recently.

There are two ways to enable YAML style thermo output: a) use thermo_style yaml, which gives you a fixed set of properties similar to the default output, or b) use thermo_style custom followed by thermo_modify line yaml.

Now extracting and plotting the data is extremely simple in Python when using the pyyaml, pandas, and matplotlib modules.

import re, yaml
import pandas as pd
import matplotlib.pyplot as plt

# use the fast C-based loader if available, else fall back to pure Python
try:
    from yaml import CSafeLoader as Loader
except ImportError:
    from yaml import SafeLoader as Loader

# extract YAML format part from log file
docs = ""
with open("log.lammps") as f:
    for line in f:
        m = re.search(r"^(keywords:.*$|data:$|---$|\.\.\.$|  - \[.*\]$)", line)
        if m: docs += m.group(0) + '\n'
thermo = list(yaml.load_all(docs, Loader=Loader))

# convert list of lists to a pandas data frame and plot the first run
df = pd.DataFrame(data=thermo[0]['data'], columns=thermo[0]['keywords'])
ax = df.plot(x='Step', y=['E_bond', 'E_angle', 'E_dihed', 'E_impro'], ylabel='Energy in kcal/mol')
plt.savefig('thermo_bondeng.png')

[Image: thermo_bondeng]
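
The script above only uses the first run in the log (thermo[0]). For logs with several run commands, the same line patterns can be used to split the YAML output into one block per run. The following is only a sketch under assumptions: split_thermo_blocks is a hypothetical helper name, and the log excerpt is made up to resemble the YAML thermo layout (it is not output from a real run).

```python
import re

# Same line patterns as in the script above: document start/end markers,
# the keywords line, the data: line, and the indented data rows.
YAML_LINE = re.compile(r"^(keywords:.*$|data:$|---$|\.\.\.$|  - \[.*\]$)")

def split_thermo_blocks(lines):
    """Return one YAML text block per run found in an iterable of log lines."""
    blocks, current = [], None
    for line in lines:
        line = line.rstrip("\n")
        if not YAML_LINE.match(line):
            continue                      # skip non-YAML log output
        if line == "---":
            current = [line]              # a new run starts
        elif current is not None:
            current.append(line)
            if line == "...":             # the run is complete
                blocks.append("\n".join(current) + "\n")
                current = None
    return blocks

# Made-up log excerpt for illustration only (values are not from a real run).
sample_log = """\
LAMMPS (hypothetical version line)
thermo_style yaml
run 20
---
keywords: ['Step', 'Temp', 'PotEng', ]
data:
  - [0, 1.5, -6.25, ]
  - [10, 1.25, -6.0, ]
...
Loop time of 0.01 on 1 procs
run 10
---
keywords: ['Step', 'Temp', 'PotEng', ]
data:
  - [20, 1.0, -5.75, ]
...
"""

blocks = split_thermo_blocks(sample_log.splitlines())
print(len(blocks))
```

Each returned block can then be passed to a YAML loader on its own. Note that in this sketch a run that was cut off before its closing "..." line is silently dropped.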

In combination with the thermo_modify colname option to rename columns, creating a high-quality plot of the thermodynamic data should be very easy. Certainly easier than with the older tools.

In addition, we now also have a dump style yaml whose output can be imported in a similar fashion, and work on supporting YAML format output in fix ave/time and other averaging fixes has started.

What is most compelling to me about these features is that they are built entirely on well-supported, widely used software (pyyaml, pandas, numpy, matplotlib), and since pandas uses numpy storage underneath, it is also very efficient and fast for processing large amounts of data.

What do people think about this?
Do you see any other applications that can be built on top of this, or parts of LAMMPS that could benefit from interfacing with YAML format data?
Are there alternatives worth looking into?

akohlmey · April 29, 2022, 9:16pm

To demonstrate the “portability” of using YAML as an intermediate format, here is an example for extracting YAML data from the log in Perl:

use strict;
use warnings;
use YAML::XS;

open(LOG, "log.lammps") or die("could not open log.lammps: $!");
my $file = "";
while(my $line = <LOG>) {
    if ($line =~ /^(keywords:.*$|data:$|---$|\.\.\.$|  - \[.*\]$)/) {
        $file .= $line;
    }
}
close(LOG);

# convert YAML
my $thermo = Load $file;

# convert hash references members to real arrays to simplify the following code
my @keywords = @{$thermo->{'keywords'}};
my @data = @{$thermo->{'data'}};

# print first two columns
print("$keywords[0] $keywords[1]\n");
foreach (@data) {
    print("${$_}[0]  ${$_}[1]\n");
}

P.S.: Does anybody know of something equivalent to matplotlib in Perl?

hothello · May 4, 2022, 8:25pm

Hi Axel,

this is fantastic news! Data management is a big issue and the YAML support is going to facilitate data flow in complex simulations. I guess this will also facilitate the use of semantic technology for MD simulations carried out with LAMMPS.

For a quick check, I usually pipe commands directly to gnuplot, e.g.:

open(GP, '| /usr/bin/gnuplot') or die("could not start gnuplot: $!");
syswrite(GP, "p '$tmp_file' w l\npause -1\n");

rkingsbury · May 23, 2022, 3:17pm

@akohlmey , thank you very much for making us aware of this exciting new feature (and apologies for my belated reply). I work with several colleagues on building high-throughput workflows based on LAMMPS, and the ability to output data in a structured format like this will be immensely helpful.

I personally think YAML is an excellent choice for this. It may also be worthwhile to support .json. I believe there are some differences in how quickly one can load and compress YAML vs. JSON using standard Python tools, and now that you have .yaml output support, I think conversion to .json should be straightforward. Just for information, in the Materials Project we rely heavily on .json and have recently moved towards the orjson library rather than the standard json library because it is more standards-compliant and substantially faster. I’d encourage you to test against orjson if you decide to add .json support at a later time.
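
As a rough idea of what such a conversion could look like on the post-processing side, here is a minimal sketch using PyYAML and the standard json module. The YAML text is a made-up block resembling the thermo output (not data from a real run), and the variable names are only for illustration.

```python
import json
import yaml  # PyYAML

# Made-up YAML thermo block for illustration (values are not from a real run).
yaml_text = """\
---
keywords: ['Step', 'Temp', 'PotEng', ]
data:
  - [0, 1.5, -6.25, ]
  - [10, 1.25, -6.0, ]
...
"""

# Parse the YAML document into plain Python dicts and lists ...
thermo = yaml.safe_load(yaml_text)

# ... and serialize the same structure as JSON text.
json_text = json.dumps(thermo)

# Reading the JSON back should give the identical structure.
roundtrip = json.loads(json_text)
print(json_text)
```

If speed matters, orjson.dumps(thermo) could presumably be swapped in for json.dumps; it returns bytes rather than str.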

Could you elaborate on

In addition, we now also have a dump style yaml whose output can be imported in a similar fashion, and work on supporting YAML format output in fix ave/time and other averaging fixes has started.

If I understand correctly, there will soon be an option to export the trajectory data itself to .yaml, is that right? That would also be very exciting and immensely helpful to automated workflow development.

akohlmey · May 23, 2022, 3:31pm

Probably not. YAML has the advantage of being a bit more forgiving and flexible in how the output may be formatted (which would explain the slower parsing), and that was crucial for implementing YAML output without many changes to the flow of control in LAMMPS. Mind you, the output is done directly with explicit formatting, not by using some library. This is necessary for efficiency reasons. You are welcome to look at the source code and make suggestions. If JSON output can be done in a similar fashion, there would be no big problem to add it.
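
To illustrate one way in which YAML is more forgiving for this kind of direct formatting, here is a small Python sketch. It is not the LAMMPS implementation (which is C++); format_thermo_row is a hypothetical helper, and the assumption that the trailing comma is a relevant example of the flexibility is mine. YAML flow sequences allow a comma after the last entry, while strict JSON parsers reject it.

```python
import json

def format_thermo_row(values):
    """Format one row as a YAML flow sequence using plain string formatting."""
    # Every value is followed by ", ", including the last one, so the writer
    # never needs to know which value is the final column (assumed layout).
    return "  - [" + "".join(f"{v}, " for v in values) + "]"

row = format_thermo_row([100, 0.5, -4.25])
print(row)

# The bracketed part is acceptable to a YAML parser, but Python's json module
# refuses the trailing comma.
flow = row[len("  - "):]
try:
    json.loads(flow)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False
print(json_ok)
```

A JSON writer done in the same style would have to treat the last column (and the last row) differently, which is a small but real change to how the output loop is structured.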

Dump style yaml and YAML format output for fix ave/time are available in the latest LAMMPS release.

The dump style implements a feature that is also available from the netcdf dump style and that is to include the thermo_style output in the dump as well. Examples for the output structure are given in the corresponding documentation pages.

akohlmey · May 23, 2022, 3:56pm

If performance in processing data matters, then text-based formats should be avoided altogether.
In LAMMPS you can configure writing NetCDF style output, or there is a dedicated HDF5 based dump style. Both offer standards-compliant and much faster processing of data (writing and reading).

rkingsbury · May 24, 2022, 8:51pm

I see; I had assumed it was done using some library, so it makes sense that adding .json support would be a bigger effort.

Very exciting! Was not aware.

Thanks; I have not tried these. I see that it’s possible to include thermo data in the NetCDF file along with trajectory info too; that’s also quite appealing.