YAML support in Ovito

Oystein · December 29, 2023, 9:33pm

Hello

LAMMPS can output trajectory files in yaml format now, which is quite convenient for many purposes. However, it seems Ovito cannot read yaml format trajectory files yet. Any plans on implementing this in Ovito?

stukowski · January 1, 2024, 6:08pm

Until now, we did not know for sure whether this file format is already used by OVITO users. Thank you for asking. We are now working on it and it won’t be long until OVITO can load LAMMPS dump yaml files. I’ll keep you posted.

stukowski · January 2, 2024, 2:05pm

We now have a first preview version of OVITO providing support for the LAMMPS dump yaml format. Your feedback would be appreciated.

https://www.ovito.org/download/testing/ovito-basic-3.10.1-dev-HEAD-626b8a23-win64-dev3.10.1.exe
https://www.ovito.org/download/testing/ovito-basic-3.10.1-dev-HEAD-626b8a23-x86_64-dev3.10.1.tar.xz
https://www.ovito.org/download/testing/ovito-basic-3.10.1-dev-HEAD-626b8a23-macos-arm64-dev3.10.1.dmg

Oystein · January 2, 2024, 2:51pm

Great! Thanks a lot for the quick response.

I have tested the new preview version and it seems to be working well based on my very initial testing. However, things are generally going quite a bit slower than with regular dump files (made using the custom-style of the dump command), e.g. playing the video or some simple processing like generating trajectory lines. I don’t know the reason for this, or if it is simple to improve it. I do have the impression from working with other yaml files that they (the yaml library?) are a bit slower to process than data files read with plain Python.

stukowski · January 2, 2024, 3:36pm

Thanks for the quick feedback. I had expected that reading YAML files would be a bit slower than reading regular LAMMPS dump files (see also this discussion by Axel Kohlmeyer). But I haven’t measured it in benchmarks so far and I am surprised that there is such a big speed drop as you describe. I’ll have a closer look at it. What is the typical data size of your simulation snapshots (in terms of particle count and data column count)?

OVITO uses the Rapid YAML framework, which supposedly is very fast, to parse the YAML structure and then extract the specific information of the LAMMPS format from the in-memory representation. OVITO always reads one simulation timestep at a time, i.e., file sections delimited by "---" and "..." elements. When opening a trajectory dump file, OVITO first scans the entire YAML file to generate an index of byte offsets where individual frames start, for quick random access during animation playback.
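To illustrate the indexing idea described above, here is a minimal Python sketch. It is not OVITO's actual implementation (OVITO is written in C++ and its reader is not shown in this thread); the function names `index_frames` and `read_frame` and the sample data are made up for illustration. It assumes that every frame starts with a line containing only `---` and ends with a line containing only `...`.

```python
import io

def index_frames(stream):
    """Scan a binary stream once and return the byte offset of every
    line that consists solely of the YAML document start marker '---'."""
    offsets = []
    pos = 0
    for line in stream:                      # binary streams yield one line at a time
        if line.rstrip(b"\r\n") == b"---":
            offsets.append(pos)
        pos += len(line)
    return offsets

def read_frame(stream, offsets, i):
    """Jump straight to frame i and return its raw bytes, up to and
    including the '...' document end marker (or end of file)."""
    stream.seek(offsets[i])
    chunk = []
    for line in stream:
        chunk.append(line)
        if line.rstrip(b"\r\n") == b"...":
            break
    return b"".join(chunk)

# Hypothetical two-frame sample, only loosely shaped like a dump yaml file.
demo = (b"---\ntimestep: 0\nnatoms: 2\n...\n"
        b"---\ntimestep: 100\nnatoms: 2\n...\n")

offsets = index_frames(io.BytesIO(demo))
second_frame = read_frame(io.BytesIO(demo), offsets, 1)
print(offsets)
print(second_frame.decode())
```

The one-time scan touches every byte of the file, but afterwards any frame can be reached with a single seek. Only that frame's text then has to be handed to the YAML parser.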

Oystein · January 3, 2024, 8:16am

I was also surprised that it was so slow compared to my regular dump files.

This system contains 44184 atoms and the columns id, mol, element, type, x, y, z, ix, iy, iz (10 columns), which is quite normal for my systems. The yaml file is 5.1 GB which is significantly bigger than a corresponding custom-style dump file of 3.1 GB with the same data. I am guessing that is simply because of extra commas, square brackets and various other formatting?

stukowski · January 3, 2024, 8:58am

Yes, yaml dump files turn out bigger than regular dump files (because of the extra commas and brackets), which is not helpful for the loading performance. But the real limiting factor is not the increased disk I/O, but the significant work required to parse the yaml syntax in memory. The Rapid YAML framework already does a good job and is several times faster than any other yaml parser out there. However, according to its developer, the maximum achievable parsing speed is only approx. 150 MB/s (without disk I/O). This is due to the relatively complex syntax rules of the general YAML format that must be applied during processing.
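As a rough back-of-the-envelope illustration of what such a parsing rate implies (an estimate, not a measurement), the sketch below computes a lower bound on the time needed to parse a whole file once. It assumes 1 GB = 10^9 bytes, uses the 5.1 GB file size mentioned earlier in the thread, and ignores disk I/O and any post-processing. The helper name `min_parse_seconds` is made up for illustration.

```python
def min_parse_seconds(file_bytes, parse_rate_bytes_per_s=150e6):
    """Lower bound on pure parsing time if every byte has to pass
    through a parser that sustains the given throughput."""
    return file_bytes / parse_rate_bytes_per_s

# 5.1 GB yaml trajectory from earlier in the thread, 150 MB/s parser limit.
estimate = min_parse_seconds(5.1e9)
print(f"at least {estimate:.0f} s to parse the whole file once")
```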

I have made a few more optimizations in the OVITO file reader in order to at least quickly transform the YAML data into OVITO’s internal representation. But the loading time is by far dominated by the parsing of the YAML syntax. I’m afraid there’s no quicker way of reading these files, in principle. The YAML format is not made for performance.

I ran a benchmark with a simulation of 640k atoms, in which 12 data columns get dumped per atom. The time required by OVITO to read a single timestep varies considerably depending on the file format you use:

dump yaml: 940 ms
dump custom: 105 ms
dump netcdf: 18 ms

So, if you care about performance, you should use the netcdf dump format. In terms of file size the netcdf format is also the most compact one (followed closely by the regular dump custom format). Another advantage of the dump netcdf format is that it gives OVITO direct random access to all frames of the trajectory. No scanning/indexing of the file is necessary on first load.
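To put the three timings above into perspective, this small sketch converts them into relative slowdowns and into an upper bound on playback rate, assuming frame loading is the only cost per frame (rendering and modifiers are ignored). The numbers are the ones from the benchmark above; the variable names are just for illustration.

```python
# Per-frame load times from the 640k-atom benchmark, in milliseconds.
timings_ms = {"dump yaml": 940, "dump custom": 105, "dump netcdf": 18}

fastest = min(timings_ms.values())

# How many times slower each format loads compared to the fastest one.
slowdown = {fmt: t / fastest for fmt, t in timings_ms.items()}

# Upper bound on frames per second if loading were the only cost.
max_fps = {fmt: 1000.0 / t for fmt, t in timings_ms.items()}

for fmt in timings_ms:
    print(f"{fmt:12s} {slowdown[fmt]:5.1f}x slower, <= {max_fps[fmt]:5.1f} frames/s")
```

By this estimate, yaml frames load roughly 50 times slower than netcdf frames, and playback of this benchmark system would be limited to about one frame per second with the yaml format.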

This is the slightly optimized version of the yaml file reader, but the difference is hardly noticeable:

https://www.ovito.org/download/testing/ovito-basic-3.10.1-dev-HEAD-1b793b69-macos-arm64-dev3.10.1.dmg
https://www.ovito.org/download/testing/ovito-basic-3.10.1-dev-HEAD-1b793b69-x86_64-dev3.10.1.tar.xz
https://www.ovito.org/download/testing/ovito-basic-3.10.1-dev-HEAD-1b793b69-win64-dev3.10.1.exe

Oystein · January 3, 2024, 9:03pm

I was not aware that netcdf files were so much quicker than dump custom. Interesting and good to know! Thanks for the information.