Do not try to guess the simulation box size from xyz files

Currently, when loading an xyz file, OVITO seems to try to “guess” the simulation box size if it can’t find this information in the file itself. This is not great because the guessed box size is usually wrong, and the user may be falsely led to believe that OVITO read the box size successfully.

My opinion:

  1. if the xyz is loaded as the “primary” data source, OVITO should ask the user to enter the simulation box size if it can’t be found in the file itself. Maybe with an “I don’t care” option falling back to the current behavior.
  2. if the xyz is loaded as a trajectory, OVITO should only load raw particle coordinates (keeping the simulation box unchanged) if it can’t find the box size in the file.

I’m on OVITO 3.13.0.

Thank you for this suggestion.

Yes, OVITO generates an ad hoc simulation cell for structures loaded from XYZ files based on the bounding box of the particle coordinates. This is documented here; see the note in the blue box.

For XYZ files using the extended format, the situation is better, as they typically contain information about the simulation cell, which OVITO reads. As developers, we strongly recommend using only such modern file formats that contain box information and avoiding the legacy XYZ format altogether.

I agree with you, it can be problematic if inexperienced users assume that the box automatically generated for legacy XYZ files is physically meaningful. For example, this can lead to quantitative errors if you use functions that involve the box volume, such as RDF calculation or the Replicate modifier. It would be better if OVITO prevented such user errors in some way.
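To make the size of such an error concrete, here is a minimal sketch that compares the volume of an axis-aligned bounding box with the true periodic cell volume. The helper name `bounding_box_volume` and the toy lattice are made up for illustration, and the sketch only assumes a plain bounding box; OVITO's actual ad hoc cell construction may differ in detail.

```python
from itertools import product


def bounding_box_volume(coords):
    """Volume of the axis-aligned bounding box around a set of 3D points."""
    xs, ys, zs = zip(*coords)
    return (max(xs) - min(xs)) * (max(ys) - min(ys)) * (max(zs) - min(zs))


# Toy system (illustrative): 8 atoms on a simple cubic lattice with spacing 2.0
# inside a periodic cubic box of edge length 4.0.
coords = [(2.0 * i, 2.0 * j, 2.0 * k) for i, j, k in product(range(2), repeat=3)]
true_volume = 4.0 ** 3

guessed_volume = bounding_box_volume(coords)
true_density = len(coords) / true_volume
guessed_density = len(coords) / guessed_volume

print(f"true volume    = {true_volume}, density = {true_density}")
print(f"guessed volume = {guessed_volume}, density = {guessed_density}")
```

In this toy case the bounding box misses the empty half-spacing at the periodic boundaries, so the number density, and with it the normalization of an RDF, comes out eight times too large.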

In my opinion, the best solution is for OVITO not to create and display a simulation box at all if the imported file does not contain any cell information. All analysis functions that strictly require a cell volume (e.g., RDF or structure factor calculation, replicate) will then be completely disabled and cannot be used accidentally. This is a change in behavior that we are considering for OVITO 3.14.0. The generation of an ad hoc volume based on the axis-aligned bounding box would still be available, but only at the explicit request of the user.

You expressed the wish that there should also be a way to enter the dimensions of the simulation cell manually. Could you please explain this usage scenario in more detail? I wonder why you don’t just use a file exchange format that stores the (known) box dimensions. Why is manual transfer to OVITO more sensible - or necessary?


Hi, thanks for the response.

In my opinion, the best solution is for OVITO not to create and display a simulation box at all if the imported file does not contain any cell information.

I agree with this direction.

You expressed the wish that there should also be a way to enter the dimensions of the simulation cell manually. Could you please explain this usage scenario in more detail?

It’s more of a user experience consideration: currently I find myself defining the simulation cell with the Affine Transformation modifier after loading the xyz file, which works fine but is not very intuitive.

Regarding the file format, sometimes it’s simply more convenient to work with the xyz format. For example, (afaik) CP2K only supports these trajectory formats:

  1. XYZ
  2. PDB, which is silly for non-biological simulations
  3. DCD, which could be fine but (a) it does not carry atom type information so I have to write another script to convert CP2K input file into something OVITO can read, (b) it is binary so cannot be easily parsed without some heavy dependencies, (c) sometimes OVITO cannot read it for some reason (I may open another report for it later)

Yes, thank you very much for the additional background info. I now understand why it is necessary to manually specify the box size in this case. It seems to be mainly a limitation of the CP2K code, which does not output cell information.

In OVITO, the Affine Transformation modifier will continue to be the hack for manually overriding the cell geometry - as I truly believe that in the year 2025 there should be better ways to transfer this essential info between programs.

Have you considered asking the CP2K developers to make improvements? I think this is where the source of the problem lies. It should be very easy for them to change the code that writes the xyz file and switch to the extended XYZ format, which is fully backward compatible with the legacy XMOL/XYZ format.
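Until the writing code emits cell information itself, a small post-processing step can add it. The following is a minimal sketch, assuming an orthorhombic, fully periodic box with known edge lengths and four columns per atom (element symbol plus x, y, z). The function name `legacy_to_extxyz` is hypothetical. Note that it replaces the original comment line of every frame, so any metadata stored there (step number, energy) is lost.

```python
def legacy_to_extxyz(text, box):
    """Rewrite legacy XYZ frames as extended XYZ with an orthorhombic cell.

    text: contents of a legacy XYZ file (one or more frames).
    box:  (lx, ly, lz) edge lengths; assumed known to the user.
    """
    lx, ly, lz = box
    # The second line of each frame carries key=value pairs in extended XYZ.
    # pbc="T T T" is an assumption of this sketch.
    header = (
        f'Lattice="{lx} 0.0 0.0 0.0 {ly} 0.0 0.0 0.0 {lz}" '
        'Properties=species:S:1:pos:R:3 pbc="T T T"'
    )
    lines = text.splitlines()
    out = []
    i = 0
    while i < len(lines):
        if not lines[i].strip():  # tolerate blank lines between frames
            i += 1
            continue
        n_atoms = int(lines[i])
        atoms = lines[i + 2 : i + 2 + n_atoms]
        if len(atoms) != n_atoms:
            raise ValueError("truncated frame in XYZ input")
        out.append(str(n_atoms))
        out.append(header)  # replaces the original comment line
        out.extend(atoms)
        i += 2 + n_atoms
    return "\n".join(out) + "\n"


legacy = "2\nframe 0\nH 0.0 0.0 0.0\nO 0.0 0.0 1.0\n"
converted = legacy_to_extxyz(legacy, (10.0, 10.0, 10.0))
print(converted)
```

Because the atom count and the per-atom lines are unchanged, a reader that only understands the legacy format can still parse the result and will simply treat the second line as a comment.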

It turns out that CP2K includes a tool for converting DCD trajectories to XYZ (not extxyz, but it includes cell information), so I can’t really ask them to do more, I guess.

Also, regarding the “sometimes OVITO cannot read DCD” thing, I think it’s because OVITO’s automatic file type detection identifies the DCD file as something else (it asks me to specify a column mapping with 24 columns), even if the file has the .dcd extension. If I manually specify the format in the file selection dialog, it works fine.

Would you mind sharing the problematic DCD file with us? Then we could try refining the auto-detection of OVITO to avoid the wrong classification.

If you cannot upload the file here, you can also send it to [email protected]. Thanks.

I think almost every DCD file I have (sometimes not even created by CP2K) has this problem. Anyway, here is an example:

test.dcd (972 Bytes)

Thanks for the DCD file. I found out that OVITO wrongly recognized the file as a binary LAMMPS dump file. This file format is very problematic because it has virtually no unique signature that can be used to identify it with certainty. OVITO therefore has to use a very fragile heuristic, which in this case led to a false positive result for the DCD file.

I have now made the recognition criteria for the binary LAMMPS format even stricter, so that the result of the check is negative and the DCD file reader can take over. This improvement will be included in the next OVITO program release.

Why does OVITO not check the extension first? I mean, the file selection dialog even says that the DCD format has the extension .dcd, and according to the LAMMPS docs the binary LAMMPS dump file generally has the extension .bin or .lammpsbin:

https://docs.lammps.org/dump.html

If the specified filename ends with “.bin” or “.lammpsbin”, the dump file (or files, if “*” or “%” is also used) is written in binary format

Given the limited choice of output formats, I’d transform the DCD into a more expressive format using catdcd, which is shipped with VMD. In your case, start with a dummy PDB file with no box information, but with the correct atom count and type. Then load the PDB file into VMD and create a structure (aka topology) file:

vmd > animate write psf test.psf

That’s the result: test.psf
Finally, convert with (assuming catdcd is available in your $PATH):

catdcd -s test.psf -stype psf -o test.dump -otype lammpstrj test.dcd

Now you are good to go: test.dump

The original reason why OVITO does not use the file extension for the detection is that the LAMMPS developers commonly use reverse notation for file names in their simulation examples, which was adopted by many users. For example, data.coreshell or dump.friction.

But you may be right. Binary LAMMPS files seem to be an exception. Here, LAMMPS itself seems to dictate the file extension. It is not possible to produce such files having another extension. So we could use this as another criterion for the safe recognition of binary LAMMPS files.

That is not quite correct. LAMMPS binary dump files contain a magic string, DUMPATOM, DUMPCUSTOM, or DUMPGRID, for atom style, custom style, or grid style dump files, respectively.

Here is the binary dump file detection for atom and custom style binary dump files that I submitted to the file command maintainers (grid style dump files did not exist at the time):

# Atom style binary dump file for the LAMMPS MD code, https://www.lammps.org
# written on a little endian machine
0         lequad  -8
>0x08     string  DUMPATOM     LAMMPS atom style binary dump
>>0x14    lelong  x            (rev %d),
>>>0x10   lelong  0x0001       Little Endian,
>>>>0x18  lequad  x            First time step: %lld

# written on a big endian machine
0         bequad  -8
>0x08     string  DUMPATOM     LAMMPS atom style binary dump
>>0x14    belong  x            (rev %d),
>>>0x10   lelong  0x1000       Big Endian,
>>>>0x18  bequad  x            First time step: %lld

# Custom style binary dump file for the LAMMPS MD code
# written on a little endian machine
0         lequad  -10
>0x08     string  DUMPCUSTOM   LAMMPS custom style binary dump
>>0x16    lelong  x            (rev %d),
>>>0x12   lelong  0x0001       Little Endian,
>>>>0x1a  lequad  x            First time step: %lld

# written on a big endian machine
0         bequad  -10
>0x08     string  DUMPCUSTOM   LAMMPS custom style binary dump
>>0x16    belong  x            (rev %d),
>>>0x12   lelong  0x1000       Big Endian,
>>>>0x1a  bequad  x            First time step: %lld

P.S.: also DCD files have a magic byte sequence (CORD) so they can be safely detected.
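The same signature checks can be sketched in a few lines of Python. This is only an illustration of the rules above, not code from OVITO or from the file utility. The function name `sniff_format` is made up, and the DCD test assumes the common layout with a 4-byte record marker, which places CORD at byte offset 4.

```python
import struct

# Magic strings of modern LAMMPS binary dumps, as listed above.
LAMMPS_MAGIC = (b"DUMPATOM", b"DUMPCUSTOM", b"DUMPGRID")


def sniff_format(header):
    """Classify the leading bytes of a file by their magic signature."""
    if len(header) >= 8:
        # Modern dumps start with a 64-bit integer holding the negative
        # length of the magic string, in either byte order.
        for byte_order in ("<", ">"):
            (first,) = struct.unpack(byte_order + "q", header[:8])
            for magic in LAMMPS_MAGIC:
                if first == -len(magic) and header[8 : 8 + len(magic)] == magic:
                    return "lammps-binary"
    # Assumption: 4-byte record marker followed by the CORD tag.
    if header[4:8] == b"CORD":
        return "dcd"
    return "unknown"


# Synthetic headers built for illustration only.
atom_le = struct.pack("<q", -8) + b"DUMPATOM" + struct.pack("<iiq", 1, 2, 0)
custom_be = struct.pack(">q", -10) + b"DUMPCUSTOM" + struct.pack(">iiq", 1, 2, 0)
dcd = struct.pack("<i", 84) + b"CORD" + bytes(80)

print(sniff_format(atom_le), sniff_format(custom_be), sniff_format(dcd))
```

Legacy dumps without a magic string fall through to "unknown" here, which is exactly the case that forces a reader to fall back on heuristics.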

Hi Axel,

Thanks for pointing this out. Indeed, OVITO does verify the presence of these magic strings, see this location in the source code, but only under the precondition that the first big-int read from the file has a negative value. This is probably to maintain backward compatibility with older binary dump files that didn’t have the magic string yet. This is also why OVITO can still accidentally treat other files as dump files.

I had forgotten about the modernization of the file format you mentioned, so my statement was about the original, unmarked variant of the format.

We could consider removing backward compatibility with very old files from OVITO. Do you know when this change was introduced in LAMMPS? How likely is it that users still want to process and visualize legacy files?

OVITO’s recognition of DCD files actually works robustly. In this case, the problem was that the LAMMPS dump file reader happened to be first in line and claimed the file for itself during the auto-detection process. The DCD reader never got the opportunity to check the file.

According to git, this was changed mid-August 2020.

That is very difficult to answer. Very few people use the binary format. More common is to stick with text format, but use compression. We do occasionally see people use older versions, but rarely 4 or 5 years old. The “modern” LAMMPS era somewhat starts in 2020 with version 3 March 2020 (as part of the LAMMPS development and refactoring push we had at Temple during the COVID-19 stay-at-home phase).

It should be fairly safe to remove backward compatibility with the legacy binary dump revision, since the binary2txt tool in LAMMPS is backward compatible and can convert old style binary dumps into text mode, which can be easily read. In my personal opinion the benefit of safely detecting the file format outweighs the need for backward compatibility. It is still a good idea to mention binary2txt in the documentation.

Thanks for the information, Axel. I think that hopefully no one will still be actively using a LAMMPS version from before 2020. But there may still be people who have saved old dump files that they want to view with OVITO. You’re right, they should be (made) aware that they can also convert them at any time using the binary2txt tool.

I think we can keep the current solution that we implemented in OVITO 3.13.1, as it should be relatively robust: OVITO’s binary dump file reader now only inspects files that end with .bin or .lammpsbin. In addition, the heuristic criteria have been tightened, but legacy dump files are still automatically recognized and can be loaded.
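As a rough illustration of why the extension gate helps, consider a chain of readers in which the first one to claim a file wins. Everything in this sketch is hypothetical: `loose_lammps_check` is a deliberately naive stand-in for a fragile legacy heuristic and is not OVITO's actual criterion, and `detect` is not an OVITO function.

```python
import struct
from pathlib import PurePath

LAMMPS_BINARY_SUFFIXES = {".bin", ".lammpsbin"}


def loose_lammps_check(header):
    """Naive stand-in heuristic: accept any non-negative leading 64-bit integer."""
    if len(header) < 8:
        return False
    (timestep,) = struct.unpack("<q", header[:8])
    return timestep >= 0


def is_dcd(header):
    """DCD signature, assuming a 4-byte record marker before the CORD tag."""
    return header[4:8] == b"CORD"


def detect(filename, header, gate_by_extension):
    """Return the name of the first reader in the chain that claims the file."""
    suffix = PurePath(filename).suffix.lower()
    lammps_allowed = (not gate_by_extension) or suffix in LAMMPS_BINARY_SUFFIXES
    readers = [
        ("lammps-binary", lambda h: lammps_allowed and loose_lammps_check(h)),
        ("dcd", is_dcd),
    ]
    for name, claims in readers:
        if claims(header):
            return name
    return None


# Synthetic DCD-like header built for illustration only.
dcd_header = struct.pack("<i", 84) + b"CORD" + bytes(80)

print(detect("test.dcd", dcd_header, gate_by_extension=False))
print(detect("test.dcd", dcd_header, gate_by_extension=True))
```

Without the gate, the loose check sits first in line and claims the DCD file. With the gate, the file name rules it out and the DCD reader gets its turn.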
