LAMMPS ontology

Quote from another topic.

This knowledge could be captured with an ontology that allows annotating simulation parameters with metadata. This approach could lead to black boxes whose purpose is not to do scientific research but to produce data consistently. Are you aware of any such project for LAMMPS?
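
For concreteness, a minimal sketch of what such an annotation could look like, written as a plain Python dictionary; the schema (field names like "rationale" and "valid_when") is invented purely for illustration and is not taken from any existing LAMMPS ontology:

```python
# Hypothetical annotation of a single thermostat parameter; every field
# name below is made up for the sake of the example.
tdamp_annotation = {
    "command": "fix nvt",        # Nose-Hoover thermostat fix in LAMMPS
    "parameter": "Tdamp",        # temperature damping time
    "value": 100.0,
    "units": "fs",               # assuming 'units real' and a 1 fs timestep
    "rationale": "Common rule of thumb: ~100 timesteps, loose enough not to "
                 "disturb the dynamics, tight enough to hold the set point.",
    "valid_when": "equilibrated bulk liquid, not for strongly driven systems",
    "validation": "temperature fluctuations consistent with the NVT ensemble",
}

print(f"{tdamp_annotation['parameter']} = {tdamp_annotation['value']} "
      f"{tdamp_annotation['units']}: {tdamp_annotation['rationale']}")
```

The hard part, of course, is the content of the "rationale" and "valid_when" fields, not the container.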

Hmm… what would the metadata look like for “I did it this way, because it always works for me”, or “when I use this combination of settings for that kind of purpose the system will usually behave better”, or “and if it doesn’t I can play with the XXX setting until it does”?

Or put differently, how can you formalize intuition and years of experience?

It is already a good starting point to provide a formal system to reproduce someone else's results. On top of that, one could start by describing what a parameter does, for instance choosing a tight or loose coupling constant depending on the system size.
A formal system could help define when a system has reached thermal equilibrium or how long a simulation should run to get meaningful results. This is the knowledge that an ontology should provide, but the ultimate validation comes from comparing the (annotated) output with reference data.
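
To make the equilibration point a bit less abstract, here is a toy sketch of how such a criterion could be written down: block-average a thermo quantity and look for a trend across the blocks. The block count and the synthetic data are arbitrary choices for illustration, not a recommended protocol:

```python
import numpy as np

def block_means(values, n_blocks=5):
    """Split a thermo time series (e.g. potential energy per output step)
    into equal blocks and return the block averages; a systematic trend
    across the blocks suggests the run was not yet equilibrated or that
    too little of the initial transient was discarded."""
    blocks = np.array_split(np.asarray(values, dtype=float), n_blocks)
    return np.array([b.mean() for b in blocks])

# Synthetic stand-in for a thermo log: a decaying transient plus noise.
rng = np.random.default_rng(0)
steps = np.arange(5000)
pe = -1000.0 + 50.0 * np.exp(-steps / 200.0) + rng.normal(0.0, 2.0, steps.size)

# The first block average sticks out when the transient is kept, and the
# averages are flat to within the noise once it has been discarded.
print("full run:     ", np.round(block_means(pe), 2))
print("transient cut:", np.round(block_means(pe[2000:]), 2))
```

An actual ontology entry would also have to state which quantity to monitor and for which target property such a crude test is good enough, and that is exactly the hard part.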

Documenting and archiving reproducible results is, for example, a goal of the OpenKIM project, which has a forum on this server as well. There are some other database-related categories here too, so you may want to take a look at the available categories.

However, this is missing a core point of my thoughts. These can only provide the “what”, not the “why”, and that is in part because the “why” is not so easy to write down.

This is why I believe that people who ask “can you give me an input to do XXX so I can learn from it?” are not 100% honest, because from an input without commentary (i.e. the “why”) you don’t really learn that much. You can gather the same information from the documentation, which comes with examples, too (it is just not a runnable example; some assembly is required, so there is some extra effort needed). So why not study the docs right away and work out examples on your own?

Perhaps at some stage people will be able to ask some AI software like ChatGPT “what does an input for a simulation of XXX with LAMMPS look like?” because it has extracted the “why” as hidden information from many “whats”. But then again, different people have different approaches to building a simulation and most of those are more or less equally valid, so how could an AI pick that up?

The following papers (and references and citers) might be of further interest:

- Effects of thermostats/barostats on physical properties of liquids by molecular dynamics simulations (ScienceDirect)
- Testing for physical validity in molecular simulations, https://royalsocietypublishing.org/doi/10.1098/rsta.2020.0082

The fundamental problem, though, is that it is easy for me to write a paper about constant potential simulations of capacitors, and much harder to write a paper about how it only works with kspace accuracies of at least 1e-6. All the usual publication biases are at play here (positive publication bias, little incentive to share scripts or data or to provide validity or reproducibility testing, lack of funding for basic vs. applied research, etc.).
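
Just to show how little it would take to record that kind of “why”: a sketch via the LAMMPS Python interface (assuming the lammps Python module is installed; the 1e-6 value is my empirical observation, not a documented requirement, and all of the actual system setup is omitted):

```python
from lammps import lammps  # LAMMPS Python wrapper; assumes it is installed

lmp = lammps()

# The "why" that rarely makes it into the paper: with a looser kspace
# tolerance the constant-potential capacitor setup discussed above was not
# reliable, so the accuracy is pinned explicitly instead of left at a default.
KSPACE_ACCURACY = 1.0e-6   # empirical threshold from the discussion above

lmp.command("units real")
lmp.command("atom_style full")
# ... box, electrodes, pair_style with long-range Coulomb, etc. omitted ...
lmp.command(f"kspace_style pppm {KSPACE_ACCURACY}")
```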

If every paper that cited LAMMPS attached LAMMPS input scripts in the supplementary, I think the model training would be trivial …

As far as I know there are no “magical consensual recipes”, and without a grasp of what you are trying to achieve, some learning about the system, and general knowledge about MD, you might end up applying inappropriate heuristics without realising it. A paper I like illustrating how tricky it is to get consensual heuristics on convergence is this one. Even if the method tested is unreliable, it is interesting to see how the proportion of opinions changes with the experience of the surveyed people. This makes me think that arriving at general methods for assessing convergence that could gain broad approval is trickier than it looks. I really have a hard time figuring out what a “molecular dynamics ontology” would or would not cover.

From another perspective, Wong-ekkabut and Karttunen’s provocative review reminds us that it is not only about reproducible results but also about having a good grasp of the physics and of what to expect from simulations. So “meaningful results” is tricky in the sense that playing around with parameters might lead to “meaningful while meaningless” results. To quote their conclusion:

No matter how simple the simulation, the user must always check and validate all the parameter (as well as protocol) choices even if they have been used extensively before. Breaking the second law of thermodynamics like in Case Studies 1 and 2 demonstrates that anything is indeed possible and that such unphysical results may look very exciting. As with many other things in life, if something looks too good, it most likely is. This also explains why so many bad, uninformed, or sometimes old, choices still remain in simulation protocols: validation is time consuming and thankless. […] In the worst case, something appears to be wrong or suspicious and finding the origin may be extremely time consuming and tedious as anyone who has tried it knows very well.

As they show (which goes further in @srtee’s direction), the literature is already very dense with tests of parameters and methods, yet there appear to be more incentives to publish non-physical results than to get a good grasp of MD. I think there might still be something missing at the community level, something in between normalised practices (tests, reproducible practices, general discussion of methods, etc.) and (new) users. But I know a lot of people in the community do their best to tackle this issue.

Thank you for sharing your thoughts on this topic, and for the pertinent citations. I must disclose that I am following the development of the Elementary Multiperspective Material Ontology (EMMO) and exploring how to use it to semantically describe modelling workflows.

For sure, one can use a workflow manager to ensure the reproducibility of a complex simulation, but this is no guarantee that the outcome is physically sound. I see the use of ontologies as a way to agree on terms and relations: e.g., if I say in my paper that a production time of 10 ns was analysed after reaching thermal equilibrium, this may be a perfectly valid choice if I am interested in radial distribution functions, but completely inadequate if I am computing the dielectric constant of, say, water.
Common terms can have a different meaning depending on the context: an atom is a featureless object in a classical MD simulation, has an electronic structure in QM, and is an even more complex object in particle physics.
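
To come back to the 10 ns example: one way to make that property dependence operational is to compare the production length against the correlation time of the observable actually being averaged, since the collective dipole moment behind a dielectric constant decorrelates far more slowly than the pair distances behind an RDF. A rough sketch, with a deliberately naive correlation-time estimator and a synthetic AR(1) series standing in for a slowly relaxing observable (the 50 ps relaxation time is an arbitrary choice, not a statement about water):

```python
import numpy as np

def correlation_time(x, dt):
    """Naive integrated autocorrelation time of a scalar time series,
    truncated at the first zero crossing of the autocorrelation function.
    Real analyses would use block averaging or a more careful estimator."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    acf = np.correlate(x, x, mode="full")[x.size - 1:]
    acf /= acf[0]
    cut = np.argmax(acf < 0) if np.any(acf < 0) else acf.size
    return dt * (1.0 + 2.0 * acf[1:cut].sum())

# Synthetic slowly relaxing observable: an AR(1) process sampled every
# 0.5 ps over 10 ns, with a 50 ps relaxation time (arbitrary numbers).
dt, n = 0.5, 20_000
phi = np.exp(-dt / 50.0)
rng = np.random.default_rng(1)
x = np.empty(n)
x[0] = 0.0
for i in range(1, n):
    x[i] = phi * x[i - 1] + rng.normal()

tau = correlation_time(x, dt)
production = n * dt
print(f"tau ~ {tau:.0f} ps, production/tau ~ {production / tau:.0f}")
# Of order 100 statistically independent samples of this observable in 10 ns:
# plenty for a structural average, marginal for slowly converging fluctuation
# properties such as the dielectric constant.
```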

I am not advocating for replacing the craftsmanship of materials modelling with some advanced form of AI. As you pointed out, validating a certain parameter choice depends on a good understanding of the physics of the system, a great deal of previous experience, and on determining the convergence of energy and forces variationally (plus, there is little to no incentive to publish these critical details). A good ontology should provide a common vocabulary and rules to agree on the description of methods, and to ensure their consistency with best practices.