the new read_data command

Jacob_Gissinger · December 2, 2015, 7:41pm

Hello Steve,

I have been looking at the new read_data command. It’s awesome, but part of my original interest in this functionality was to avoid requiring prescience (i.e. stating ‘extra’ values in the first read_data) in order to add new atom types or a number of bonds to existing atoms or restart files. Would it be possible to transfer the heavy lifting of printing out the data file and modifying it by hand to the LAMMPS source code?

Thanks,

Jake

akohlmey · December 3, 2015, 2:26pm

LAMMPS *requires* that certain properties (e.g. number of atom types)
are locked in when the simulation box is defined.
this happens when the first data file is read. changing this behavior
would require a massive code change.

on the other hand, nothing keeps you from defining (many) more types
and reserving space (outside of concerns for efficiency).

in general, what you are trying to do is to essentially use LAMMPS as
a pre-processing tool to build an initial structure.
that is *much* more easily and cleanly done with an external and
specific tool. the requirements for topology and system building and
efficiently running a (data) parallel simulation are pretty much
orthogonal.

axel.

sjplimp · December 3, 2015, 2:31pm

Not sure what you mean by this:

Would it be possible to transfer the heavy lifting of printing out the data file and modifying by hand to the LAMMPS source code?

The way it works now, when you read the first data file (or use the create_box command), you have to establish the model you will define, i.e. # of atom (bond, angle, etc) types. This is so internal data structs can be set up in the code. Then you can read multiple data files and populate your model. Doing this doesn’t require editing any data files (which may have been used as-is for other kinds of models). It only requires adding args to the various read_data (or create_box) commands in your input script to specify how to add the info in the data file.

That seems like a reasonable burden to put on the user?

Steve

akohlmey · December 3, 2015, 2:58pm

steve,

i think you are missing the point. with all those new features
(including those to create bonds etc.) people are now tempted to use
LAMMPS as a topology *building* tool, not knowing how tricky it is to
do the things they want to do in parallel with distributed data. that
is why i am so concerned about adding these features and particularly
always urge people to use external topology creation tools. there are
huge benefits to that. trying to do this all from inside LAMMPS is in
my personal opinion a bad idea, but it is difficult to explain without
knowing how things work internally.

that being said, if somebody absolutely wants to do that, the way to
do it without having to change data files would be to define the box
*first* using the create_box command and set the number of
atom/bond/angle/dihedral/improper types and other counts at the
beginning of the input to "safe" values that are larger than what any
expected combination of data files will amount to. that may waste
space and make LAMMPS run less efficiently, but addresses the problem in
a way that won't require making already complex and difficult to
maintain code even more complex.

the total number of types for each entity can always be larger than
the number of types in use. it would also be wise to then set all
per-type properties (masses, potential coefficients) to "safe" values.

axel.

sjplimp · December 3, 2015, 3:11pm

with all those new features
(including those to create bonds etc.) people are now tempted to use
LAMMPS as a topology building tool, not knowing how tricky it is

Totally agree - I’m not advocating people should not use a builder designed to do complex things. But if you want to take 2 or a handful of existing data files that already have topology info and merge them to perform one simulation, we tried to provide the hooks to do that.

The resulting LAMMPS commands do require you think carefully about which atoms (and bonds, etc) should become which atom types, and you sum up the total # of types you will have, so that you have a correct merged model with all the coeffs and cross-type coeffs defined. But LAMMPS can’t guess some of those things, like if the water molecules in 2 different data files are the same or different. That’s what the user needs to specify.

Steve

Jacob_Gissinger · December 3, 2015, 5:41pm

I would agree with Steve, combining systems in-house can be a powerful (clean and quick) tool.

This more obviously diverges from ‘building the system’ when at least one of the systems is to be run in lammps separately first. In this case, the ‘read_data add’ option would become more relevant if it could be used after a read_restart command. This would require being able to previously tell lammps to reserve space for extra atom types the same way you can tell it to reserve space for a certain number of bonds per atom. Axel, did you suggest this was already possible?

Ignorant of the coding issues, it would also be great to be able to automatically add any read_data to any read_data or restart file (which would be essentially assuming that all incoming atom types, etc. are new to the system).

akohlmey · December 3, 2015, 6:28pm

I would agree with Steve, combining systems in-house can be a powerful
(clean and quick) tool.

This more obviously diverges from 'building the system' when at least one of
the systems is to be run in lammps separately first. In this case, the
'read_data add' option would become more relevant if it could be used after
a read_restart command. This would require being able to previously tell
lammps to reserve space for extra atom types the same way you can tell it to
reserve space for a certain number of bonds per atom. Axel, did you suggest
this was already possible?

not with restarts. here is the basic rundown of the available options
and facilities:

- restarts should only be used for continuing simulations. at best you
can switch fixes or force field parameters.

- everything else should be done with data files via read_data, and/or
create_box, create_atoms.

- restart files can be converted into data files with the -r command
line flag, or with read_restart followed by write_data. or you can
simply also do write_data directly at the end of your preparatory
simulations. data files are by far more portable and backwards
compatible than restarts, too.
those data files will contain velocities, so there isn't anything
important lost.

- with create_box you can reserve sufficient extra space for all kinds
of topology data. please see the manual.

- so create_box followed by multiple read_data add is pretty much the
only option to combine topologies that does not require changing the
individual files, provided you set the corresponding offsets
accordingly. if you want that step to be done automatically, you
*have* to do it with an external tool.
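
The bookkeeping for that last option can be sketched in a few lines of Python. This is only an illustration under assumptions that are not stated in the thread: the helper name `build_input`, the region name `simbox` and the file names are made up, a molecular atom style (e.g. full) is assumed so that the topology keywords are accepted, and the region, units and atom_style commands are expected to appear earlier in the input. The script sums the type counts of all files for create_box and gives each read_data the running totals as its offsets (atom, bond, angle, dihedral, improper types, in that order).

```python
KINDS = ("atom", "bond", "angle", "dihedral", "improper")

def build_input(files, region="simbox", extra_bond_per_atom=0):
    """Return LAMMPS input lines that reserve all types up front.

    files: list of (filename, counts); counts maps a kind from KINDS
    to the number of types of that kind used in the file.
    region: name of an already defined region (hypothetical name).
    """
    # total number of types of each kind over all files
    totals = {k: sum(c.get(k, 0) for _, c in files) for k in KINDS}
    lines = [
        f"create_box {totals['atom']} {region} "
        f"bond/types {totals['bond']} angle/types {totals['angle']} "
        f"dihedral/types {totals['dihedral']} "
        f"improper/types {totals['improper']} "
        f"extra/bond/per/atom {extra_bond_per_atom}"
    ]
    # each file is shifted by the number of types read before it,
    # i.e. every incoming type is treated as new
    offset = dict.fromkeys(KINDS, 0)
    for name, counts in files:
        offs = " ".join(str(offset[k]) for k in KINDS)
        lines.append(f"read_data {name} add append offset {offs}")
        for k in KINDS:
            offset[k] += counts.get(k, 0)
    return "\n".join(lines)

print(build_input([("a.data", {"atom": 2, "bond": 1}),
                   ("b.data", {"atom": 3, "bond": 2, "angle": 1})]))
```

Treating every incoming type as new is the simplest choice; deciding that two files share a type (the water example) still has to be done by hand, and the coefficients for all types have to be set afterwards.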

Ignorant of the coding issues, it would also be great to be able to
automatically add any read_data to any read_data or restart file (which
would be essentially assuming that all incoming atom types, etc. are new to
the system).

that is not desirable for the technical reasons i already outlined. it
also won't work in the generality of force fields that LAMMPS supports
as it implicitly assumes that all mixed force field parameters between
the different atom types can be inferred from mixing rules. for a
large number of simulations that people run with LAMMPS, this is not
given, so this kind of feature would require a very complex operation
to be implemented and would still only benefit a few users. there are
many things that would be nice to have, but here is a limitation that
is due to the basic design of LAMMPS that cannot be easily removed
without massively impacting the performance. since the same operation
is rather easy to do externally, there is no good argument to do it
from inside LAMMPS.
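
For the case where mixing rules do apply, the cross terms can be computed mechanically. The following Python sketch shows the two common rules for Lennard-Jones parameters, matching what pair_modify calls "arithmetic" and "geometric" mixing; the function name `mix_lj` and the numbers are only for illustration. For force fields whose cross terms are fitted individually there is no such function, which is the limitation described above.

```python
import math

def mix_lj(eps_i, sig_i, eps_j, sig_j, rule="arithmetic"):
    """Mixed LJ parameters for an i-j pair from the per-type values."""
    # both rules use the geometric mean for the well depth
    eps_ij = math.sqrt(eps_i * eps_j)
    if rule == "arithmetic":      # Lorentz-Berthelot
        sig_ij = 0.5 * (sig_i + sig_j)
    elif rule == "geometric":
        sig_ij = math.sqrt(sig_i * sig_j)
    else:
        raise ValueError(f"unknown mixing rule: {rule}")
    return eps_ij, sig_ij

eps, sig = mix_lj(0.1, 3.0, 0.4, 4.0)
print(eps, sig)
```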

in fact, merging multiple topologies from data files under the
assumption that you have a class 1 force field with global mixing for
LJ parameters and need to check atom/bond/angle/dihedral/improper
types for overlap is pretty straightforward to program in a scripting
language like perl, python or Tcl. while features in LAMMPS should
work for many different cases, you can write that script to only
support your special use case and that would eliminate almost all of
the complications.

here is some pseudo code for it.

program takes multiple data files as arguments.
open all files and parse the header part. record the total number of
entities (atoms, bonds etc.) and types and the offsets for each file.
open the output file and write out a suitable combined header

now process each section, again looping over each file and apply the
predetermined offsets for types and ids, while writing the augmented
data to the new file. that requires some familiarity with list
manipulations in the scripting language of choice, but either perl,
python or tcl are well suited for that.
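
A minimal Python sketch of that pseudo code could look like the following. It is deliberately restricted and all of the restrictions are assumptions, not something stated in the thread: atom_style full (id mol type q x y z), orthogonal boxes, every incoming type treated as new, the box taken from the first file, molecule IDs shifted per file, and all sections other than Masses, Atoms, Bonds, Angles, Dihedrals and Impropers (e.g. coefficients and velocities) ignored. The function names `parse` and `merge` are made up.

```python
ENTITIES = ("atom", "bond", "angle", "dihedral", "improper")
SECTIONS = {"Masses": None, "Atoms": "atom", "Bonds": "bond",
            "Angles": "angle", "Dihedrals": "dihedral",
            "Impropers": "improper"}

def parse(text):
    """Split one data file into header counts, box lines and sections."""
    counts, box, sections, current = {}, [], {}, None
    for raw in text.splitlines()[1:]:        # first line is a title
        line = raw.split("#")[0].strip()     # drop trailing comments
        if not line:
            continue
        parts = line.split()
        if line[0].isalpha():                # section keyword
            current = sections.setdefault(line, [])
        elif current is not None:            # data line of a section
            current.append(parts)
        elif parts[-1] in ("xhi", "yhi", "zhi"):
            box.append(line)
        elif parts[0].isdigit():             # e.g. "10 atoms"
            counts[" ".join(parts[1:])] = int(parts[0])
    return counts, box, sections

def merge(texts):
    """Merge data file contents, offsetting all ids and types."""
    parsed = [parse(t) for t in texts]
    keys = [e + "s" for e in ENTITIES] + [e + " types" for e in ENTITIES]
    off = dict.fromkeys(keys, 0)             # running offsets per file
    mol_off = 0
    out = {name: [] for name in SECTIONS}
    for counts, _, sec in parsed:
        for f in sec.get("Masses", []):
            out["Masses"].append([str(int(f[0]) + off["atom types"]), *f[1:]])
        max_mol = 0
        for f in sec.get("Atoms", []):       # assumed: id mol type q x y z
            max_mol = max(max_mol, int(f[1]))
            out["Atoms"].append([str(int(f[0]) + off["atoms"]),
                                 str(int(f[1]) + mol_off),
                                 str(int(f[2]) + off["atom types"]), *f[3:]])
        for name, ent in list(SECTIONS.items())[2:]:
            for f in sec.get(name, []):      # id type atom-id atom-id ...
                out[name].append([str(int(f[0]) + off[ent + "s"]),
                                  str(int(f[1]) + off[ent + " types"]),
                                  *(str(int(a) + off["atoms"]) for a in f[2:])])
        mol_off += max_mol
        for key in off:
            off[key] += counts.get(key, 0)
    # after the loop the offsets are the totals for the combined header
    lines = ["merged LAMMPS data file", ""]
    lines += [f"{n} {key}" for key, n in off.items() if n]
    lines += ["", *parsed[0][1]]             # box of the first file
    for name, rows in out.items():
        if rows:
            lines += ["", name, ""] + [" ".join(r) for r in rows]
    return "\n".join(lines) + "\n"
```

It would be used along the lines of `open("merged.data", "w").write(merge([open(f).read() for f in files]))`. Overlapping coordinates, shared types and force field coefficients are exactly the project specific parts that this sketch leaves out.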

to summarize.
- type counts and other topology related parameters in LAMMPS are
locked in when the box is created; either directly or when a restart
or data file is read. nobody will change that, as that is at the very
core of how LAMMPS works and needed to run efficiently.
- automated combining of topologies only works with additional
constraints in place (e.g. the kind of force field). that is often
very project specific and then really easy to program as an external
tool for that particular use case. trying to shoehorn this into LAMMPS
itself is not a good idea and would be a massive effort.

when looking at how LAMMPS is implemented, the only way i could
imagine to make this work would be to implement a variant of
the "clear" command, let's call it "suspend", that would create a new
instance of the LAMMPS class, but not delete the old instance of the
LAMMPS class; instead it would keep a pointer to it available and assign it
a label. then you would basically do the same thing as with setting up
(and running) multiple independent consecutive simulations, but rather
than using clear to wipe out the old version, you keep it and then you
could have a "combine" command to build yet another new instance of
the LAMMPS class with the combined data from multiple suspended
instances, or possibly also a "switch" command that would cycle
between those suspended instances. as you probably see from this
description, this doesn't sound as easy as writing a (small) script to
combine data files.

axel.

Jacob_Gissinger · December 3, 2015, 7:03pm

The complications in your last paragraph might be avoided if lammps didn’t jump into creating an instance of itself, but instead (contrary to current lammps dogma) first scanned the whole input file to tally the size needed for relevant data structures.

sjplimp · December 3, 2015, 7:30pm

first scanned the whole input file to tally the size needed for relevant data structures.

Not possible. An input script can have branches (if/then/else) and they might depend on results of an earlier computation. So there is no way a priori to determine everything a script might do.

I suggest you propose something concrete that you’d like to do, but currently either cannot, or that it’s not as easy as you think it should be.

Note that reading in restart files mixed with data files is probably hopeless.

a) a restart file can always be converted to a data file

b) a restart file has potentially lots of other info that would make little sense to try and preserve if you are altering the system with new atoms, for example, initial atom coords for a compute msd for mean-sq displacement to continue seamlessly after a restart

Steve

akohlmey · December 3, 2015, 7:52pm

this cannot work. after a command is executed (i.e. a line of input
processed), the changes to the state that it promises to make have to be
fully committed. how else can you know whether there will be another
read_data command or not?
the point where types are committed is when you create the simulation
box, and you cannot read in coordinates without a simulation box.
so there is no way to read in a data file speculatively and (maybe)
commit it later.

axel.

system · December 3, 2015, 7:53pm

I think my only concrete suggestion is the hopeless one: basically adding a read_data option to the read_restart command, or equivalently the ability to reserve extra atom types (to be remembered by restart files).

system · December 3, 2015, 8:04pm

I believe this is exactly what the new read_data options do

akohlmey · December 3, 2015, 8:08pm

please explain what would be the benefit of reading a restart in this
context over reading a data file.

once you read a data file, you can just as well use an external tool
that combines data files, and you sidestep *all* complications.
the add option is meant to be used for a specific purpose that steve
has described and you can do the reservation of types from the input
script using create_box, as i have outlined to you.

i repeat, writing a small tool that combines data files is trivial to
implement by comparison to what you are asking for. yet you have so
far failed to provide a convincing argument in favor of your
suggestion that demonstrates a benefit over what can be done with
merging data files. that proposed tool can be run from the input
script via the shell command; after that you can do clear and move forward.

or let me ask differently: what is it exactly, that *cannot* be done
with merging datafiles with an external program?

axel.

akohlmey · December 3, 2015, 8:09pm

I believe this is exactly what the new read_data options do

no.

Jacob_Gissinger · December 3, 2015, 8:17pm

I agree that this (as is) is not adding new functionality to lammps, merely reducing the operating man-hours of its users. I believe this tool you are mentioning, which I/users already have in different languages, could be incorporated into lammps in a general fashion.

akohlmey · December 3, 2015, 8:33pm

I agree that this (as is) is not adding new functionality to lammps, merely
reducing the operating man-hours of its users. I believe this tool you are

...and instead wasting many more hours of developers on implementing
this and maintaining it?
that is not what i would call a convincing argument.

mentioning, which I/users already have in different languages, could be
incorporated into lammps in a general fashion.

it *cannot*. such a tool cannot be written to work in all generality.
LAMMPS is far too flexible for this and there are many caveats and
problems that you have not encountered. it is easy to write a
merge-my-data-files tool for a particular use case. we are going in
circles now. please look over the points that steve and i have made in
this discussion.

unless you find a better argument than just saving (rather little)
time. i would claim that a script code in the fashion that i have
outlined before can be implemented in python or perl or tcl in less
than a day for anybody with a reasonable proficiency in those
languages.

i am certain that steve will be more than happy to include contributed
versions of such an implementation to the tools directory and have
them distributed with LAMMPS. that is the clean way to solve such a
problem.

axel.

Jacob_Gissinger · December 3, 2015, 9:04pm

i think this is certainly merely for user convenience (barring future creative use cases). however, it may be nontrivial convenience not least for users who are scared away from writing their own codes for use via shell commands (most users).

akohlmey · December 3, 2015, 9:42pm

i think this is certainly merely for user convenience (barring future
creative use cases). however, it may be nontrivial convenience not least for
users who are scared away from writing their own codes for use via shell
commands (most users).

this is a rather irrational argument. ...and i honestly don't see what
is so scary about writing some script code.

when you are writing LAMMPS input scripts, you are doing the same
thing, and with a rather clumsy and sometimes even awkward or ad hoc
syntax and semantics. most script languages are simpler (perhaps with
the exception of APL).

if you want to do *any* kind of smart analysis or post-processing of
your simulation data, you'll *have* to do programming at the very same
or even higher level of complexity. if you want to make creative use
of simulation programs, you *have* to be able to do that kind of
programming, too. if you don't have that kind of skill, i advise you
to obtain it. there are plenty of good tutorials around. python is
probably the most widely used script language in this kind of context
and arguably the easiest to learn.

axel.

athomps · December 3, 2015, 10:00pm

If there is something that is hard for the user to do (without
learning to program), there is nothing wrong with making it easy, as
long as it can be done in a general way that is not too disruptive to
the LAMMPS code. However, after reading all of Jacob's comments, I
could not identify anything that falls in that category. All of the
requested items are either already supported, or can be accomplished
by a different work flow (e.g. generate data file from restart file),
or they are not very specific.

What specific thing is it that would save user-hours and is not
currently supported?

Aidan

Jacob_Gissinger · December 3, 2015, 11:14pm

Hello Aidan,

I believe I’ve run out of fresh arguments, but my vision is: combining two (possibly iteratively evolving) systems within lammps. Specifically: a robust ‘read_data add’ option to the read_restart command.

Jake