Interatomic model unique identifier:

Zachary_Trautt · August 29, 2014, 7:12pm

The purpose of this thread is to facilitate discussion of methods for unique identification of interatomic models, to be adopted by NIST, KIM, etc. It has been requested that an end date be placed on discussions, after which a decision must be made. Thus, I propose that discussions remain open to any interested party until 5:00 PM EDT, September 12, 2014.

I will initialize discussions. I propose a system that will identify both the description of an interatomic model and subsequent instantiations:

Description Identifier: E_A_Y_U
E: Element or compound information
A: Author information
Y: Publication year or year of availability
U: Unique string, number, DOI, etc.

Instantiation Identifier: E_A_Y_U_I
I: Additional information relating to the specific implementation of the model.

Examples:
Ni-Al-Co_SmithJJ_2014_SAX1EYJ7
---> Ni-Al-Co_SmithJJ_2014_SAX1EYJ7_KIM_EAM_Dynamo__MO_123412341234_001
---> Ni-Al-Co_SmithJJ_2014_SAX1EYJ7_EAM_Alloy_setfl_MD5_99a175c11698c523c4e6d84dbbcbfd12
---> Ni-Al-Co_SmithJJ_2014_SAX1EYJ7_EAM_Phi_table_MD5_3d48be9074b91eabad33952ba71adc8f

ZnS-H20_DoeJJ_2008_DOI:10.10/101010/1010
---> ZnS-H20_DoeJJ_2008_DOI:10.10/101010/1010_REAX_MD5_00860e77f28b404d47e017f46b809953
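
As a rough illustration of how the two identifiers compose, here is a minimal Python sketch. The helper names `description_id` and `instantiation_id` are hypothetical, and joining the fields with underscores is an assumption read off the E_A_Y_U_I pattern and the examples above, not a settled convention.

```python
def description_id(elements, author, year, unique):
    """Join the E, A, Y, and U fields with underscores (hypothetical helper)."""
    fields = [elements, author, str(year), unique]
    if any(not field for field in fields):
        raise ValueError("every field of the description identifier must be non-empty")
    return "_".join(fields)


def instantiation_id(description, info):
    """Append the free-form I field to an existing description identifier."""
    if not info:
        raise ValueError("the instantiation field must be non-empty")
    return f"{description}_{info}"


# Rebuild the first example above from its parts.
desc = description_id("Ni-Al-Co", "SmithJJ", 2014, "SAX1EYJ7")
inst = instantiation_id(desc, "EAM_Alloy_setfl_MD5_99a175c11698c523c4e6d84dbbcbfd12")
print(desc)
print(inst)
```

Because the I field may itself contain underscores, a string built this way cannot be split back into its fields by underscore alone.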

Discussion areas:

Identifier components (E, A, Y, U, and I):
Are these sufficient, excessive?

E: Element or compound information
Do we want to simply list elements (Zn-S-H-O) or include more information (Zn-ZnS-H2O-O-H)? Recommendations for convention?

A: Author information
First or all?

Y: Publication year or year of availability
People commonly refer to a model by the author and year, so I think it is important to include both. Thoughts?

U: Unique string, DOI, number, etc
This can be useful in the situation that an author publishes multiple models in the same year and/or for more information, such as paper DOI. Thoughts?

I: Instantiation Identifier
String with few limitations used to describe instantiation. Thoughts?

relliott · August 29, 2014, 8:41pm

A couple of basic constraints from the OpenKIM side:

* The general form of our openkim.org ID is:

PREFIX__CC_DDDDDDDDDDDD_VVV

   - PREFIX is limited to 100 characters maximum and only alpha-numeric
     characters (including underscore) are allowed. (In particular, dashes,
     dots, colons, slashes, etc. are NOT ALLOWED)

   - The only "double underscore" allowed is the one between the PREFIX and
     CC parts.

   - CC is a two-letter alphabetical code describing the KIM Item Type:
       MO - Model
       MD - Model Driver
       TE - Test
       TD - Test Driver
       RD - Reference Data
       VZ - Visualizer
       MV - Model Verification
       TV - Test Verification
       VV - Visualizer Verification

   - DDDDDDDDDDDD is a 12-digit unique decimal number randomly assigned to
     each KIM Item

   - VVV is a 3-digit version number starting at 000

* Generally, I would advocate for keeping the entire ID as short as possible.
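
The constraints above can be sketched as a small validator. This is only an illustration in Python, not code from the OpenKIM system: the function name `parse_kim_id` and the returned dictionary layout are hypothetical, and rejecting a PREFIX that ends in an underscore is my own reading of the "only one double underscore" rule.

```python
import re

# Two-letter KIM Item Type codes, as listed above.
KIM_ITEM_TYPES = {
    "MO": "Model",
    "MD": "Model Driver",
    "TE": "Test",
    "TD": "Test Driver",
    "RD": "Reference Data",
    "VZ": "Visualizer",
    "MV": "Model Verification",
    "TV": "Test Verification",
    "VV": "Visualizer Verification",
}

# PREFIX (1 to 100 characters of [A-Za-z0-9_]), "__", CC, 12 digits, 3 digits.
_KIM_ID = re.compile(
    r"(?P<prefix>[A-Za-z0-9_]{1,100})__"
    r"(?P<cc>[A-Z]{2})_(?P<number>[0-9]{12})_(?P<version>[0-9]{3})"
)


def parse_kim_id(kim_id):
    """Split PREFIX__CC_DDDDDDDDDDDD_VVV into its parts or raise ValueError."""
    match = _KIM_ID.fullmatch(kim_id)
    if match is None:
        raise ValueError(f"not of the form PREFIX__CC_DDDDDDDDDDDD_VVV: {kim_id!r}")
    prefix = match["prefix"]
    # Only the separator between PREFIX and CC may be a double underscore.
    if "__" in prefix or prefix.endswith("_"):
        raise ValueError(f"extra double underscore in {kim_id!r}")
    if match["cc"] not in KIM_ITEM_TYPES:
        raise ValueError(f"unknown KIM Item Type code {match['cc']!r}")
    return {
        "prefix": prefix,
        "item_type": KIM_ITEM_TYPES[match["cc"]],
        "number": match["number"],
        "version": int(match["version"]),
    }


print(parse_kim_id("EAM_Dynamo__MD_120291908751_000"))
```

An identifier such as Ni-Al-Co_SmithJJ_2014_SAX1EYJ7__MO_123412341234_001 is rejected by this check because of the dashes in its prefix.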

Discussion areas:

Identifier components (E, A, Y, U, and I):
Are these sufficient, excessive?

* This seems like plenty.

* The order in which these appear is worth considering. These IDs will often show up in alphabetical lists (such as from 'ls -1'), and how useful that ordering is depends on the order of these components.

Currently on openkim.org we recommend

<model type/name>_<developer name(s)>_<model_info>_<supported specie(s)>__CC_...

Roughly speaking this corresponds to your: U_A_Y_E_I

This "sorts by model type", whereas your ordering, E_A_Y_U_I, "sorts by species/elements".

E: Element or compound information
Do we want to simply list elements (Zn-S-H-O) or include more information (Zn-ZnS-H2O-O-H)? Recommendations for convention?

* OpenKIM would require the use of underscores instead of "-".

* Having the "primary" species first is good since these can be ordered
alphabetically for easy search in a list (as in a file listing a-la "ls -1")

* A simple list will be shorter, but has less information; I don't really have
any strong feeling here.

A: Author information
First or all?

* All is typically not possible. I would probably suggest first and second
with _et_al_ when more than 2 authors exist.
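
A minimal sketch of that author rule, in Python. The function name `author_field` is hypothetical, and whether the field should use surnames only or surnames plus initials is left open here.

```python
def author_field(surnames):
    """Build the A field: every name for one or two authors, first two plus et_al otherwise."""
    if not surnames:
        raise ValueError("at least one author is required")
    if len(surnames) <= 2:
        return "_".join(surnames)
    return "_".join(surnames[:2]) + "_et_al"


print(author_field(["Smith"]))
print(author_field(["Smith", "Doe"]))
print(author_field(["Smith", "Doe", "Jones"]))
```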

Y: Publication year or year of availability
People commonly refer to a model by the author and year, so I think it is important to include both. Thoughts?

* Agreed, the year is worth having

U: Unique string, DOI, number, etc
This can be useful in the situation that an author publishes multiple models in the same year and/or for more information, such as paper DOI. Thoughts?

* Good as long as we limit to alpha-numeric (with underscores)

I: Instantiation Identifier
String with few limitations used to describe instantiation. Thoughts?

I think you need to say more about what an "instantiation" is. If I understand your meaning, I think OpenKIM would treat each "instantiation" as a separate Item and assign different 12-digit DDDDDDDDDDDD codes to them.

Is that what you have in mind, or something different?

Ryan

Zachary_Trautt · September 2, 2014, 8:30pm

I am most concerned about the description identifier. I think flexibility
is necessary when uniquely identifying a given implementation. For
example, if a developer found a typo in their EAM alloy setfl file for a
published potential, the description identifier would remain the same and
the "corrected" alloy file would have a different string following MD5_.
For this same situation, I think it would be up to KIM editors as to how
to assign the KIM ID. If a model had a typo, would you increment the 3-digit
version number or assign a new 12-digit ID?
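
To make the MD5 idea concrete, here is a minimal Python sketch using the standard `hashlib` module. The helper names, the throwaway file names, and the placeholder file contents are all hypothetical; the only point is that a one-character correction yields a different string after MD5_ while the description identifier stays the same.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def file_instantiation_id(description, file_format, path):
    """Append the file format and MD5 digest to a description identifier."""
    return f"{description}_{file_format}_MD5_{file_md5(path)}"


# Two throwaway files that differ by a single character.
with tempfile.TemporaryDirectory() as workdir:
    original = os.path.join(workdir, "original.alloy")
    corrected = os.path.join(workdir, "corrected.alloy")
    with open(original, "wb") as handle:
        handle.write(b"placeholder table: 1.0 2.0 3.1\n")
    with open(corrected, "wb") as handle:
        handle.write(b"placeholder table: 1.0 2.0 3.0\n")
    description = "Ni-Al-Co_SmithJJ_2014_SAX1EYJ7"
    id_original = file_instantiation_id(description, "EAM_Alloy_setfl", original)
    id_corrected = file_instantiation_id(description, "EAM_Alloy_setfl", corrected)

print(id_original)
print(id_corrected)
```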

relliott · September 2, 2014, 8:52pm

I think you need to say more about what an "instantiation" is. If I understand your meaning, I think OpenKIM would treat each "instantiation" as a separate Item and assign different 12-digit DDDDDDDDDDDD codes to them.

Is that what you have in mind, or something different?

I am most concerned about the description identifier. I think flexibility is necessary when uniquely identifying a given implementation. For example, if a developer found a typo in their EAM alloy setfl file for a published potential, the description identifier would remain the same and the "corrected" alloy file would have a different string following MD5_. For this same situation, I think it would be up to KIM editors as to how to assign the KIM ID. If a model had a typo, would you increment the 3-digit version number or assign a new 12-digit ID?

In OpenKIM we would increment the 3-digit version number only for this type of typo correction.

Our aim is to have each unique CC_DDDDDDDDDDDD_VVV code have a one-to-one correspondence with a unique set of content. Once released to the public, that content would not change in any way. Thus, from the OpenKIM perspective, having the MD5 checksum in the ID would be redundant (also, we have a plan to include such checksums for each file in the KIM Item package content).

Ryan

Zachary_Trautt · September 2, 2014, 9:36pm

I agree it would be redundant as the MD5 checksum was intended to identify
unique files rather than KIM models.

There is a specific example we could consider: Potential #2 from [M.I.
Mendelev, S. Han, D.J. Srolovitz, G.J. Ackland, D.Y. Sun and M. Asta,
Phil. Mag. 83, 3977-3994 (2003).]

The KIM project appears to have only pulled the corrected version from the
NIST repository as:
EAM_Dynamo_Mendelev_Han_Fe_2__MO_769582363439_000 ---> KIM api 1.5
EAM_Dynamo_Mendelev_Han_Fe_2__MO_769582363439_001 ---> KIM api 1.6
where an archived version does not appear in KIM. I was not part of the
discussion, but the update was sufficient to change properties, such as
equilibrium lattice constant. Notes here:
http://www.ctcms.nist.gov/potentials/Fe.html

This is the kind of situation I have in mind, where the description
identifier remains the same but the implementation is updated. For sake of
simplicity, suppose "EAM_Dynamo_Mendelev_Han_Fe_2" is the description
identifier.

EAM fs setfl file identification:
EAM_Dynamo_Mendelev_Han_Fe_2_MD5_6722233744c98ca5d100d91f4a75a9e4
(archived version
http://www.ctcms.nist.gov/potentials/Download/Fe-MIM/Fe_2.eam)

EAM_Dynamo_Mendelev_Han_Fe_2_MD5_56f278dc4ab8d472fe30db7ac0b88d0c (correct
version http://www.ctcms.nist.gov/potentials/Download/Fe-MIM2/Fe_2.eam.fs)

How would this work if both models were warehoused on KIM?

KIM model identification:
EAM_Dynamo_Mendelev_Han_Fe_2__MO_???_??? ---> archived version, KIM api 1.5
EAM_Dynamo_Mendelev_Han_Fe_2__MO_???_??? ---> archived version, KIM api 1.6
EAM_Dynamo_Mendelev_Han_Fe_2__MO_769582363439_000 ---> corrected version, KIM api 1.5
EAM_Dynamo_Mendelev_Han_Fe_2__MO_769582363439_001 ---> corrected version, KIM api 1.6

relliott · September 3, 2014, 2:40am

Hi Zach, This is a nice explicit example. Let me play out a fictional scenario:

1) Suppose the original version (archived version http://www.ctcms.nist.gov/potentials/Download/Fe-MIM/Fe_2.eam) was submitted to OpenKIM on 13Feb2014 as the Parameterized Model with ID

EAM_Dynamo_Mendelev_Han_Fe_2__MO_123456789012_000

associated with the EAM_Dynamo__MD_120291908751_000 Model Driver compatible with openkim-api-v1.5.0.

2) On 28Jul2014 the kim-api-v1.6.0 was released and on 08Aug2014 the Model Driver was updated to support v1.6.0 and its ID became:

EAM_Dynamo__MD_120291908751_001

3) On the next day 09Aug2014 the Model was updated to use the new version of the Model Driver and its new ID became:

EAM_Dynamo_Mendelev_Han_Fe_2__MO_123456789012_001

In this case, only the Model's Makefile needs to be updated, in order to change the Model Name and the new Model Driver Name. (In the future this will be something that can be done automatically by the openkim.org system.)

The old version of the Model is marked as "old" or "superseded" and no longer shows up, by default, on the openkim.org list of Models. (However, it is always available from a standard web link.)

4) Now, it is discovered that the model is defective. The maintainer of the model creates the corrected version and is ready to upload. At this point the maintainer has two choices: (a) upload a new revision of the existing Model, or (b) "fork" the existing Model and create a new Model.

In case (a): the new version of the model is uploaded and has ID

EAM_Dynamo_Mendelev_Han_Fe_2__MO_123456789012_002

The previous version is marked as "old" and no longer shows up, by default, on the openkim.org list of Models.

In case (b): the new version is uploaded and is given a new ID:

EAM_Dynamo_Mendelev_Han_Fe_2__MO_987654321012_000

The openkim.org system will include some form of "history" or "NEWS" file indicating that this model was forked from EAM_Dynamo_Mendelev_Han_Fe_2__MO_123456789012_001. Further, the developer may also request that the model EAM_Dynamo_Mendelev_Han_Fe_2__MO_123456789012_001 be "discontinued" and marked as "old" or "retired" so that it no longer shows up in the default list of models on openkim.org.

5) In either case above, the Wiki for EAM_Dynamo_Mendelev_Han_Fe_2__MO_123456789012_001 can be updated to indicate that the model has been found to give "unintended" predictions and has therefore been retired in favor of the new model.
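
The two options in step 4 can be sketched as string operations on the ID. This is a minimal Python illustration, not how openkim.org implements it: the function names `new_revision` and `fork` are hypothetical, and in practice the 12-digit number for a fork would be assigned by the system rather than passed in by hand.

```python
import re

# PREFIX, "__", two-letter type code, 12-digit number, 3-digit version.
_KIM_CODE = re.compile(
    r"(?P<prefix>.+)__(?P<cc>[A-Z]{2})_(?P<number>[0-9]{12})_(?P<version>[0-9]{3})"
)


def _split(kim_id):
    match = _KIM_CODE.fullmatch(kim_id)
    if match is None:
        raise ValueError(f"not of the form PREFIX__CC_DDDDDDDDDDDD_VVV: {kim_id!r}")
    return match


def new_revision(kim_id):
    """Case (a): keep the 12-digit number and increment the 3-digit version."""
    match = _split(kim_id)
    version = int(match["version"]) + 1
    if version > 999:
        raise ValueError("the 3-digit version number is exhausted")
    return f"{match['prefix']}__{match['cc']}_{match['number']}_{version:03d}"


def fork(kim_id, new_number):
    """Case (b): take a new 12-digit number and restart the version at 000."""
    match = _split(kim_id)
    if re.fullmatch(r"[0-9]{12}", new_number) is None:
        raise ValueError("the new number must be exactly 12 decimal digits")
    if new_number == match["number"]:
        raise ValueError("a fork needs a number different from the original")
    return f"{match['prefix']}__{match['cc']}_{new_number}_000"


current = "EAM_Dynamo_Mendelev_Han_Fe_2__MO_123456789012_001"
print(new_revision(current))
print(fork(current, "987654321012"))
```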

Ryan

Steve_Stuart · September 4, 2014, 3:22pm

E: Element or compound information
Do we want to simply list elements (Zn-S-H-O) or include more information (Zn-ZnS-H2O-O-H)? Recommendations for convention?

There are a few problems here, but I don’t have any great solutions.

Listing elements only is a problem for any bonded force field. For example, there are plenty of potentials that can simulate H2O but not arbitrary combinations of H and O. On the other hand, listing molecules would become unwieldy very quickly.

I agree with Ryan that having the freedom to list the “important” element or compound first is nice, for sorting / screening purposes.

A: Author information
First or all?

I second Ryan’s suggestion to use Author_et_al for 3 or more authors, which is the convention in print in most fields.

Y: Publication year or year of availability
People commonly refer to a model by the author and year, so I think it is important to include both. Thoughts?

I prefer the year, as there are many cases where the author is not enough, and the year is needed to distinguish between successive versions of a potential. For example, REBO_Brenner_1990_CH and REBO_Brenner_2002_CH.

On the other hand, this isn’t sufficient, as in two distinct examples that would be described by Tersoff_Tersoff_1988_Si.

And what would be done in cases where there is no publication year? For example, someone who uploads a model prior to publication?

Steve_Stuart · September 4, 2014, 3:42pm

Ryan,

On Friday, August 29, 2014 4:41:21 PM UTC-4, RyanElliott wrote:

PREFIX is limited to 100 characters maximum and only alpha-numeric
characters (including underscore) are allowed. (In particular, dashes,
dots, colons, slashes, etc. are NOT ALLOWED)

What’s behind the decision to allow underscore as the only non-alphanumeric
character? I can understand the need to exclude some special characters,
but excluding all of them is somewhat inconvenient.

For one thing, it probably hinders attempts to match conventions or
interconvert with other standards like Zach is trying to do.

For another, it presents mild annoyances when constructing KIM
identifiers. The underscore is sort of a field separator, but sort of
not. If we resort to using underscores within a field, like the Zn_ZnS
example, it prevents parsing on the underscores to try to sort & categorize
the identifiers. But if we don’t use underscores within a field, then it
makes species lists like ZnZnSH2OOH unreadable, and author lists harder to
read. It would be nice to have a secondary field separator to use within
the model, author, year, etc fields.

-Steve

Zachary_Trautt · September 4, 2014, 4:07pm

What’s behind the decision to allow underscore as the only non-alphanumeric character? I can understand the need to exclude some special characters, but excluding all of them is somewhat inconvenient.

And underscores may be undesirable.

According to:
https://support.google.com/webmasters/answer/76329?hl=en
“We recommend that you use hyphens (-) instead of underscores (_) in your URLs.”

In a YouTube video from 2009, Google engineer Matt Cutts explained that Google treats hyphens (as separators) and underscores (ignored) differently.

I don’t know if this has changed. However, I think it means that Google will treat:
Cu_Ag__SmithJJ_2004 as CuAgSmithJJ2004
and
Cu-Ag--SmithJJ--2004 as Cu Ag SmithJJ 2004
I think this means that if you search for “Cu Ag 2004”, Cu-Ag--SmithJJ--2004 will rank significantly higher than Cu_Ag__SmithJJ_2004.

When I came across this, I replaced underscores with double hyphens on my testing web pages such that a potential description with ID:
Cu-Ag_SmithJJ_2004
would have a URL of nistPropertySite/Cu-Ag--SmithJJ--2004/
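
As a minimal sketch of that substitution in Python (the function names are hypothetical, and the guard against hyphens next to underscores is my own assumption so that the mapping can be reversed):

```python
def id_to_url_slug(description):
    """Replace each underscore with a double hyphen for use in a URL."""
    # Reject inputs whose hyphens would merge with the inserted "--".
    if "--" in description or "-_" in description or "_-" in description:
        raise ValueError("hyphens adjacent to separators make the slug ambiguous")
    return description.replace("_", "--")


def url_slug_to_id(slug):
    """Invert id_to_url_slug for slugs that it produced."""
    return slug.replace("--", "_")


slug = id_to_url_slug("Cu-Ag_SmithJJ_2004")
print(slug)
print(url_slug_to_id(slug))
```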

For one thing, it probably hinders attempts to match conventions or interconvert with other standards like Zach is trying to do.

I am not in a position to be hindered. This is the early stage for developing a standard.

relliott · September 4, 2014, 4:16pm

Currently the KIM API does not know anything about Model Names other than that they are alpha-numeric (plus underscore) strings. (It does not need to know about KIM IDs and their various parts; it just cares that each Model Name provides a unique string.)

It uses the Model name string as the name of a C function that wraps the model_init function. This gives the API a unique function name for accessing each model.

It is the need to use the Model name as a valid C function name that currently necessitates the restriction to alpha-numeric plus underscore strings.

I believe that the pipeline code is also taking advantage of this "feature".

There are certainly ways the API could get around this (convert all special characters to underscores for the purpose of creating valid function names, etc.). This would add a little complexity and create a small possibility of having the converted strings be non-unique (two distinct Model names that only differ by their punctuation).
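
The uniqueness concern can be illustrated with a short Python sketch. This is not KIM API code: `to_c_identifier` is a hypothetical helper that applies the "convert all special characters to underscores" idea, and the three Model names are made up for the example.

```python
import re


def to_c_identifier(model_name):
    """Map every character outside [A-Za-z0-9_] to an underscore."""
    return re.sub(r"[^A-Za-z0-9_]", "_", model_name)


# Three distinct names that differ only in punctuation.
names = ["Cu-Ag_SmithJJ_2004", "Cu_Ag_SmithJJ_2004", "Cu.Ag_SmithJJ_2004"]

# Group the original names by the identifier they convert to.
by_identifier = {}
for name in names:
    by_identifier.setdefault(to_c_identifier(name), []).append(name)

collisions = {ident: group for ident, group in by_identifier.items() if len(group) > 1}
print(collisions)
```

All three names collapse to the same function name here, so a scheme like this would need an extra uniqueness check when a Model name is registered.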

So, if we feel that it is important enough, it is technically possible to implement changes to support a wider set of punctuation. However, it will represent real work that needs to be done by the KIM development team.

Ryan

relliott · September 4, 2014, 4:17pm

That is interesting. Certainly a point to consider and keep in mind.

Ryan

Zachary_Trautt · September 4, 2014, 5:25pm

E: Element or compound information
Do we want to simply list elements (Zn-S-H-O) or include more information (Zn-ZnS-H2O-O-H)? Recommendations for convention?

There are a few problems here, but I don’t have any great solutions.

Listing elements only is a problem for any bonded force field. For example, there are plenty of potentials that can simulate H2O but not arbitrary combinations of H and O. On the other hand, listing molecules would become unwieldy very quickly.

I agree with Ryan that having the freedom to list the “important” element or compound first is nice, for sorting / screening purposes.

A: Author information
First or all?

I second Ryan’s suggestion to use Author_et_al for 3 or more authors, which is the convention in print in most fields.

It sounds like we have consensus: The elements should appear first and the order is to be determined by the developer:
ZnOH_
CuAg_

It also sounds like we have consensus on author format.

Y: Publication year or year of availability
People commonly refer to a model by the author and year, so I think it is important to include both. Thoughts?

I prefer the year, as there are many cases where the author is not enough, and the year is needed to distinguish between successive versions of a potential. For example, REBO_Brenner_1990_CH and REBO_Brenner_2002_CH.

On the other hand, this isn’t sufficient, as in two distinct examples that would be described by Tersoff_Tersoff_1988_Si.

And what would be done in cases where there is no publication year? For example, someone who uploads a model prior to publication?

I think I need to better explain my idea of modular identification. Suppose a developer uploaded a model to KIM and an alloy file to the NIST IPR prior to publication. The KIM ID might be of the form:
MO_123412341234_000
The alloy file ID might be of the form
EAM_Alloy_setfl_MD5_99a175c11698c523c4e6d84dbbcbfd12
When the description became public (paper in a journal, notes on NIST IPR, wiki on KIM, etc.), a description ID would be assigned. The description ID might be of the form:
CuAg_SmithJJ_et_al_2014
Then, the two implementation identifiers would be updated as:
CuAg_SmithJJ_et_al_2014_MO_123412341234_000
CuAg_SmithJJ_et_al_2014_EAM_Alloy_setfl_MD5_99a175c11698c523c4e6d84dbbcbfd12
Then, suppose the developer realized an incorrect file was sent to both KIM and NIST. The previous identifiers would remain valid for the archived versions. The corrected versions would cause two new implementation identifiers to be generated:

CuAg_SmithJJ_et_al_2014_MO_123412341234_001
CuAg_SmithJJ_et_al_2014_EAM_Alloy_setfl_MD5_3d48be9074b91eabad33952ba71adc8f

If a model is only described on the KIM wiki (or similar), I think it is reasonable to use the year the description was posted.