Atom typing error when building a styrene dimer using a user-defined force field

Dear developer,

I’m trying to build a styrene dimer. In the esh file, If I give the full SMILES of the dimer, it works.

#!/usr/bin/env emc.pl

ITEM	OPTIONS

replace         true
density         0.1
number          true
field           opls-aa  # type of FF to use
field_name      emc-ff
field_location  .
pdb_compress    false
suffix          "_linux_x86_64"
emc_execute     true
emc_depth       6  # speed up ring searching

ITEM	END

ITEM    GROUPS

POL     CC(c1ccccc1)CC(c1ccccc1)

ITEM    END

ITEM    CLUSTERS

pol     POL,1

ITEM	END

But If I treat the styrene dimer as a polymer and make use of the POLYMERS feature, emc failed with atom typing error

#!/usr/bin/env emc.pl

ITEM	OPTIONS

replace         true
density         0.1
number          true
field           opls-aa  # type of FF to use
field_name      emc-ff
field_location  .
pdb_compress    false
suffix          "_linux_x86_64"
emc_execute     true
emc_depth       6  # speed up ring searching

ITEM	END

ITEM	GROUPS

INI       *[H]
STY       *CC(*)c1ccccc1,1,INI:1,1,STY:2
TER       *[H],1,STY:2

ITEM	END

ITEM    CLUSTERS

pol     alternate,1

ITEM    END

ITEM    POLYMERS

pol
1       STY,2,INI,1,TER,1

ITEM    END

EMC failed with the following error

Info: field = {mode -> apply, style -> none, error -> true, debug -> false,
        check -> {atomistic -> true, charge -> true}}
Info: applying './emc-ff.prm'
Warning: no rule found for {group, site} = {STY, 0}.
Warning: no rule found for {group, site} = {STY, 3}.
Error: core/fields.c:433 FieldsApply:
       Missing rules.
       Program aborted.

In these two cases, the same top file is used for atom typing.

ITEM    RULES
# id type element residue atom charge rule
1    c3     C     -    -    0    C(C(H)(H)(C))(^6c(:^6c)(:^6c))(H)(H)
2    c3     C     -    -    0    C(C(H)(H)(H))(^6c(:^6c)(:^6c))(H)(C(C)(H)(H))
3    c3     C     -    -    0    C(C(^6c)(H)(C))(H)(H)(H)
4    c3     C     -    -    0    C(C(^6c)(H)(H))(H)(H)(C(C)(^6c)(H))
5    ca     C     -    -    0    ^6c(:^6c(:^6c)(H))(:^6c(:^6c)(H))(H)
6    ca     C     -    -    0    ^6c(:^6c(:^6c)(H))(:^6c(C)(:^6c))(H)
7    ca     C     -    -    0    ^6c(:^6c(C)(:^6c))(:^6c(:^6c)(H))(H)
8    ca     C     -    -    0    ^6c(C(C)(H)(C))(:^6c(:^6c)(H))(:^6c(:^6c)(H))
9    ca     C     -    -    0    ^6c(C(C)(H)(H))(:^6c(:^6c)(H))(:^6c(:^6c)(H))
10   ha     H     -    -    0    H(^6c(:^6c)(:^6c))
11   hc     H     -    -    0    H(C(C)(H)(C))
12   hc     H     -    -    0    H(C(C)(H)(H))
13   hc     H     -    -    0    H(C(C)(^6c)(C))
14   hc     H     -    -    0    H(C(C)(^6c)(H))

ITEM    END

Considering that these two esh files are describing exactly the same molecule, I cannot figure out why one works while the other does not.

I have attached the esh files and corresponding top and prm files for your reference.

emc-styrene.zip (2.4 KB)

Dear user,

You issue is caused by your list not being extensive enough to capture the molecule you are trying to type. You should not forget that EMC is trying all possible permutations of your group connectivity, even though when a certain moiety does not appear in your final configuration. You can check what EMC exactly does when typing your system by running your .esh with -field_debug=reduced, e.g.

./emc-polymer.esh -field_debug=reduced >/dev/null

which will output reduced field debugging output to build.out.

To remedy your issue, I would suggest to generalize your rules according to the following:

ITEM	RULES

# id	type	element	residue	atom	charge	rule

0	c3	C	-	-	0	C(~C)(C)(C)(C)
1	c3	C	-	-	0	C(~C)(C)(C)(H)
2	c3	C	-	-	0	C(~C)(C)(H)(H)
3	c3	C	-	-	0	C(~C)(H)(H)(H)
4	ca	C	-	-	0	^6c(:^6c)(:^6c)(*)
5	ha	H	-	-	0	H(^6c)
6	hc	H	-	-	0	H(C)

ITEM	END

The ~ in the above refers to a generalization of the hybridization, i.e. any type of bond is allowed, including partial bonds. Also, the number of ring members is ignored. The * character refers to a wildcard representing any type of element. Please note that I replaced your spaces by tabs. EMC can be somewhat finicky about the use of spaces as separators. With a single tab between each column you are always on the safe side.

I see you chose nanometers as a unit of length. Are you trying to use GROMACS afterwards?

Dear developer,

Thank you for the detailed explanation. Using more generalized typing rules indeed resolves this specific styrene dimer case. However, I am still struggling to understand the underlying mechanism of atom typing in EMC, especially when using the POLYMERS feature.

My use case is: I want EMC to assign atom types for an arbitrary molecule for which I already know the exact atom types, so I deliberately define very explicit chemical environments in the RULES section rather than relying on generic patterns.

With my originally posted emc-polymer.esh and emc-ff.top files, I followed your suggestion and reran EMC using -field_debug=reduced. EMC reports: Warning: no rule found for {group, site} = {STY, 0} Inspecting build.out, the chemical fragment used for atom typing of STY:0 is:

STY:0   C(C(^6c)(H))(C(^6c)(H))(H)(H)

It appears that EMC constructs the typing fragment from the atoms wrapped in red in the attached figure. The atom being typed (STY:0) is marked in blue.

My confusion is: Given ffdepth = 2, I would expect the leftmost green-circled carbon atom to be included in the fragment, but it is not present in the generated chemical pattern.

Could you elaborate on how EMC defines the neighborhood used for atom typing when POLYMERS feature is enabled?

Thank you very much for your time!

Dear user,

EMC checks all permutations of all connections that can be made – given the provided connectivity. Most likely, you are missing rules for certain permutations. For illustrative purposes, I redefined your emc-ff.prm and emc-ff.top to include four extra types.

emc-ff.prm (2.6 KB)
emc-ff.top (302 Bytes)

Below are highlighted are masses, equivalences, and rules:

# Mass section

ITEM	MASS

# type	mass	element	ncons	charge	index	comment

c4		12.0100	C	0	0	0	SP3 carbon, no hydrogens	
c4h		12.0100	C	0	0	0	SP3 carbon, one hydrogen
c4h2	12.0100	C	0	0	0	SP3 carbon, two hydrogens
c4h3	12.0100	C	0	0	0	SP3 carbon, three hydrogens
c3a		12.0100	C	0	0	0	SP2 aromatic carbon
h1a		1.0080	H	0	0	0	Hydrogen to aromatic carbon
h1		1.0080	H	0	0	0	Hydrogen to SP3 carbon

ITEM	END	# MASS

# Equivalence section

ITEM	EQUIVALENCE

# type	pair	incr	bond	angle	torsion	improper

h1		h1	h1	h1	h1	h1	h1	
c4		c4	c4	c4	c4	c4	c4	
c4h		c4	c4	c4	c4	c4	c4	
c4h2	c4	c4	c4	c4	c4	c4	
c4h3	c4	c4	c4	c4	c4	c4	
c3a		c3a	c3a	c3a	c3a	c3a	c3a	
h1a		h1a	h1a	h1a	h1a	h1a	h1a	

ITEM	END	# EQUIVALENCE

# Rules section

ITEM	RULES

# id	type	element	residue	atom	charge	rule

0	c4		C	-	-	0	C(~C)(C)(C)(C)
1	c4h		C	-	-	0	C(~C)(C)(C)(H)
2	c4h2	C	-	-	0	C(~C)(C)(H)(H)
3	c4h3	C	-	-	0	C(~C)(H)(H)(H)
4	c3a		C	-	-	0	^6c(:^6c)(:^6c)(*)
5	h1a		H	-	-	0	H(^6c)
6	h1		H	-	-	0	H(C)

ITEM	END	# RULES

I have used the equivalence section for linking new types which are equivalent to a central type, e.g. c4h2 to c4. My naming convention is based on the number of connections and number of connected elements, e.g. c4h2 represents a carbon which has four connections of which two are hydrogens. You will see that carbons are typed based on their connected hydrogens. When using these changes in combination with your setup.esh (setup.esh (687 Bytes))

# Options section

ITEM	OPTIONS

replace			true
density			0.1
number			true		# interpret column 3 of SHORTHAND or
							# CLUSTERS as the number of desired
							# clusters
field			opls-aa  	# type of FF to use
field_name		emc-ff
field_location	.
pdb_compress	false
pdf_licorice 	true
focus 			true
emc_execute		true
emc_depth		6			# speed up ring searching

ITEM	END	# OPTIONS

# Groups section

ITEM	GROUPS

INI			*[H]
STY			*CC(*)c1ccccc1, 1,INI:1, 1,STY:2
TER			*[H], 1,STY:2

ITEM	END	# GROUPS

# Clusters section

ITEM    CLUSTERS

polymer		alternate,1

ITEM    END	# CLUSTERS

# Polymers section

ITEM    POLYMERS

polymer
1			STY,2, INI,1, TER,1

ITEM    END	# POLYMERS

you can see the permutations for group STY in its branches under the (* Groups *) section in the produced setup.emc.gz:

branches -> {
        {site -> 0, nconnects -> 2, connects -> {
            {index -> INI, site -> 0, connect -> 0b1, mass -> {c4h3, h1}, 
              charge -> -0.0259, increment -> -0.0259}, 
            {index -> STY, site -> 3, connect -> 0b10, mass -> {c4h2, c4h}}}}, 
        {site -> 3, nconnects -> 2, connects -> {
            {index -> STY, site -> 0, connect -> 0b1, mass -> {c4h, c4h2}}, 
            {index -> TER, site -> 0, connect -> 0b1, mass -> {c4h2, h1}, 
              charge -> -0.0259, increment -> -0.0259}}}}

where you can see that both site 0 and 3 have two branches, where site 0 branches to groups INI and STY, and site 3 to groups STY and TER. This means that site 0 needs a rule for c4h2 and c4h3, while site 3 needs a rule a rule for c4h and c4h2. Having just one rule will not suffice for both cases. Typing information follows at the end of each branch in the form of mass, charge, and increment. EMC will complain if you omit any of the rules needed to type each of aforementioned permutations.

The resulting setup.psf reflects the different carbon types:

      34 !NATOM
       1 poly 1    STY  C    c4h3         -0.0777         12.01       0
       2 poly 1    STY  H1   h1            0.0259         1.008       0
       3 poly 1    STY  H2   h1            0.0259         1.008       0
       4 poly 1    STY  C1   c4h          -0.0161         12.01       0
       5 poly 1    STY  H11  h1            0.0259         1.008       0
       6 poly 1    STY  C2   c3a          -0.0098         12.01       0
       7 poly 1    STY  C3   c3a          -0.1364         12.01       0
       8 poly 1    STY  H31  h1a           0.1364         1.008       0
       9 poly 1    STY  C4   c3a          -0.1364         12.01       0
      10 poly 1    STY  H41  h1a           0.1364         1.008       0
      11 poly 1    STY  C5   c3a          -0.1364         12.01       0
      12 poly 1    STY  H51  h1a           0.1364         1.008       0
      13 poly 1    STY  C6   c3a          -0.1364         12.01       0
      14 poly 1    STY  H61  h1a           0.1364         1.008       0
      15 poly 1    STY  C7   c3a          -0.1364         12.01       0
      16 poly 1    STY  H71  h1a           0.1364         1.008       0
      17 poly 2    STY  C    c4h2         -0.0518         12.01       0
      18 poly 2    STY  H1   h1            0.0259         1.008       0
      19 poly 2    STY  H2   h1            0.0259         1.008       0
      20 poly 2    STY  C1   c4h2          -0.042         12.01       0
      21 poly 2    STY  H11  h1            0.0259         1.008       0
      22 poly 2    STY  C2   c3a          -0.0098         12.01       0
      23 poly 2    STY  C3   c3a          -0.1364         12.01       0
      24 poly 2    STY  H31  h1a           0.1364         1.008       0
      25 poly 2    STY  C4   c3a          -0.1364         12.01       0
      26 poly 2    STY  H41  h1a           0.1364         1.008       0
      27 poly 2    STY  C5   c3a          -0.1364         12.01       0
      28 poly 2    STY  H51  h1a           0.1364         1.008       0
      29 poly 2    STY  C6   c3a          -0.1364         12.01       0
      30 poly 2    STY  H61  h1a           0.1364         1.008       0
      31 poly 2    STY  C7   c3a          -0.1364         12.01       0
      32 poly 2    STY  H71  h1a           0.1364         1.008       0
      33 poly 3    INI  H    h1            0.0259         1.008       0
      34 poly 4    TER  H    h1            0.0259         1.008       0

Your quoted rule

STY:0   C(C(^6c)(H))(C(^6c)(H))(H)(H)

relates to atom with index 20, which is an end carbon. Please note that ffdepth = 2 refers to a maximum depth of 2. EMC will produce the chemical environment of an atom to be typed. EMC sorts rules internally with complex rules first and the most general rule last. The strategy for creating rules is to have simple general rules for more occurring types and more complex rules of types with lower frequencies.

In the following image, carbon atoms in the backbone are annotated with their indices in the PSF file and their assigned atom types using your provided typing rules.

Based on this, I believe that my originally posted emc-ff.top file already covers these four atoms through the following rules:

# id type element residue atom charge rule
1    c3     C     -    -    0    C(C(H)(H)(C))(^6c(:^6c)(:^6c))(H)(H)
2    c3     C     -    -    0    C(C(H)(H)(H))(^6c(:^6c)(:^6c))(H)(C(C)(H)(H))
3    c3     C     -    -    0    C(C(^6c)(H)(C))(H)(H)(H)
4    c3     C     -    -    0    C(C(^6c)(H)(H))(H)(H)(C(C)(^6c)(H))

1:c4h3 is described by rule 3: C(C(^6c)(H)(C))(H)(H)(H)

4:c4h is described by rule 2: C(C(H)(H)(H))(^6c(:^6c)(:^6c))(H)(C(C)(H)(H))

17:c4h2 is described by rule 4: C(C(^6c)(H)(H))(H)(H)(C(C)(^6c)(H))

20:c4h2 is described by rule 1: C(C(H)(H)(C))(^6c(:^6c)(:^6c))(H)(H)

I may be misunderstanding this point, but according to the chemical pattern itself, the central atom is bonded to two C(^6c)(H) fragments and two H atoms, which does not correspond to an end carbon environment.

I assume you have some code which creates rules automatically and need to know what all to include. You can see how EMC applies your rules by executing your previously provided polymer example as

emc.pl emc-polymer.esh -field_debug=reduced

which is the same as adding the command line option to your ITEM OPTIONS section. This will create output in reduced format and states, that you are missing rules for

STY:0	C(C(^6c)(H)(H))(H)(H)(H)

core/field/apply/rules.c:222 [*, *]
Debug: core/field/rule.c:448 6	ca	0	^6c(:^6c(:^6c)(C))(:^6c(:^6c)(H))(H)
Debug: core/field/rule.c:448 7	ca	0	^6c(:^6c(:^6c)(C))(:^6c(:^6c)(H))(H)
Debug: core/field/rule.c:448 8	ca	0	^6c(:^6c(:^6c)(H))(:^6c(:^6c)(H))(C(C)(C)(H))
Debug: core/field/rule.c:448 9	ca	0	^6c(:^6c(:^6c)(H))(:^6c(:^6c)(H))(C(C)(H)(H))
Debug: core/field/rule.c:448 5	ca	0	^6c(:^6c(:^6c)(H))(:^6c(:^6c)(H))(H)
Debug: core/field/rule.c:448 2	c3	0	C(^6c(:^6c)(:^6c))(C(C)(H)(H))(C(H)(H)(H))(H)
Debug: core/field/rule.c:448 1	c3	0	C(^6c(:^6c)(:^6c))(C(C)(H)(H))(H)(H)
Debug: core/field/rule.c:448 4	c3	0	C(C(^6c)(C)(H))(C(^6c)(H)(H))(H)(H)
Debug: core/field/rule.c:448 3	c3	0	C(C(^6c)(C)(H))(H)(H)(H)
core/field/apply/rules.c:390 match = 0x0
core/field/apply/rules.c:407 type[0] = >empty<

STY:3	C(C(H)(H)(H))(C(H)(H)(H))(^6c(:^6c)(:^6c))(H)

core/field/apply/rules.c:222 [*, *]
Debug: core/field/rule.c:448 6	ca	0	^6c(:^6c(:^6c)(C))(:^6c(:^6c)(H))(H)
Debug: core/field/rule.c:448 7	ca	0	^6c(:^6c(:^6c)(C))(:^6c(:^6c)(H))(H)
Debug: core/field/rule.c:448 8	ca	0	^6c(:^6c(:^6c)(H))(:^6c(:^6c)(H))(C(C)(C)(H))
Debug: core/field/rule.c:448 9	ca	0	^6c(:^6c(:^6c)(H))(:^6c(:^6c)(H))(C(C)(H)(H))
Debug: core/field/rule.c:448 5	ca	0	^6c(:^6c(:^6c)(H))(:^6c(:^6c)(H))(H)
Debug: core/field/rule.c:448 2	c3	0	C(^6c(:^6c)(:^6c))(C(C)(H)(H))(C(H)(H)(H))(H)
Debug: core/field/rule.c:448 1	c3	0	C(^6c(:^6c)(:^6c))(C(C)(H)(H))(H)(H)
Debug: core/field/rule.c:448 4	c3	0	C(C(^6c)(C)(H))(C(^6c)(H)(H))(H)(H)
Debug: core/field/rule.c:448 3	c3	0	C(C(^6c)(C)(H))(H)(H)(H)
core/field/apply/rules.c:390 match = 0x0
core/field/apply/rules.c:407 type[3] = >empty<

by indicating a zero pointer 0x0 and >empty< for the found type. The essence of the above is that both connection points have been replaced with your TER group, which is a single hydrogen. As I stated before, EMC tries all permutations. This means one has to consider

[TER][STY][TER]
[TER][STY][STY]
[STY][STY][STY]

My set of rules capture these boundary conditions, as where yours do not. You would need four extra rules, i.e.

15	c3	C	-	-	0	C(C(H)(H)(H))(C(H)(H)(H))(^6c(:^6c)(:^6c))(H)
16	c3	C	-	-	0	C(C(H)(H)(H))(^6c(:^6c)(:^6c))(H)(H)
17	c3	C	-	-	0	C(C(^6c)(H)(H))(H)(H)(H)
18	c3	C	-	-	0	C(C(^6c)(H)(H))(C(^6c)(H)(H))(H)(H)

You can figure these out by using -field_debug=reduced over a few iterations.

Yes, this is exactly what I’m trying to do.

However, I’m seeing different output on my side, and I think this is where my confusion comes from. After running EMC with -field_debug=reduced over several iterations, the missing patterns reported in my case are:

STY:0   C(C(^6c)(H))(H)(H)(H)
STY:0   C(C(^6c)(H))(C(^6c)(H))(H)(H)
STY:3   C(C(H)(H))(C(H)(H))(^6c(:^6c)(:^6c))(H)
STY:3   C(C(H)(H))(^6c(:^6c)(:^6c))(H)(H)

These patterns do not seem to match the missing patterns shown in your example output, and in particular I do not see a pattern that corresponds to [TER][STY][TER].

I have attached my build.out from the first EMC run for reference, in case it helps identify where our observations diverge.

build.out (31.0 KB)

That is odd. There is a hydrogen missing as you also can see in line 137 and onwards of the provided build.out. EMC creates rules internally from the provided groups and matches rules based on these groups. Something seems to go wrong with your version of EMC. I do not recall having fixed a problem related to your observation since the EMC in August of last year. However, my current version does not seem to suffer from these issues as you can see from the excerpts of my build.out in my previous post. A fix would be to follow what EMC tells you it needs as a rule. Not all that satisfactory, I agree. I will fix the issue and incorporate it in the next release.