Issues with the normalization of RDF

Takeshi_Miyake · October 24, 2024, 12:07pm

The normalization of the RDF in matminer/featurizers/structure/rdf.py is somewhat confusing to me.

(1) The RDF peak values differ by a constant multiple between different supercells of the same crystal (e.g. silicon),
even when the same cutoff is used.

This issue can be solved by the following replacement:
rdf → rdf/s.num_sites

(2) The RDF peak values also differ by a constant multiple between different bin sizes.

This issue can be solved by the following replacement:
rdf → rdf * self.bin_size

I am not very familiar with RDF normalization, so I would appreciate it if someone could verify the points above. Thanks!

In my experience, when we use the expression:
rdf = dist_hist / s.num_sites
we get the “raw” RDF, in which the first peak value for crystalline silicon is exactly 4 and the second is 12,
no matter what the bin size is (as long as the bins are narrow enough to keep the two neighbor shells apart). For the normalized RDF, i.e., the “density” of the distribution at a certain radius,
I wonder why the “density” changes with the cell size and the bin size.
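
The "raw" RDF claim can be checked without matminer or pymatgen. The following is only a minimal pure-Python sketch, assuming a diamond-cubic silicon cell with a = 5.431 Å and a dist_hist that counts every ordered neighbor pair, including periodic images. The helper names (diamond_supercell, pair_histogram, raw_peaks) are made up for illustration and are not part of matminer.

```python
import itertools
import math

A_SI = 5.431  # assumed conventional cubic lattice constant of silicon, in angstrom

# Fractional coordinates of the 8 atoms in the conventional diamond-cubic cell.
FCC = [(0.0, 0.0, 0.0), (0.0, 0.5, 0.5), (0.5, 0.0, 0.5), (0.5, 0.5, 0.0)]
BASIS = FCC + [(x + 0.25, y + 0.25, z + 0.25) for x, y, z in FCC]


def diamond_supercell(n):
    """Return (Cartesian positions, cubic box length) of an n x n x n supercell."""
    sites = [
        ((i + x) * A_SI, (j + y) * A_SI, (k + z) * A_SI)
        for i, j, k in itertools.product(range(n), repeat=3)
        for x, y, z in BASIS
    ]
    return sites, n * A_SI


def pair_histogram(sites, box, cutoff, bin_size):
    """Count ordered neighbor pairs (including periodic images) per distance bin."""
    n_bins = int(math.ceil(cutoff / bin_size))
    hist = [0] * n_bins
    reach = int(math.ceil(cutoff / box))  # periodic images needed per direction
    shifts = [
        (sx * box, sy * box, sz * box)
        for sx, sy, sz in itertools.product(range(-reach, reach + 1), repeat=3)
    ]
    for xi, yi, zi in sites:
        for xj, yj, zj in sites:
            for sx, sy, sz in shifts:
                d = math.dist((xi, yi, zi), (xj + sx, yj + sy, zj + sz))
                if 1e-8 < d < cutoff:  # skip the site itself
                    hist[min(int(d / bin_size), n_bins - 1)] += 1
    return hist


def raw_peaks(n, bin_size, cutoff=4.0):
    """Non-empty bins of dist_hist / num_sites for an n x n x n silicon supercell."""
    sites, box = diamond_supercell(n)
    hist = pair_histogram(sites, box, cutoff, bin_size)
    return [count / len(sites) for count in hist if count]


for n in (1, 2):
    for bin_size in (0.1, 0.5):
        print(n, bin_size, raw_peaks(n, bin_size))
```

With a 4 Å cutoff only the first two neighbor shells are included, so each call should report the two coordination numbers 4 and 12. The histogram itself grows with the number of sites; dividing by the site count is what removes the supercell dependence.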

Takeshi_Miyake · October 25, 2024, 12:12pm

Using the criterion that the density of the distribution approaches 1.0 at large distances for homogeneous (amorphous) systems, I have concluded that the correct expression in matminer/featurizers/structure/rdf.py is:

rdf = dist_hist / shell_vol / number_density / s.num_sites

instead of

rdf = dist_hist / shell_vol / number_density
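
The large-distance criterion can be tested on a system where the answer is known: points placed uniformly at random in a periodic cubic box should give a normalized RDF close to 1 at every radius. The sketch below is a standalone pure-Python check, not matminer's implementation. It assumes that dist_hist counts ordered pairs, that the cutoff is a multiple of the bin size, and that the cutoff is below half the box length. The function names and the divide_by_sites flag are hypothetical.

```python
import math
import random


def random_gas(num_sites, number_density, seed=0):
    """Place points uniformly at random in a cubic periodic box of the given density."""
    rng = random.Random(seed)
    box = (num_sites / number_density) ** (1.0 / 3.0)
    points = [
        (rng.random() * box, rng.random() * box, rng.random() * box)
        for _ in range(num_sites)
    ]
    return points, box


def radial_distribution(points, box, cutoff, bin_size, divide_by_sites=True):
    """Return the RDF per bin, with or without the extra division by num_sites."""
    assert cutoff < box / 2, "minimum-image convention needs cutoff < box / 2"
    n_bins = int(round(cutoff / bin_size))  # assumes cutoff is a multiple of bin_size
    dist_hist = [0] * n_bins
    for i, (xi, yi, zi) in enumerate(points):
        for xj, yj, zj in points[i + 1:]:
            # Minimum-image separation along each axis.
            dx = (xi - xj + box / 2) % box - box / 2
            dy = (yi - yj + box / 2) % box - box / 2
            dz = (zi - zj + box / 2) % box - box / 2
            d = math.sqrt(dx * dx + dy * dy + dz * dz)
            if d < cutoff:
                # Count both i -> j and j -> i, as a per-site neighbor list would.
                dist_hist[min(int(d / bin_size), n_bins - 1)] += 2
    num_sites = len(points)
    number_density = num_sites / box ** 3
    rdf = []
    for k, count in enumerate(dist_hist):
        r_lo, r_hi = k * bin_size, (k + 1) * bin_size
        shell_vol = 4.0 / 3.0 * math.pi * (r_hi ** 3 - r_lo ** 3)
        value = count / shell_vol / number_density
        if divide_by_sites:
            value /= num_sites
        rdf.append(value)
    return rdf


def tail_average(rdf, bin_size, r_min):
    """Unweighted mean of the RDF over bins that start at or beyond r_min."""
    tail = [g for k, g in enumerate(rdf) if k * bin_size >= r_min]
    return sum(tail) / len(tail)


points, box = random_gas(100, 0.05)
for flag in (True, False):
    rdf = radial_distribution(points, box, 6.0, 0.5, divide_by_sites=flag)
    print("divide_by_sites =", flag, "tail average =", tail_average(rdf, 0.5, 3.0))
```

In this sketch the tail average with the division by the number of sites should stay near 1, up to statistical noise, for different box sizes and bin sizes at the same density. Without that division it should come out near the number of sites instead, which is the supercell dependence described in point (1).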