Anomalous CV score

I’m trying to describe the lateral interactions among epoxy groups on graphene using the cluster expansion (CE) formalism.
Since I found that nearest-neighbour (NN) pairs are unstable in graphene oxide, I’m creating an edited version of the clusters.out file, following the procedure suggested in this thread (), so that every cluster containing an NN epoxy pair is excluded. As a result I cannot use maps and must rely on lsfit to find the ECIs and the predicted energies.

The lat.in file contains two carbon atoms and three (Vac, O) sites. Full oxygen coverage is clearly not feasible, so I’m focusing on low oxygen concentrations. In particular, to build the expansion I provide input structures in which the arrangement of epoxy groups, placed in a large supercell, reproduces one of the clusters used in the expansion.

To extract the correlation matrix, I collect the str.out of all the input structures in a file (allstr.out) and I run the following command:
corrdump -c -cf=clusters.out -s=allstr.out > allcorr.out
where clusters.out is my modified version of the cluster file that excludes all the NN pairs.

To find the best expansion, I run the previous command several times, varying the number of clusters included in clusters.out. However, apart from the first expansion, which contains only one pair, all the others give a MAX_FLOAT CV score, even though the predicted energies per site differ by only about 1e-4 eV from the DFT-calculated energies.

Looking into the code, this could be caused by the CV denominator (1-x*(x’x)^-1*x’) falling below the zero_tolerance, which would mean that the problem originates entirely in the correlation matrix.
I cannot understand what is causing this issue, and I would greatly appreciate your help.

Thank you

A CV score equal to MAXFLOAT (about 1e38) just indicates that the CV score cannot be calculated because there are not enough energy data points for the given number of clusters, perhaps because some structures have collinear correlations.
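
A quick way to check for this is to compare the number of structures, the number of clusters, and the numerical rank of the correlation matrix. The following is only a minimal sketch, not part of ATAT: the function name fit_diagnostics and the tolerance value are hypothetical, and the small matrix is made-up data in which two structures share the same correlations.

```python
import numpy as np

def fit_diagnostics(corr, tol=1e-8):
    """Return (n_structures, n_clusters, rank) for a correlation matrix.

    Rows are structures, columns are clusters. The tolerance is an
    assumed value, not the one used inside lsfit.
    """
    corr = np.atleast_2d(np.asarray(corr, dtype=float))
    n_structures, n_clusters = corr.shape
    rank = int(np.linalg.matrix_rank(corr, tol=tol))
    return n_structures, n_clusters, rank

# Made-up example: 3 structures, 3 clusters, rows 1 and 3 are identical,
# so only 2 of the 3 structures carry independent information.
corr = np.array([[1.0, 0.5, 0.25],
                 [1.0, 0.0, 0.50],
                 [1.0, 0.5, 0.25]])
n_structures, n_clusters, rank = fit_diagnostics(corr)
print(n_structures, n_clusters, rank)
```

If the rank is not larger than the number of clusters, removing one structure leaves the fit underdetermined. For real data, and assuming allcorr.out is a plain whitespace-separated table with one row per structure, the matrix could be read with np.loadtxt("allcorr.out").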

The fact that you get an excellent fit (error ~ 1e-4) suggests to me that you are in a situation where the number of energies is equal to the number of clusters. In this case, the CV score cannot be calculated because as soon as you remove one point, the fit is underdetermined.
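
To illustrate, here is a small sketch of a leave-one-out CV score computed from the least-squares leverages h_i = x_i (x'x)^-1 x_i', i.e. the quantity in the denominator 1 - h_i mentioned in the question. It is a generic textbook formula written with NumPy, not the lsfit source code; the names loo_cv and zero_tol and all the numbers are assumptions chosen for the example.

```python
import numpy as np

def loo_cv(X, y, zero_tol=1e-8):
    """Leave-one-out CV score from leverages.

    Returns (cv, leverage). cv is None when some 1 - h_i is below
    zero_tol, i.e. when the score cannot be computed.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Hat matrix H = X X^+, which equals X (X'X)^-1 X' for full column rank
    H = X @ np.linalg.pinv(X)
    leverage = np.diag(H)
    denom = 1.0 - leverage
    if np.any(denom < zero_tol):
        return None, leverage
    resid = y - H @ y
    cv = float(np.sqrt(np.mean((resid / denom) ** 2)))
    return cv, leverage

# Made-up data: 3 energies and 3 clusters (square, full-rank matrix)
X = np.array([[1.0,  0.5, 0.25],
              [1.0,  0.0, 0.50],
              [1.0, -0.5, 0.00]])
y = np.array([-1.00, -0.80, -0.70])

cv_full, lev_full = loo_cv(X, y)          # as many clusters as energies
cv_small, lev_small = loo_cv(X[:, :2], y)  # one cluster dropped
print(cv_full, lev_full)
print(cv_small, lev_small)
```

With as many clusters as energies, the fit reproduces every energy exactly, every leverage is 1, and the denominator vanishes, so no CV score is returned. Dropping one cluster (or adding energies) brings the leverages below 1 and the score becomes finite.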

Solution:
If you are happy with your cluster expansion for other reasons (e.g., because it gives physically reasonable results) you can ignore the problem. The risk of using so few data points is just that you have no estimate of the out-of-sample error (the error on data not included in the fit).
To eliminate this problem you just need more energies or fewer clusters.