CVs systematically evaluated to MAXFLOAT

Hello,

First of all, I’m not sure whether this is the right subforum for this topic; please feel free to move it elsewhere as appropriate.

We need to construct a truncated cluster expansion capturing the energetics of adsorption patterns on a surface, in the scope of modeling desorption processes. We came to the conclusion that MAPS could be a wonderful tool to achieve this, by treating the adsorption patterns as binary alloys consisting of hydrogen and vacancies.

We thus generated appropriate lat.in, ref_energy.in, and vasp.wrap files for our surface unit cell, and then invoked MAPS via "nohup maps -d -c0=0.0 -c1=0.49 -2d &". We then calculated a stock of adsorption energies as proposed by MAPS, while systematically discarding proposed adsorption patterns with concentrations >= 0.5 by placing an error file in their respective subdirectories. We did this because, for now, we suspect only coverages < 0.5 to be relevant to our desorption model. Although we were surprised to be confronted with structure proposals outside of the ground-state checking range, we considered it safe to discard them in that way. (By now, however, we do wonder whether this might have caused side effects we missed?)
Proceeding this way led us to the following, very peculiar state.

According to the redirected MAPS stdout in nohup.out, a vast majority of cluster expansion truncations yielded a CV score of "3.40282e+38" (MAXFLOAT). We traced this down to the function "Real calc_cv(const Array2d<Real> &x, const Array<Real> &y)", implemented in atat/src/lstsqr.c++. This function returns MAXFLOAT if it finds the denominator to be (too) close to zero for a given structure i and a given weight vector.

This left us puzzled: a vast majority of truncations are thereby discarded for supposedly very bad predictive power, completely independent of the actual differences between the DFT energies and the energies estimated by these particular truncations.
We therefore added some code to MAPS in order to find out which particular structures cause the CV to explode. Deploying the patched MAPS and exhaustively removing all those "offending" structures, we finally arrived at a state without any MAXFLOAT CVs.
Those structures, however, do not seem to have any obvious features in common, e.g. particularly large relaxations compared to the other, non-offending structures.

Therefore, we’d kindly ask how we could verify that the "naive" O(n^2) expression for the CV (paper on MAPS, page 5, first expression) and the O(n) version actually deployed (paper on MAPS, page 5, second expression) really are equivalent, in the hope that this might help us find out whether there is a deeper reason why those structures cause this issue.

So we still wonder: could there be a deeper reason for those structures to produce a zero denominator?
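For reference, the standard (unweighted) leave-one-out identity behind the two expressions on page 5 of the MAPS paper can be written as follows (our notation, not the paper’s; the weighted variant actually used by MAPS differs in details):

```latex
% Naive O(n^2) form: refit without structure s, score its prediction error.
% Fast O(n) form: one full fit, residuals inflated by the leverage h_{ss}.
\[
  (CV)^2 = \frac{1}{n}\sum_{s=1}^{n}\bigl(E_s - \hat{E}_{(s)}\bigr)^2
         = \frac{1}{n}\sum_{s=1}^{n}
           \left(\frac{E_s - \hat{E}_s}{1 - X_s (X^{\top} X)^{-1} X_s^{\top}}\right)^{2}
\]
% Here \hat{E}_{(s)} is the energy of structure s predicted from a fit
% excluding s, \hat{E}_s is predicted from the full fit, and X_s is the
% row of correlations of structure s. A vanishing denominator thus means
% the leverage h_{ss} = X_s (X^T X)^{-1} X_s^T equals 1 for some s.
```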

Please find attached all the files necessary to reproduce the described behavior:

  • the MAPS working directory "prob_repro/", including the "cv_tnt.log" file listing, for each run, the block of offending structures removed at once
  • the altered src/refine.c++ and src/lstsqr.c++ files introducing our additional code into MAPS from the atat-3.04 toolkit.

Compiling and (un)installing should work just as for the regular atat-3.04 toolkit. If there are any further questions, I’ll gladly answer them.

Thank you very much in advance!

There are quite a few items to discuss:

  • Maps sometimes generates structures outside of the specified concentration range if the current cluster expansion predicts a ground state that would form a two-phase equilibrium with ground states that ARE included within the specified concentration range. This prevents surprises when you run Monte Carlo later! Often it is worth letting it calculate one of those outside structures, just to ensure nothing strange will happen (after which maps may leave you alone about those outside structures).
  • The large CV score is "more a feature than a bug". It signals that when some structure is removed from the fit, the least-squares problem becomes underdetermined due to colinearities in the regressors (the correlations). This problem gets worse when a lot of structures are manually excluded.
  • There is nothing wrong with the structure(s) that cause(s) the CV score to blow up. It is due more to the structures that are left than to the one that is removed. It is also a function of the correlations that are included in the fit. So it is not a good idea to exclude the structures that cause the CV to blow up.
  • I ran your example (thank you for sending it!) and even though a lot of cluster choices are ruled out, the end result seems not bad: a CV score of 0.014 when your overall energy range is about 0.1 eV is not bad.
  • Just to be sure that the code is not unfairly excluding triplets etc., you can try to run a cluster expansion "manually": use

corrdump -clus -2=pair_range -3=triplet_range [etc.]

to select your clusters, then do

clusterexpand energy

and

getclus energy.eci

to see if there are ECIs that seem important. Note that this code is tolerant to colinearities: it will just report 0 for an ECI that is colinear with other ones (of course, it has to pick one to exclude; it picks the last one).

  • it may be worth trying to include some of the structures you excluded: pick the one that has the lowest predicted energy relative to the convex hull (see file predstr.out).

I hope this is helpful!

Thank you very much for your very fast and detailed answer! I’m sure your suggestions will be helpful to me, and I’ll have a close look into all of them.

One thing I’m wondering right now is whether there might be a trick to improve the chances of also getting the ground-state hierarchy of the expansion right. While you are certainly right that the CV score is already not bad, the true and predicted ground states don’t agree, which is something that still concerns me.

However, thank you very much once again.

The fact that the true and predicted ground states don’t agree could be due to a predicted GS outside the [0, 0.5] concentration range. If not, keep in mind that in systems where the GS line is very complicated (with a lot of competing structures with similar energies), the system tends to disorder at low temperature, so the extra modeling of the GS line is not necessary.

Hello,

by this time I had a look into your suggestions with the following outcomes:

  • As soon as I (re)add the structure of lowest predicted energy relative to the ground-state hull (which would be structure 225), I immediately run into the CV-score problem again.

  • Concerning running a cluster expansion "manually", I was able to follow the suggested procedure for the example data provided. To be exact, I

  • started from the truncation with a CV score of approximately 0.014 which MAPS finds for the "all-included" data set,

  • added 14 additional clusters via "corrdump -2=7.95 -3=2.62 -4=2.62" (this corresponds to additional clusters with diameters up to and including third-nearest neighbors),

  • calculated the regression with the resulting 51x39 regression matrix by invoking "clusterexpand energy".

The corresponding output of "getclus energy.eci", as attached to this post, then shows occasional three- and four-body ECIs that deviate from zero and are about as large as some two-body ECIs.

If my interpretation of these findings is correct, this might indeed indicate some kind of exclusion of triplets etc., right?
So it seems to me that a natural question to ask at this point would be whether there is a way to obtain a CV score (maybe by implementing the naive O(n^2) version?), or even a different measure of predictive power, for those problematic truncations?

Before concluding that the triplets are important, I would suggest running with fewer pairs and the current number of triplets to see what you get.
I am quite sure the O(n^2) vs. O(n) algorithms have nothing to do with this behavior.
Clearly there are colinearities in the triplet correlations, because the code reports ‘0’ for some of the ECIs (it can’t determine them).
If you send me a tar with

lat.in */str.out */energy

I may be able to help better.

Thank you very much for your suggestions, your help is very appreciated.

Concerning the lat.in, */str.out, and */energy files, those are already included in the prob_repro.tar.gz, attached to my first post.

In order to test at once the influence of the different CV-score expressions (O(n) vs. O(n^2)) and the suspected exclusion of higher-order clusters like triplets etc., I actually implemented the O(n^2) expression for the CV score.
I then started two MAPS runs with the different CV-score expressions, using the DFT data already provided.
This resulted in the following outcomes:

  • fast, O(n) CV-score estimation (detailed information in attached fast-cv.tar.gz):
      • 25 ECIs, all pairwise, up to ~7.9 Angstrom
      • true ground states differ from fitted ground states
      • CV = 0.0138175

  • slow, O(n^2) CV-score estimation (detailed information in attached slow-cv.tar.gz):
      • 43 ECIs (including 6 ECIs equal to 0), all pairwise, up to ~11.3 Angstrom
      • true and predicted ground states agree
      • CV = 0.0241419

I also did corresponding MMAPS runs with the GA provided in the contributions section of this forum, to find out to what extent the original approach of MAPS to building truncations might explain this somewhat suspicious ubiquity of long pairwise clusters.
The outcomes were:

  • fast, O(n) CV-score estimation:
      • 24 ECIs, including 3 triplets
      • true ground states differ from fitted ground states
      • CV = 0.00529884

  • slow, O(n^2) CV-score estimation:
      • 29 ECIs (including 2 ECIs equal to 0), including 3 triplets
      • true and predicted ground states agree
      • CV = 0.0056776

So, to me it looks like you were completely right: the exploding CV scores do not effectively exclude higher-order clusters like triplets etc., and are thus not what provokes the suspicious ubiquity of pair interactions.
However, there does seem to be a certain impact of the choice of CV-score measure on the outcome of runs on the same data, so I think my questions at the moment would be:

  • How can we determine what is the best cluster expansion among those we found?
  • And, probably closely linked to the question above: what exactly do the exploding CV scores signal? I am aware that there are problems inverting the matrix [X^T * X] when there is multicollinearity between columns (clusters), or when the system is underdetermined. But this matrix inversion has to happen anyway, even if the actual CV score is subsequently set to MAXFLOAT. I guess I somehow fail to see what the vanishing denominator descriptively tells me.
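One descriptive reading of the vanishing denominator (our notation; the weighted case is analogous): the quantity in the denominator is 1 - h_ss, where h_ss is the leverage of structure s in the fit,

```latex
\[
  h_{ss} = X_s (X^{\top} X)^{-1} X_s^{\top},
  \qquad 0 \le h_{ss} \le 1,
  \qquad \sum_{s=1}^{n} h_{ss} = \operatorname{rank}(X).
\]
% h_{ss} = 1 means the fitted energy of structure s is pinned entirely by
% its own data point: no combination of the remaining structures'
% correlation rows reproduces X_s, so leaving s out makes the fit
% underdetermined along that direction, and the leave-one-out prediction
% error is formally infinite.
```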

Again, thank you very much for helping along so persistently and quickly.

Sorry for the delay (grant reports due…)
When you say that you have implemented the O(n^2) algorithm, are you sure you have implemented the same weighted CV-score routine as maps?
The weighting is turned on when the ground states are difficult to fit. This feature penalizes the CV contribution of structures at or close to the ground state, so that they are fitted better.
There is also a tolerance (set with the -z option) that determines when columns are considered colinear, so you might want to decrease that value (default 1e-4) to see what happens.
So your results can be driven by a lot of factors other than the O(n) algorithm.

I’ll double check if my O(n) code has a bug using your earlier data, but if you have any idea where it could be, please share it with me.

Hello,

What I do to implement the O(n^2) CV algorithm is simply:

  • generate a new OLS problem by removing the s-th row from the X matrix (-> new_x), as well as the s-th element from the y (-> new_y) and weight (-> new_weight) vectors,
  • call calc_ols(&sltn,new_x,new_y,new_weight,1) to fit each of these systems,
  • call product(&sltn_e,x,sltn) to calculate the energies predicted by the current solution (obtained as described in the previous bullet point),
  • add the contribution from the left-out s-th structure to the CV score according to cv += sqr( weight(s)*( y(s)-sltn_e(s) ) ).

The output generated this way reproduces the same CV score as the O(n) code for every truncation (provided it is not rated MAXFLOAT by the O(n) code, of course), within each of the overall 18 weighting iterations.
For example, spot-checking the output by grepping for a certain non-MAXFLOAT truncation gives:


  • 2 7 0 0 0 0.0385158=O(n)CV, 0.0385158=O(n^2)CV
  • 2 7 0 0 0 0.0517129=O(n)CV, 0.0517129=O(n^2)CV
  • 2 7 0 0 0 0.0630004=O(n)CV, 0.0630004=O(n^2)CV
  • 2 7 0 0 0 0.0735024=O(n)CV, 0.0735024=O(n^2)CV
  • 2 7 0 0 0 0.0791119=O(n)CV, 0.0791119=O(n^2)CV
  • 2 7 0 0 0 0.0859592=O(n)CV, 0.0859592=O(n^2)CV
  • 2 7 0 0 0 0.10329=O(n)CV, 0.10329=O(n^2)CV
  • 2 7 0 0 0 0.103233=O(n)CV, 0.103233=O(n^2)CV
  • 2 7 0 0 0 0.103141=O(n)CV, 0.103141=O(n^2)CV
  • 2 7 0 0 0 0.110676=O(n)CV, 0.110676=O(n^2)CV
  • 2 7 0 0 0 0.11053=O(n)CV, 0.11053=O(n^2)CV
  • 2 7 0 0 0 0.141291=O(n)CV, 0.141291=O(n^2)CV
  • 2 7 0 0 0 0.151886=O(n)CV, 0.151886=O(n^2)CV
  • 2 7 0 0 0 0.178677=O(n)CV, 0.178677=O(n^2)CV
  • 2 7 0 0 0 0.195865=O(n)CV, 0.195865=O(n^2)CV
  • 2 7 0 0 0 0.238019=O(n)CV, 0.238019=O(n^2)CV
  • 2 7 0 0 0 0.274818=O(n)CV, 0.274818=O(n^2)CV
  • 2 7 0 0 0 0.310371=O(n)CV, 0.310371=O(n^2)CV

My conclusion therefore would be that there is no discrepancy between the weighting deployed in my O(n^2) implementation and MAPS’ original O(n) implementation.

Decreasing the zero_tolerance parameter from 1e-3 by supplying -z=1e-4 or -z=1e-5 does not change the outcome of a maps run on the given data.

Thanks: this is a very useful description of what is going on.
I see now that the two algorithms differ: when you call

calc_ols( … ,1)

the "1" means "remove colinear columns before doing OLS" at each step, so your version is doing something beyond the usual CV formula.
This colinearity patch is not done in the O(n) implementation (which gives MAXFLOAT when there are colinearities). I am not sure how to fix this easily… I’ll think about it.