Early cluster choice-loop termination in MAPS?

Hello,

I currently find myself altering some code within MAPS, mainly for testing purposes. I noticed the lines


if (cv==MAXFLOAT) {
  Array<int> cur_choice;
  get_cluster_choice(&cur_choice);
  int i=2;
  for ( ; i<cur_choice.get_size(); i++) {
    if (cur_choice(i)!=0) break;
  }
  if (i==cur_choice.get_size()) break;
}

within the loop iterating over all cluster choices (those to be tested according to MAPS’ hierarchical selection rules). For a given test set of (two-dimensional) structures, this code causes an early break out of that loop at a cluster choice consisting of only 2 point and 29 pair clusters.

I wonder if someone could explain to me the rationale behind this? I don’t see why it is clear - at that point - that there is no use in looking at further cluster choices.
Might this have something to do with guaranteed collinearities arising from the lengths of the largest pairs in conjunction with the dimensions of the structures in this particular test set? Or am I completely mistaken here?

As this left me quite puzzled for some time now, any suggestions would be greatly appreciated.

Thank you very much in advance and best regards
christopher

The rationale is that if a cluster choice yields cv==MAXFLOAT, it means that there are not enough structures to reliably fit that number of clusters. Any cluster choice including those clusters as well as other clusters will yield the same result, hence the loop can be aborted.
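In linear-algebra terms, one way to see why adding clusters cannot cure the problem: the quantity whose value 1 triggers MAXFLOAT, h_i = X_i * inverse(transpose(X)*X) * transpose(X_i), is a diagonal entry of the orthogonal projection onto the column space of X, so it can only grow (never shrink) when columns are added to X. A pure-Python sketch of this - not MAPS code; the toy correlation matrices and function names are invented - using exact rational arithmetic:

```python
from fractions import Fraction

def transpose(A):
    return [list(r) for r in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def inverse(M):
    # Gauss-Jordan elimination in exact rational arithmetic
    n = len(M)
    A = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(M)]
    for c in range(n):
        p = next(r for r in range(c, n) if A[r][c] != 0)
        A[c], A[p] = A[p], A[c]
        A[c] = [x / A[c][c] for x in A[c]]
        for r in range(n):
            if r != c and A[r][c] != 0:
                f = A[r][c]
                A[r] = [x - f * y for x, y in zip(A[r], A[c])]
    return [row[n:] for row in A]

def leverage(X, i):
    # h_i = X_i (X^T X)^{-1} X_i^T : the term whose value 1 triggers MAXFLOAT
    G = inverse(matmul(transpose(X), X))
    return matmul(matmul([X[i]], G), transpose([X[i]]))[0][0]

# toy correlation matrices: 4 structures; structure 3 is the only one
# that probes the second cluster
X_small = [[1, 0], [1, 0], [1, 0], [0, 1]]              # point + pair choice
X_big = [[1, 0, 1], [1, 0, 2], [1, 0, 3], [0, 1, 0]]    # same choice + one more cluster

assert all(leverage(X_small, i) <= leverage(X_big, i) for i in range(4))
print(leverage(X_small, 3), leverage(X_big, 3))  # 1 1 : still 1 after adding a column
```

Since h_i <= 1 always holds, the monotonicity means that once a smaller cluster choice reaches h_i = 1, every superset of it keeps h_i = 1.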

Thank you very much for your rapid response!

I thought about your answer, but I am afraid I did not completely get your point.

First of all, I think I don’t quite understand what

> there are not enough structures to reliably fit that number of clusters

actually, precisely, means.
At first, I figured this means there are (multi-)collinearities among the clusters within the current set of structures. This also would have matched my interpretation of

> Any cluster choice including those clusters as well as other clusters will yield the same result

as those collinearities couldn’t possibly disappear just by adding more clusters. However, I realized that possible collinearities are properly dealt with by removing linearly dependent columns from the correlation matrix before calculating the CV scores.
Also, looking at the denominator of the CV score expression, I realized the correlation matrix X actually does change due to additional, non-collinear cluster columns - how is it clear that the product still evaluates to 1 for at least one structure i?

Another point I stumbled over is: wouldn’t your argument also hold for every other cluster choice that evaluates to cv = MAXFLOAT?
For example, suppose there is a cluster choice that leads to cv = MAXFLOAT while including n triple clusters. Wouldn’t that also mean that there is no point in looking at cluster choices including n or more triple clusters?

I hope I’m not annoying you already, it just would be great for me to finally get this right.

Best regards
christopher

Sorry, my answer was not clear. Actually, the piece of code checks whether the cluster expansion contains only point and pair clusters when cv==MAXFLOAT occurs. In that case (and only in that case) can we conclude that all future cluster choices will include those pair clusters as a subset. Clearer?

Thank you very much for your clarification. I’m afraid I haven’t made my point clear enough though, sorry. So let me try it in the following way:

I think I already got the following part of the argument:
Every cluster choice will include at least as many pair clusters as the choices tried before it. Thus, a cluster choice containing only point and pair clusters will be a subset of every future cluster choice to be tried.
So, if such a point-and-pair cluster choice evaluates to MAXFLOAT, all future choices are expected to evaluate to MAXFLOAT as well, as they contain the former, first MAXFLOAT choice as a subset.
I hope that is, roughly, the reasoning, right?

Yet, what I didn’t get is the following:
"Why" does the MAXFLOAT CV arise?
Clearly, if I just look at the denominator in the CV expression, I can see that it vanishes when X_i * inverse(transpose(X)*X) * transpose(X_i) = 1, where X_i is the i-th row of the matrix of all correlations X (containing all cluster correlations of the i-th structure).
However, I now fail to connect this with the statement

> there are not enough structures to reliably fit that number of clusters
There are two possible explanations I could think of:

  • Too many clusters for the current set of structures. This is easily ruled out, as MAPS for logical reasons ensures the number of structures is larger than the number of clusters.
  • (Multi-)collinearities among the clusters of the current choice within the given structure set. However, I think I can rule that out too: the implemented logic ensures that the matrix X above has full column rank when passed to the CV evaluation. So, by definition, looking at X at the time the CV is evaluated, we could not even tell whether the current cluster choice had any collinearities in the first place, as they would have been removed by then anyhow.

Also, I don’t understand why we can be sure that X_i * inverse(transpose(X)*X) * transpose(X_i) = 1 for subsequent cluster choices, as X does change when additional, non-collinear clusters are added. This means I don’t see how

> Any cluster choice including those clusters as well as other clusters will yield the same result

could actually be guaranteed.

So, in essence, I think I simply fail to see the representation of

> there are not enough structures to reliably fit that number of clusters

and

> Any cluster choice including those clusters as well as other clusters will yield the same result

in terms of linear algebra or, more specifically, of ordinary least-squares fitting and CV score evaluation. Could you maybe give me a hint in that regard?

Thank you very much in advance, and best regards
christopher

Collinearity can occur at two stages. I am sure you know some of this, but I am trying to write a self-contained explanation.

  1. When all structures are included, maps checks for collinearity in the current cluster choice (in the search loop) and eliminates collinear columns. You can see when this happens by the fact that some ECI are 0.
  2. During the calculation of the CV score, one structure at a time is removed from the fit to calculate the prediction error for that structure (the code actually has an optimized way to do this, to avoid recalculating all matrices for each eliminated structure). It can happen that a collinearity appears only when the code removes one structure from the fit. This is when the MAXFLOAT result for the CV comes up. (In the efficient formula for the CV, this situation shows up as a zero denominator.)
    Of course, it would be very computationally demanding to check for this, which is why it is not done. Instead, I assigned cv=MAXFLOAT to give preference to cluster choices where this does not happen.
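To make point 2 concrete, here is a small pure-Python sketch - not the MAPS implementation; the toy data and function names are invented - of the efficient leave-one-out formula e_i / (1 - h_i) with h_i = X_i * inverse(transpose(X)*X) * transpose(X_i). Removing structure 2 below makes the reduced design collinear, so h_2 is exactly 1 and the denominator vanishes, which is the situation flagged as cv=MAXFLOAT:

```python
from fractions import Fraction

def transpose(A):
    return [list(r) for r in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def inverse(M):
    # Gauss-Jordan elimination in exact rational arithmetic
    n = len(M)
    A = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(M)]
    for c in range(n):
        p = next(r for r in range(c, n) if A[r][c] != 0)
        A[c], A[p] = A[p], A[c]
        A[c] = [x / A[c][c] for x in A[c]]
        for r in range(n):
            if r != c and A[r][c] != 0:
                f = A[r][c]
                A[r] = [x - f * y for x, y in zip(A[r], A[c])]
    return [row[n:] for row in A]

def ols(X, y):
    # least-squares coefficients beta = (X^T X)^{-1} X^T y
    G = inverse(matmul(transpose(X), X))
    return [r[0] for r in matmul(matmul(G, transpose(X)), [[v] for v in y])]

def loo_errors(X, y):
    """Leave-one-out prediction errors via the efficient formula e_i / (1 - h_i)."""
    G = inverse(matmul(transpose(X), X))
    beta = ols(X, y)
    errs = []
    for xi, yi in zip(X, y):
        h = matmul(matmul([xi], G), transpose([xi]))[0][0]
        e = yi - sum(a * b for a, b in zip(xi, beta))
        errs.append(float("inf") if h == 1 else e / (1 - h))  # inf ~ MAXFLOAT
    return errs

X = [[1, 0], [1, 0], [0, 1]]  # structure 2 alone probes the second cluster
y = [Fraction(1), Fraction(2), Fraction(5)]
errs = loo_errors(X, y)
# cross-check structure 0 against an explicit refit without it
beta_wo0 = ols(X[1:], y[1:])
print(errs[0], y[0] - beta_wo0[0])  # -1 -1 : the two routes agree
print(errs[2])                      # inf : removing structure 2 breaks the fit
```

Note that X itself has full column rank, so the collinearity check of point 1 passes; the singularity only appears once row 2 is left out.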

This being said, I would be curious to hear if you get better performance by fixing this problem in a different way. For instance, perhaps just skipping the structure removals that lead to colinearity in the cv score calculation may be a way to go?
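For what it is worth, the "skip the offending structure removals" variant suggested above might look roughly like this - a sketch only, assuming the score is a root-mean-square of the leave-one-out errors; `cv_skipping` and the toy data are invented, not MAPS code:

```python
import math
from fractions import Fraction

def transpose(A):
    return [list(r) for r in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def inverse(M):
    # Gauss-Jordan elimination in exact rational arithmetic
    n = len(M)
    A = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(M)]
    for c in range(n):
        p = next(r for r in range(c, n) if A[r][c] != 0)
        A[c], A[p] = A[p], A[c]
        A[c] = [x / A[c][c] for x in A[c]]
        for r in range(n):
            if r != c and A[r][c] != 0:
                f = A[r][c]
                A[r] = [x - f * y for x, y in zip(A[r], A[c])]
    return [row[n:] for row in A]

def cv_skipping(X, y):
    """RMS leave-one-out error that skips structures whose removal makes the
    reduced fit collinear (leverage h_i == 1, i.e. a zero denominator)."""
    G = inverse(matmul(transpose(X), X))
    beta = [r[0] for r in matmul(matmul(G, transpose(X)), [[v] for v in y])]
    total, kept = Fraction(0), 0
    for xi, yi in zip(X, y):
        h = matmul(matmul([xi], G), transpose([xi]))[0][0]
        if h == 1:
            continue  # this removal would give cv = MAXFLOAT; leave it out instead
        e = yi - sum(a * b for a, b in zip(xi, beta))
        total += (e / (1 - h)) ** 2
        kept += 1
    return math.sqrt(total / kept)

X = [[1, 0], [1, 0], [0, 1]]
y = [Fraction(1), Fraction(2), Fraction(5)]
print(cv_skipping(X, y))  # 1.0 : finite, although removing structure 2 is singular
```

Whether a score averaged over fewer structures remains comparable across cluster choices is, of course, a separate question.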

Thank you very much, your reply was extremely helpful to me!

So basically, I can rest assured now that, when there are collinearities in the reduced fits,

> This is when the MAXFLOAT result for the CV comes up.

And, of course,

> Any cluster choice including those clusters as well as other clusters will yield the same result
I wonder though, would you know of any source that actually proves this? It’s not that I don’t believe it is the case; it’s just that I myself don’t see how it is guaranteed from looking at the expression in the denominator, and I have also failed to find such a source so far.
So I’m a bit suspicious whether it might actually be really obvious and I just keep failing to see it …

After all, I ran some tests to convince myself that this is what happens when CV = MAXFLOAT. And indeed, most of the time a zero denominator in the efficient formula for a given offending structure coincided with a reduced column rank of the correlation matrix with the offending structure’s row left out.
However, occasionally there were zero-denominator evaluations where the corresponding reduced matrix did not show an additional collinearity. So it seems to me that, aside from additional collinearities in the reduced matrices, there is at least one other source of MAXFLOAT CVs.
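In exact arithmetic the two conditions should coincide: by the matrix determinant lemma, 1 - X_i * inverse(transpose(X)*X) * transpose(X_i) equals det of the reduced normal matrix divided by det(transpose(X)*X), so the denominator vanishes exactly when deleting row i drops the column rank. The occasional mismatches are therefore presumably floating-point effects (a nearly singular reduced matrix). A sketch of the check in exact rationals - not MAPS code; toy data invented:

```python
from fractions import Fraction

def transpose(A):
    return [list(r) for r in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def inverse(M):
    # Gauss-Jordan elimination in exact rational arithmetic
    n = len(M)
    A = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(M)]
    for c in range(n):
        p = next(r for r in range(c, n) if A[r][c] != 0)
        A[c], A[p] = A[p], A[c]
        A[c] = [x / A[c][c] for x in A[c]]
        for r in range(n):
            if r != c and A[r][c] != 0:
                f = A[r][c]
                A[r] = [x - f * y for x, y in zip(A[r], A[c])]
    return [row[n:] for row in A]

def rank(M):
    # column rank via exact row reduction
    A = [[Fraction(x) for x in row] for row in M]
    r = 0
    for c in range(len(A[0])):
        p = next((i for i in range(r, len(A)) if A[i][c] != 0), None)
        if p is None:
            continue
        A[r], A[p] = A[p], A[r]
        A[r] = [x / A[r][c] for x in A[r]]
        for i in range(len(A)):
            if i != r and A[i][c] != 0:
                f = A[i][c]
                A[i] = [x - f * y for x, y in zip(A[i], A[r])]
        r += 1
        if r == len(A):
            break
    return r

def zero_denominator(X, i):
    # 1 - h_i == 0 in the efficient CV formula
    G = inverse(matmul(transpose(X), X))
    return matmul(matmul([X[i]], G), transpose([X[i]]))[0][0] == 1

def removal_drops_rank(X, i):
    return rank([row for j, row in enumerate(X) if j != i]) < len(X[0])

X = [[1, 0], [1, 0], [0, 1]]
for i in range(len(X)):
    assert zero_denominator(X, i) == removal_drops_rank(X, i)
print([zero_denominator(X, i) for i in range(len(X))])  # [False, False, True]
```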

Considering what to do alternatively, I think I’ll go with the efficient formula and switch to the inefficient one only for those structures that evaluate to MAXFLOAT.
Or is there a reason not to do so, aside from the runtime penalty?

Best regards
christopher