Clarification about the retrieved “predicted_folds” and the “best_models” when running a benchmark in automatminer

Jorge_Alonso_Delgado · September 7, 2020, 8:20am

Hello everyone,

I would like to raise a few questions that came up while trying to understand the benchmark method in automatminer, along with the “predicted_folds” list and the “best_models” dictionary it produces.

As a case study, I’m running a benchmark (debug preset, kfold=5). At the end I use the predictions in predicted_folds to calculate the RMSE for each fold:

pipe.learner.best_pipeline:

Pipeline(memory=Memory(location=/tmp/tmpwgh7yz9_/joblib),
steps=[(‘selectpercentile’, SelectPercentile(percentile=57,…)
(‘maxabsscaler’, MaxAbsScaler(copy=True)),
(‘extratreesregressor’,…)

I understand that the “best_pipeline” corresponds to the “global best pipeline”, which is the one that provides the lowest test fold error. In this example, the “best_pipeline” is therefore a pipeline identified during fold[4].

Here is my first question: are the predictions of the other folds (0, 1, 2, 3) obtained by applying other “local best pipelines”, each found in the corresponding fold[n]? If so, those “local best pipelines” might use different ML algorithms (not necessarily etr, but others). In that case, it makes no sense to average the errors of the five folds, because each fold error is calculated using a different pipeline.

My second question is related to the dictionary obtained in “pipe.learner.best_models”. In the present example,

pipe.learner.best_models:

OrderedDict([(‘ExtraTreesRegressor’, -6.724502205567023),
(‘RandomForestRegressor’, -6.959273286928392),
(‘DecisionTreeRegressor’, -7.233240134394501),
(‘ElasticNetCV’, -7.334800528561516),
(‘LassoLarsCV’, -7.429971666602927),
(‘GradientBoostingRegressor’, -7.568739134833447)])

According to this dictionary the “greater_score_is_better” (in this case, ExtraTreesRegressor (etr) is the best model). This agrees with the fact that the best_pipeline also involves etr as the ML-algorithm.

My question is: how are the “pipe.learner.best_models” scores obtained? Are they accumulated during autoML (in the internal loop of each fold), or are they cumulative scores obtained during the prediction step at the end of each fold (in the outer loop)?
Is it correct to use these scores to compare different pipelines (e.g. ones obtained with other presets, following “greater_score_is_better”)?

I have these doubts because I’ve noticed that after MatPipe is fitted (at the end of each fold), the subsequent prediction is very fast. I was wondering whether only “the best local pipeline” is used for predictions, or whether several “best local pipelines” (one for each ML algorithm) are tested and an averaged or unified score is finally stored in “best_models”.

Below you can see a scheme I’ve prepared to illustrate what I understand is happening during the benchmark method in automatminer:

I would really appreciate any clarification or correction of what I’m presenting here. I’ve been struggling to collect this information from the automatminer/TPOT documentation but haven’t managed to locate these details.

Regards,

Jorge

ardunn · September 12, 2020, 7:45pm

Hey Jorge,

Your first point is correct: each fold has its own local best pipeline. best_pipeline and best_models only reflect the last fitted fold, not all folds.

Philosophically, this is because a pipeline transformer can only have one state; that state determines exactly what it will do when fit/transform is called. Re-fitting thus voids the state.

The conclusion that it makes no sense to average the fold errors is incorrect. We are evaluating the algorithm, that is, the process by which we go from training data to inference. Whether the models are the same is irrelevant, since in AutoML we treat the selection of model type as a kind of internal hyperparameter search done only on internal validation data.

In other words, our “pipeline” is the Automatminer fitting process, and the underlying models are Automatminer’s “hyperparameters”, tuned on internal training data only. So while we can’t average the fold scores and use them to say “An extra trees model would get a mean NCV RMSE of 9.1”, we can say “The Automatminer model got a mean NCV RMSE of 9.1; it frequently selects an extra trees model.”

You can also compare these scores to other models (e.g., other Automatminer presets, graphnets, etc.) as long as they use the identical NCV procedure. Whatever an algorithm does within a training fold to select its final model, whether it is a graph network, Automatminer, or some more traditional ML model, is not relevant to the final NCV score, as long as the final model for that fold is determined solely from training (not test) data.
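Here is a toy sketch of that idea. It is not Automatminer's or TPOT's actual code: the two stand-in "model types" (`fit_mean`, `fit_line`), the helper names, and the synthetic data are all made up for illustration. The inner loop picks a model type using training data only, and the outer loop scores whatever was picked on the held-out fold, so the averaged outer RMSE describes the whole selection procedure rather than any single model type.

```python
import random
from statistics import mean


def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)) ** 0.5


# Hypothetical stand-in "model types"; each fit function returns a predict function.
def fit_mean(xs, ys):
    m = mean(ys)
    return lambda x: m


def fit_line(xs, ys):
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)


CANDIDATES = {"mean": fit_mean, "line": fit_line}


def kfold_indices(n, k):
    """Return k interleaved lists of held-out indices covering range(n)."""
    idx = list(range(n))
    return [idx[i::k] for i in range(k)]


def select_model(xs, ys, k_inner=3):
    """Inner loop: choose the model type with the lowest mean validation RMSE.

    Only the data passed in (the outer training fold) is ever used here.
    """
    scores = {}
    for name, fit in CANDIDATES.items():
        fold_scores = []
        for val in kfold_indices(len(xs), k_inner):
            held_out = set(val)
            tr = [i for i in range(len(xs)) if i not in held_out]
            predict = fit([xs[i] for i in tr], [ys[i] for i in tr])
            fold_scores.append(
                rmse([ys[i] for i in val], [predict(xs[i]) for i in val])
            )
        scores[name] = mean(fold_scores)
    best = min(scores, key=scores.get)
    # Refit the winning model type on the full outer training fold.
    return best, CANDIDATES[best](xs, ys)


def nested_cv(xs, ys, k_outer=5):
    """Outer loop: score the selected model of each fold on its unseen test fold."""
    results = []
    for test in kfold_indices(len(xs), k_outer):
        held_out = set(test)
        train = [i for i in range(len(xs)) if i not in held_out]
        name, predict = select_model([xs[i] for i in train], [ys[i] for i in train])
        err = rmse([ys[i] for i in test], [predict(xs[i]) for i in test])
        results.append((name, err))
    return results


random.seed(0)
xs = [float(i) for i in range(40)]
ys = [2.0 * x + random.gauss(0, 1) for x in xs]  # synthetic data, illustration only

results = nested_cv(xs, ys)
mean_rmse = mean(err for _, err in results)
for fold, (name, err) in enumerate(results):
    print(f"fold {fold}: selected {name!r}, test RMSE {err:.3f}")
print(f"mean NCV RMSE of the selection procedure: {mean_rmse:.3f}")
```

The selected model type is free to differ from fold to fold. What stays fixed across folds is the selection procedure, and that is what the mean score is attached to.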

The best models are determined only at training time by TPOT. They are ranked according to the internal validation score within a single fold (in your figure, the data on the lower left). The only model that ever sees the test fold is best_pipeline, determined during training time for a single fold. Important note: the “best models” might not generalize well to the test fold if the algorithm doesn’t evolve well.

Your question does bring up an interesting point; it might be worthwhile to accumulate these models over the folds and store them somewhere. Or maybe to make benchmark a separate class/function where the accumulated matpipes can be held. I’ll make an issue for it on the repo.

Is it correct to use the “best_models” scores for comparisons? No, absolutely not. Those are internal validation scores. They absolutely cannot be used for estimating generalization performance.

Is it correct to use the RMSE averaged across all test (external) folds? Yes. As explained above, this is essentially the point of NCV.

Also, something unrelated to your question but I noticed anyway:

If those RMSE numbers vary so much across folds, either (a) the dataset is small, meaning that important samples are likely to be excluded from training by random chance, leading to some folds with much higher errors, or (b) the models were not given enough time to converge. From your description, it seems like the second case, since you were using the debug preset. Typically, AMM performance varies much less across folds if you are not using debug.
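One quick way to put a number on that spread is to look at the standard deviation of the per-fold RMSEs relative to their mean. This is only a sketch: `fold_spread` is a hypothetical helper name, and the RMSE values below are made up for illustration, not taken from your run.

```python
from statistics import mean, stdev


def fold_spread(fold_rmses):
    """Return (mean, sample standard deviation, relative spread) of per-fold RMSEs."""
    mu = mean(fold_rmses)
    sd = stdev(fold_rmses)
    return mu, sd, sd / mu


# Made-up per-fold RMSEs, for illustration only.
example_rmses = [8.0, 9.0, 10.0, 9.0, 9.0]
mu, sd, rel = fold_spread(example_rmses)
print(f"mean RMSE = {mu:.2f}, std = {sd:.2f}, relative spread = {rel:.1%}")
```

Comparing this relative spread between the debug preset and a longer-running preset on the same folds would give a rough indication of whether convergence time is the cause.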

Jorge_Alonso_Delgado · September 18, 2020, 9:43am

Alex!

Thank you so much for the explanations; everything is certainly clearer now. Based on them I was able to identify internal issues and make some important progress this week. I’m still running some tests, so I may post here again to clarify a couple more things.

Jorge

P.S. Regarding accumulating the “best_models” over the folds and storing them somewhere, I totally agree; that would be interesting to consider in the future (I will follow related issues in the GitHub repo).