Hi everybody,
Several months after this topic was created, here is an update and a possible partial solution for ValueErrors of the type:
ValueError: Unsupported set of arguments:
ValueError: x and y arrays must have at least 2 entries
ValueError: X contains negative values
ValueError: Found array with 0 feature(s)
when running benchmark methods in automatminer.
The final ValueError appeared to occur at random; however, in all cases there were recurrent warnings of the following type recorded in the log file:
_pre_test decorator: _random_mutation_operator: num_test=0 Found array with 0 feature(s)
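As a quick way to check a run for these warnings, I count them directly in the log file. A minimal sketch, assuming the run writes to automatminer's default log file, "automatminer.log", in the working directory:

with open("automatminer.log") as f:
    alarms = [line for line in f if "_pre_test decorator" in line]
print(len(alarms), "'_pre_test decorator' warnings found")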
As @Krkaufma also mentioned in a related post, I started to suspect that specific operators included in the default TPOT configuration were responsible for those random crashes, and that if they could be identified, we might succeed in completing AutoML with the express, production, or heavy presets.
My approach was to create a customized pipeline, focusing on the learner (config["learner"]), where the kwargs passed into TPOT came from a customized config_dict. This dictionary started with a few basic operators (e.g., only one regressor, one preprocessor, and one reducer) and then included operators progressively, checking the stability of the run after every inclusion (my criterion was the absence of the "_pre_test decorator: ..." warning that used to appear during the first minutes of each run).
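To make the starting point concrete, here is an illustrative sketch of the kind of minimal config_dict I began with; the specific operators are just an example (any single regressor, preprocessor, and reducer will do):

# Minimal starting search space: one regressor, one preprocessor, one reducer.
minimal_config = {
    'sklearn.ensemble.RandomForestRegressor': {'n_estimators': [100, 500]},
    'sklearn.preprocessing.StandardScaler': {},
    'sklearn.feature_selection.VarianceThreshold': {'threshold': [0.0001, 0.001, 0.01]},
}
# Operators from the preset are then added back one at a time, re-running
# and watching the log for "_pre_test decorator: ..." warnings.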
Proceeding this way, I found that out of the 27 operators included in the default express (production or heavy) preset, there were 3 operators whose presence resulted in the appearance of the "_pre_test decorator" warning, while the other 24 were stable and their runs completed successfully. The 3 problematic operators were the following:
'sklearn.svm.LinearSVR'
'sklearn.feature_selection.SelectFwe'
'sklearn.feature_selection.SelectFromModel'
That's it! Excluding these three operators (see the sketch below) worked for me, but further adjustment of the operator set might be necessary depending on the specific dataset.
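For convenience, the exclusion can also be done programmatically. A minimal sketch, assuming TPOT exposes its default regression search space as tpot.config.regressor_config_dict (note that automatminer's presets ship their own TPOT config, so the exact operator set may differ slightly):

from copy import deepcopy
from tpot.config import regressor_config_dict

# Copy TPOT's default regressor search space and drop the three operators
# that triggered the "_pre_test decorator" warnings in my runs.
stable_config = deepcopy(regressor_config_dict)
for op in ('sklearn.svm.LinearSVR',
           'sklearn.feature_selection.SelectFwe',
           'sklearn.feature_selection.SelectFromModel'):
    stable_config.pop(op, None)  # ignore operators absent in this TPOT version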
If anybody gets further insights or finds a more elegant solution to these types of errors, I would be happy to hear about it here.
Below you can find the important pieces of code that I used to customize the pipe (learner) and the config_dict of TPOTAdaptor, ready to use for anyone who would like to test this approach. Except for the config_dict, the TPOTAdaptor configuration shown below is basically that of the default "production" preset in automatminer:
config_dict_1 = {
    'sklearn.ensemble.RandomForestRegressor': {
        'n_estimators': [20, 100, 200, 500, 1000],
        'max_features': [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95],
        'min_samples_split': range(2, 21, 3),
        'min_samples_leaf': range(1, 21, 3),
        'bootstrap': [True, False]},
    'sklearn.ensemble.GradientBoostingRegressor': {
        'n_estimators': [20, 100, 200, 500, 1000],
        'loss': ['ls', 'lad', 'huber', 'quantile'],
        'learning_rate': [0.01, 0.1, 0.5, 1.0],
        'max_depth': range(1, 11, 2),
        'min_samples_split': range(2, 21, 3),
        'min_samples_leaf': range(1, 21, 3),
        'subsample': [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5,
                      0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0],
        'max_features': [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5,
                         0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0],
        'alpha': [0.75, 0.8, 0.85, 0.9, 0.95, 0.99]},
    'sklearn.ensemble.ExtraTreesRegressor': {
        'n_estimators': [20, 100, 200, 500, 1000],
        'max_features': [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95],
        'min_samples_split': range(2, 21, 3),
        'min_samples_leaf': range(1, 21, 3),
        'bootstrap': [True, False]},
    'sklearn.tree.DecisionTreeRegressor': {
        'max_depth': range(1, 11, 2),
        'min_samples_split': range(2, 21, 3),
        'min_samples_leaf': range(1, 21, 3)},
    'sklearn.neighbors.KNeighborsRegressor': {
        'n_neighbors': range(1, 101),
        'weights': ['uniform', 'distance'],
        'p': [1, 2]},
    'sklearn.linear_model.Lasso': {
        'alpha': [1e-2, 1e-1, 1e0, 1e1, 1e2]},  # alpha values taken from Takigawa-2019
    'sklearn.linear_model.LassoLarsCV': {
        'normalize': [True, False]},
    'sklearn.linear_model.RidgeCV': {},
    'sklearn.linear_model.ElasticNetCV': {
        'l1_ratio': [0.0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5,
                     0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0],
        'tol': [1e-05, 0.0001, 0.001, 0.01, 0.1]},
    'sklearn.preprocessing.MaxAbsScaler': {},
    'sklearn.preprocessing.RobustScaler': {},
    'sklearn.preprocessing.StandardScaler': {},
    'sklearn.preprocessing.MinMaxScaler': {},
    'sklearn.preprocessing.Normalizer': {
        'norm': ['l1', 'l2', 'max']},
    'sklearn.preprocessing.PolynomialFeatures': {
        'degree': [2],
        'include_bias': [False],
        'interaction_only': [False]},
    'sklearn.kernel_approximation.RBFSampler': {
        'gamma': [0.0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5,
                  0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0]},
    'sklearn.kernel_approximation.Nystroem': {
        'kernel': ['rbf', 'cosine', 'chi2', 'laplacian', 'polynomial', 'poly',
                   'linear', 'additive_chi2', 'sigmoid'],
        'gamma': [0.0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5,
                  0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0],
        'n_components': range(1, 11)},
    'tpot.builtins.ZeroCount': {},
    'tpot.builtins.OneHotEncoder': {
        'minimum_fraction': [0.05, 0.1, 0.15, 0.2, 0.25],
        'sparse': [False],
        'threshold': [10]},
    'sklearn.preprocessing.Binarizer': {
        'threshold': [0.0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5,
                      0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0]},
    'sklearn.cluster.FeatureAgglomeration': {
        'linkage': ['ward', 'complete', 'average'],
        'affinity': ['euclidean', 'l1', 'l2', 'manhattan', 'cosine']},
    'sklearn.feature_selection.SelectPercentile': {
        'percentile': range(1, 100),
        'score_func': {'sklearn.feature_selection.f_regression': None}},
    'sklearn.decomposition.PCA': {
        'svd_solver': ['randomized'],
        'iterated_power': range(1, 11)},
    'sklearn.decomposition.FastICA': {
        'tol': [0.0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5,
                0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0]},
    'sklearn.feature_selection.VarianceThreshold': {
        'threshold': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.2]}}
from automatminer.pipeline import MatPipe
from automatminer.presets import get_preset_config
from automatminer.automl.adaptors import TPOTAdaptor
from sklearn.model_selection import KFold

# Start from the default "production" preset and swap in the custom learner.
config = get_preset_config("production")
config["learner"] = TPOTAdaptor(max_time_mins=1440,
                                max_eval_time_mins=20,
                                cv=5,
                                verbosity=3,
                                memory='auto',
                                template='Selector-Transformer-Regressor',
                                scoring='neg_mean_absolute_error',
                                config_dict=config_dict_1)
pipe = MatPipe(**config)

# df is the featurizable input DataFrame and target the name of the property
# column to predict; the 5-fold split here is just an example (seed arbitrary).
kf = KFold(n_splits=5, shuffle=True, random_state=0)
predicted_folds = pipe.benchmark(df=df, target=target, kfold=kf)
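After benchmark returns, each element of predicted_folds is the test DataFrame of one fold. A sketch of how the folds can be scored, assuming automatminer's convention of naming the prediction column "<target> predicted":

import numpy as np
from sklearn.metrics import mean_absolute_error

# One MAE per fold, computed from the true and predicted target columns.
maes = [mean_absolute_error(fold[target], fold[target + " predicted"])
        for fold in predicted_folds]
print("MAE per fold:", maes, "| mean:", np.mean(maes))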