I am a seasoned materials scientist but a newbie at machine learning. I am enjoying the process of applying machine learning to my field and have enjoyed going through the matminer topics. To further my understanding, I have been reading “Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow” by Aurélien Géron. I notice that he worries about “overfitting” and, among other things, suggests using principal component analysis (PCA) as a vehicle for feature reduction. In going through the matminer_example repository, I read the formation_e (formation energy) notebook, which partially recreates the 2016 paper by Ward et al. attempting to predict the formation energy. In particular, this notebook (and the paper) generates a large number of features from the composition. This is reasonable in and of itself, but the large number of features (about 140) made me wonder about the need (or not) for feature reduction. In fact, a PCA with even 0.99 of the variance specified results in only three components, suggesting that there is overfitting due to the use of a large feature set. Of course, PCA is linear and will likely miss non-linear contributions, but I wonder about the massive difference in degrees of freedom the PCA suggests. So, for those experts in the Materials Project (and outside of it as well): what are your thoughts on this issue?
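For reference, the PCA check I describe can be sketched roughly like this. The data here is synthetic, standing in for the featurized dataframe: three hidden factors generate 140 correlated columns, so the 0.99-variance threshold collapses the set down to a handful of components.

```python
# Minimal sketch of a 0.99-variance PCA check (synthetic stand-in data).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))        # 3 underlying factors
X = latent @ rng.normal(size=(3, 140))    # 140 highly correlated features
X += 0.01 * rng.normal(size=X.shape)      # a little noise

# Standardize first: PCA is scale-sensitive, and featurized data mixes units.
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=0.99)              # keep 99% of the variance
X_reduced = pca.fit_transform(X_std)
print(X_reduced.shape[1], "components retained")
```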
Some kind of feature reduction on the Magpie feature set is usually a good idea; as you said, many of the features are correlated. This is one reason why automatminer has a feature reduction step in its pipeline.
Regarding the 0.99 variance threshold PCA resulting in 3 features, I believe each of these features is a linear combination of features sourced from the original 140, no? So while the features can be condensed down into “only” 3 principal components, in reality they represent much more information than that. That aside, I certainly understand your concern about overfitting.
When it comes to general-purpose features generated ahead of training time, especially features based on fundamental chemical or physical characteristics (e.g., those in the notebook), some subset of features will be useful for most problems, but any one problem might only have a few features useful for learning. If we take the ~140 Magpie features as an example, you might find a subset of 10 features useful for predicting band gap and a totally different set of 5 features useful for predicting formation enthalpy.
In other words:
If you are applying a general-purpose set of a priori generated features (like those you see in the notebook) to a specific problem, feature reduction is more of a necessity than a suggestion. You are best off aggregating many features from methods you think might be relevant, doing feature reduction, and then learning on that data.
Thank you for your efforts at matminer. It is an interesting project and I hope to both motivate my students to learn more about machine learning as well as to learn how best to apply machine learning in my own materials research.
Following up on your earlier comments about feature reduction: in my first message, I tried using PCA for feature reduction of the Magpie data derived from composition, but it seemed like overkill. Can you recommend some specific ways to achieve feature reduction (let’s say with the element-derived Magpie data, for example)? What is a reasonable course of action?
PCA isn’t a bad option. There are other simple statistical filters you can use though. This article I found has a nice table of some simple ones:
One I would recommend is simply removing the features cross-correlated by more than a threshold value.
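For example, a minimal pandas sketch of that filter (the threshold and toy data are illustrative): for every pair of features whose absolute Pearson correlation exceeds the threshold, one member of the pair is dropped.

```python
# Drop one feature from each pair correlated above a threshold.
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Toy data: "b" is nearly a copy of "a", so it gets dropped.
df = pd.DataFrame({"a": [1, 2, 3, 4],
                   "b": [1.1, 2.0, 3.1, 4.0],
                   "c": [4, 1, 3, 2]})
print(list(drop_correlated(df).columns))  # -> ['a', 'c']
```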
From Andrew Ng’s famous Coursera class on ML, he says that PCA is a visualization technique, not a feature reduction technique. For feature reduction, use regularization & cross-validation in your ML tuning. (Regularization examples: Lasso, ElasticNet, Ridge…)
Most ML techniques have a method to select the most important features, then you need to use cross-validation to ensure that you aren’t overfitting.
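As a hedged sketch of that workflow (the data is synthetic; in practice X would be your Magpie feature matrix), L1 regularization drives the weights of uninformative features to exactly zero, and cross-validation checks that the resulting model generalizes:

```python
# Feature selection via L1 regularization (Lasso), validated by CV.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 50))
# Only the first two features actually matter.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)

model = make_pipeline(StandardScaler(), LassoCV(cv=5))
model.fit(X, y)
coefs = model.named_steps["lassocv"].coef_
print(int(np.sum(coefs != 0)), "of", X.shape[1], "features kept")

# Cross-validation guards against overfitting to the selected subset.
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print("mean CV R^2:", round(scores.mean(), 3))
```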
@flac_tph You are correct in that you need to use some sort of validation in order to ensure you aren’t overfitting. However, I am not sure that statement about PCA is interpreted correctly. See the following direct quote from Andrew Ng:
PCA has many applications; we will close our discussion with a few examples. First, compression—representing x(i)’s with lower dimension y(i)’s—is an obvious application. If we reduce high dimensional data to k = 2 or 3 dimensions, then we can also plot the y(i)’s to visualize the data. For instance, if we were to reduce our automobiles data to 2 dimensions, then we can plot it (one point in our plot would correspond to one car type, say) to see what cars are similar to each other and what groups of cars may cluster together. Another standard application is to preprocess a dataset to reduce its dimension before running a supervised learning algorithm with the x(i)’s as inputs. Apart from computational benefits, reducing the data’s dimension can also reduce the complexity of the hypothesis class considered and help avoid overfitting (e.g., linear classifiers over lower dimensional input spaces will have smaller VC dimension).
Source: Andrew Ng, CS229 Course Notes Spring 2020 link
I think people might be overloading “feature reduction” here. When some hear it, they think “feature selection” while others think “dimensionality reduction.”
Dimensionality reduction (as the name suggests) reduces your feature space down to a smaller dimension while trying to retain as much of what differentiates your data as possible. Feature selection tries to eliminate features that are not useful for your task. You can use both to reduce over-fitting.
Let’s say we have a hypothetical feature set which includes, among other things, the number of letters in the composition’s name written out in English and the material’s bandgap. Because the number of letters varies widely between compositions, this feature would probably figure prominently in PCA-derived features but would not be helpful for predicting anything physically meaningful. Likewise, bandgap might get combined with other features related to composition because it tends to correlate with them (oxides have higher band gaps than metals). In this case, we might be losing valuable information, because a model might have been able to use detailed information about a material’s bandgap to make better predictions about its ZT or conductivity. Even small gaps above 0 are very different from a gap of 0 in terms of materials properties.
If you suspect you have lots of irrelevant features in your training set, you can do cross-validation on subsets of features to discover which features are actually useful to include. We can also do regularization to encourage models to learn sparse weights that might ignore some of these irrelevant features.
However, what if we have a lot of potentially useful features but not enough training data? Our model (which will require lots of parameters) will probably over-fit our dataset and learn the particular fingerprints of our samples instead of the real physical relationships underlying them. In this case, we may benefit from dimensionality reduction to cut the number of features down to something we can reasonably expect not to overfit on. Since all of the features were meaningful, we are not biasing our model inappropriately.
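As an illustrative sketch of this trade-off (all data here is synthetic: 60 samples and 140 correlated features generated from 5 meaningful directions), PCA preprocessing inside a cross-validated pipeline can be compared directly against learning on the raw feature matrix:

```python
# Compare a plain linear model vs. one with PCA preprocessing
# on a small, high-dimensional synthetic dataset.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
latent = rng.normal(size=(60, 5))           # few meaningful directions
X = latent @ rng.normal(size=(5, 140))      # 140 correlated features
X += 0.05 * rng.normal(size=X.shape)
y = latent @ rng.normal(size=5) + 0.1 * rng.normal(size=60)

plain = make_pipeline(StandardScaler(), LinearRegression())
reduced = make_pipeline(StandardScaler(), PCA(n_components=0.99),
                        LinearRegression())

score_plain = cross_val_score(plain, X, y, cv=5).mean()
score_pca = cross_val_score(reduced, X, y, cv=5).mean()
print("no PCA  :", round(score_plain, 3))
print("with PCA:", round(score_pca, 3))
```

Note that the PCA step is fit inside each cross-validation fold, so the dimensionality reduction itself is also validated rather than leaking information from the held-out samples.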