I am a seasoned materials scientist but a newbie at machine learning. I am enjoying the process of applying machine learning to my field and have enjoyed going through the matminer topics. To further my understanding, I have been reading “Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow” by Aurélien Géron. I notice that he worries about “overfitting” and, among other things, suggests feature reduction, for example via principal component analysis (PCA).

In going through the matminer_examples repository, I read through the formation_e (formation energy) notebook that partially recreates the 2016 paper by Ward et al. on predicting formation energy. In particular, this notebook (and the paper) generates a large number of features from the composition. This is reasonable in and of itself, but the large number of features (about 140) made me wonder about the need (or not) for feature reduction. In fact, a PCA analysis retaining even 0.99 of the variance results in only three components, suggesting that there is overfitting due to the use of a large feature set. Of course, PCA is linear and will likely miss non-linear contributions, but I wonder about the massive difference in degrees of freedom the PCA analysis suggests.

So for those experts in the Materials Project (and outside of it as well): what are your thoughts on this issue?
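For reference, the PCA check I ran was along these lines. This is only a sketch: the synthetic matrix below stands in for the actual ~140-column Magpie feature matrix, and the scaling step is my own choice since PCA is sensitive to feature scales.

```python
# Sketch of a PCA variance check on a composition-derived feature matrix.
# X here is synthetic data standing in for the real ~140 Magpie features.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 140))  # stand-in for the Magpie feature matrix

X_scaled = StandardScaler().fit_transform(X)  # PCA is scale-sensitive
pca = PCA(n_components=0.99)  # keep enough components to explain 99% of variance
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape[1])                      # number of retained components
print(pca.explained_variance_ratio_.sum())     # cumulative variance explained
```

(On this random, uncorrelated toy data most components are retained; on the highly correlated Magpie features the count drops dramatically, which is what prompted my question.)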
Some kind of feature reduction on the Magpie feature set is usually a good idea; as you said, many of the features are correlated. This is one reason why automatminer has a feature reduction step in its pipeline.
Regarding the 0.99 variance threshold PCA resulting in 3 features: I believe each of these is a linear combination of the original ~140 features, no? So while the features can be condensed down into “only” 3 principal components, in reality they represent much more information than that. That aside, I certainly understand your concern about overfitting.
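To make that concrete, here is a small sketch (again with synthetic data standing in for the Magpie matrix) showing that each principal component carries a weight for every original feature, so 3 components still draw on the full feature set:

```python
# Each principal component is a weighted combination of ALL original features.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 140))  # stand-in for the feature matrix
pca = PCA(n_components=3).fit(X)

print(pca.components_.shape)  # (3, 140): 3 components x 140 feature weights

# Indices of the five original features contributing most to component 1:
top = np.argsort(np.abs(pca.components_[0]))[::-1][:5]
print(top)
```

Inspecting `components_` like this is also a useful diagnostic: if a handful of original features dominate the loadings, that already hints at which features matter.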
When it comes to general-purpose features generated ahead of training time, especially for features based on fundamental chemical or physical characteristics (e.g., those in the notebook), some subset of features will be useful for most problems but any one problem might only have a few useful features for learning. If we take the ~140 Magpie features as an example, you might find a subset of 10 features useful for predicting band gap and a totally different set of 5 features useful for predicting formation enthalpy.
In other words:
if you are applying a general-purpose set of a priori generated features (like those you see in the notebook) to a specific problem, feature reduction is more of a necessity than a suggestion. You are best off aggregating many features from methods you think might be relevant, doing feature reduction, and then learning on the reduced data.
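That aggregate-reduce-learn workflow can be sketched as a scikit-learn `Pipeline`. The reducers and thresholds below are illustrative choices on toy data, not a prescription from matminer (automatminer makes its own choices internally):

```python
# Sketch of "aggregate features -> reduce -> learn" as a single pipeline.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_regression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 140))                       # stand-in feature matrix
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=300)   # toy target

pipe = Pipeline([
    ("drop_constant", VarianceThreshold(threshold=0.0)),  # remove zero-variance columns
    ("select", SelectKBest(f_regression, k=20)),          # keep 20 most relevant features
    ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
])
pipe.fit(X, y)
print(pipe.score(X, y))  # training R^2 on the toy data
```

Putting the reduction inside the pipeline also means it is re-fit on each cross-validation fold, which avoids leaking information from the held-out data into the feature selection.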
Thank you for your efforts on matminer. It is an interesting project, and I hope both to motivate my students to learn more about machine learning and to learn how best to apply it in my own materials research.
Following up on your earlier comments about feature reduction: in my first message, I tried using PCA for feature reduction on the Magpie data derived from composition, but it seemed like overkill. Can you recommend some specific ways to achieve feature reduction (say, for the element-derived Magpie data)? What is a reasonable course of action?
PCA isn’t a bad option. There are other simple statistical filters you can use though. This article I found has a nice table of some simple ones:
One I would recommend is simply removing features that are cross-correlated with another feature by more than a threshold value.
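A minimal version of that filter, using pandas: compute the absolute correlation matrix, look only at its upper triangle (so each pair is counted once), and drop one feature from every pair above the threshold. The small DataFrame and the 0.95 cutoff are illustrative; you would run this on the full Magpie feature DataFrame with a threshold of your choosing.

```python
# Drop one feature of every pair whose |Pearson correlation| exceeds a threshold.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
df = pd.DataFrame(rng.normal(size=(100, 5)), columns=list("abcde"))
df["f"] = df["a"] * 0.99 + rng.normal(scale=0.01, size=100)  # nearly duplicates "a"

def drop_correlated(df, threshold=0.95):
    corr = df.corr().abs()
    # Keep only the upper triangle so each feature pair appears once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

reduced = drop_correlated(df)
print(reduced.columns.tolist())  # "f" is removed, its near-duplicate "a" is kept
```

Which member of a correlated pair gets dropped is arbitrary here (the later column goes); if one feature is more physically interpretable, you may want to keep that one explicitly.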