Migrated from "How to featurize a task?" (materialsproject/matbench, issue #36) on GitHub
@mm04926412 said:
Hello
First of all, thanks for making this; it fills a niche in the materials ML space that was lacking, and before it, benchmarking models was difficult. However, I'm having some difficulty implementing my model in the way set out. An important part of the ML process is the featurization of the structures, and there doesn't seem to be a clear place to insert the featurization pipeline into the template provided in the matbench "how to" section.
Ideally, I'd be able to featurize the entire task prior to cross validation, in such a way that the train input object has an additional column containing the featurized object the model expects to receive. With the template as currently set up, it seems featurization would either need to be done fold by fold (with 5-fold redundancy) or done "live", which is computationally undesirable.
Is there a way to access the underlying structure pandas Series and modify it prior to the cross-validation loop, so I can insert a featurization pipeline?
The solution:
hi @mm04926412
You're right: typically you featurize the entire dataset and then train on those features. However, there is a subtle case of possible data leakage when you do this for benchmark datasets. Consider:
- You featurize using a method that is "fitted" - based on the input materials primitives, such as composition or structure - on the entire dataset, including both testing and training data (a toy example of such a featurizer is sketched after this list).
- These features now reflect primitives from the test set. For example, if your training set doesn't contain uranium but your test set does, you might end up with a feature like "U-O bond length" that you otherwise wouldn't have.
- Your ML model learns on this "leaked" feature, which is a problem for benchmarking.
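To make the leakage concrete, here is a toy sketch (my own illustration, not matbench or matminer code) of a "fitted" featurizer whose feature space depends on the data it is fit on:

```python
from pymatgen.core import Composition

class ElementPresenceFeaturizer:
    """Toy fitted featurizer: one-hot over the elements seen during fit()."""

    def fit(self, compositions):
        # The element vocabulary is learned from whatever data we are shown.
        self.elements = sorted(
            {el.symbol for c in compositions for el in Composition(c).elements}
        )
        return self

    def transform(self, compositions):
        # One column per element in the fitted vocabulary.
        return [
            [int(Composition(c)[el] > 0) for el in self.elements]
            for c in compositions
        ]
```

Fit this on train + test and the test set's elements (e.g. a "U" column) silently become part of every training feature vector; fit it on the training data alone and they do not.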
So no, it's probably not good to featurize the entire dataset before doing learning. You should featurize the train/val and test sets separately, so they are completely separate at all stages. Your pipeline would then look like this (a code sketch follows the list):
1. Get the training/validation dataset.
2. Generate a set of features for this data.
3. Train and validate your model on this data, including hyperparameter selection.
4. Get the test data.
5. Generate the same set of features as in (2).
6. Predict using the model from (3) and the features from (5).
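In code, the fold loop might look like the sketch below. The matbench calls (`get_train_and_val_data`, `get_test_data`, `record`) are the real task API; the Magpie `ElementProperty` featurizer from matminer and the scikit-learn random forest are just illustrative stand-ins for your own featurization and model.

```python
import numpy as np
from matbench.bench import MatbenchBenchmark
from matminer.featurizers.composition import ElementProperty
from pymatgen.core import Composition
from sklearn.ensemble import RandomForestRegressor

mb = MatbenchBenchmark.from_preset("matbench_v0.1", "regression")
task = mb.matbench_steels
task.load()

# Illustrative featurizer: Magpie element-property statistics. It is
# stateless (never fitted on data), so it cannot leak test information.
featurizer = ElementProperty.from_preset("magpie")

def featurize(compositions):
    # matbench_steels inputs are composition strings; convert each to a
    # pymatgen Composition and compute its feature vector.
    return np.array([featurizer.featurize(Composition(c)) for c in compositions])

for fold in task.folds:
    # Steps 1-3: featurize and fit on the train/val split only.
    train_inputs, train_outputs = task.get_train_and_val_data(fold)
    model = RandomForestRegressor()
    model.fit(featurize(train_inputs), train_outputs)

    # Steps 4-6: featurize the test split the same way, predict, and record.
    test_inputs = task.get_test_data(fold, include_target=False)
    task.record(fold, model.predict(featurize(test_inputs)))
```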
If it is really computationally undesirable though, and you have no way around it... (not recommended, but possible)
...you can just load the raw data with the `MatbenchTask.df` attribute once the task is loaded. Here's an example:
```python
from matbench.bench import MatbenchBenchmark

mb = MatbenchBenchmark.from_preset("matbench_v0.1", "regression")
t = mb.matbench_steels
t.load()
print(t.df)
```
The output is the raw dataframe:

```
                                                     composition  yield strength
mbid
mb-steels-001  Fe0.620C0.000953Mn0.000521Si0.00102Cr0.000110N...          2411.5
mb-steels-002  Fe0.623C0.00854Mn0.000104Si0.000203Cr0.147Ni0....          1123.1
mb-steels-003  Fe0.625Mn0.000102Si0.000200Cr0.0936Ni0.129Mo0....          1736.3
mb-steels-004  Fe0.634C0.000478Mn0.000523Si0.00102Cr0.000111N...          2487.3
mb-steels-005  Fe0.636C0.000474Mn0.000518Si0.00101Cr0.000109N...          2249.6
...                                                          ...             ...
mb-steels-308  Fe0.823C0.0176Mn0.00183Si0.000198Cr0.0779Ni0.0...          1722.5
mb-steels-309  Fe0.823Mn0.000618Si0.00101Cr0.0561Ni0.0984Mo0....          1019.0
mb-steels-310  Fe0.825C0.0174Mn0.00175Si0.000201Cr0.0565Ni0.0...          1860.3
mb-steels-311  Fe0.858C0.0191Mn0.00194Si0.000199Cr0.0753Ni0.0...          1812.1
mb-steels-312  Fe0.860C0.0125Mn0.00274Si0.000198Cr0.00439Ni0....          1139.7

[312 rows x 2 columns]
```
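If you do pre-featurize this way, one possible pattern (a sketch under the same leakage caveats as above, again using the Magpie featurizer as an illustrative stand-in) is to compute features once, keyed by `mbid`, and look them up inside the fold loop:

```python
import numpy as np
from matminer.featurizers.composition import ElementProperty
from pymatgen.core import Composition

featurizer = ElementProperty.from_preset("magpie")

# Featurize every row of the raw dataframe exactly once, keyed by mbid.
# This is only leakage-safe if the featurizer is stateless, i.e. it is
# never fitted on the data it featurizes.
features = t.df["composition"].apply(lambda c: featurizer.featurize(Composition(c)))

for fold in t.folds:
    train_inputs, train_outputs = t.get_train_and_val_data(fold)
    # Look up the precomputed feature vectors by the split's mbid index.
    X_train = np.vstack(features.loc[train_inputs.index].to_list())
    # ... train here, then do the same lookup for the test inputs before
    # predicting and calling t.record(fold, predictions).
```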