Matbench Repost from Github: How to featurize a task?

Migrated from How to featurize a task? · Issue #36 · hackingmaterials/matbench · GitHub

@mm04926412 said:

Hello

First of all, thanks for making this. It fills a niche in the materials ML space that was lacking; before this, benchmarking models was difficult. However, I’m having some difficulty implementing my model in the way set out. An important part of the ML process is the featurization of the structures, and there doesn’t seem to be a clear place to insert a featurization pipeline into the template provided in the matbench “how to” section.

Ideally I’d be able to featurize the entire task prior to the cross validation, in such a way that the train input object has an additional column containing the featurized object the model expects to receive. With the template as currently set up, it seems that featurization would need to be done either fold by fold (with 5-fold redundancy) or “live”, which is computationally undesirable.

Is there a way to access the underlying structure pandas series and modify it prior to the cross validation loop so I can insert a featurization pipeline?

The solution:

hi @mm04926412

You’re right, typically you featurize the entire dataset and can then train on these features. However, there is a subtle case of possible data leakage when you do this for benchmark datasets. Consider:

  1. You featurize using a method that is “fitted” (based on the input material primitives such as composition or structure) on the entire dataset, including both testing and training data.
  2. The resulting feature set now includes features derived from primitives in the test set. For example, if your training set doesn’t contain uranium but your test set does, you now have a feature such as "U-O bond length" that you otherwise wouldn’t have.
  3. Your ML model learns on this “leaked” feature, which is a problem for benchmarking.
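
To make the leakage concrete, here is a minimal sketch in plain Python. The `ElementFractionFeaturizer` class is a hypothetical toy, not part of matbench or matminer, and the compositions are made up. It only illustrates how the set of feature columns changes when a fitted featurizer sees the test set.

```python
class ElementFractionFeaturizer:
    """Toy "fitted" featurizer (hypothetical): one column per element seen in fit()."""

    def fit(self, compositions):
        # The feature columns are decided by whatever data is passed here.
        self.columns_ = sorted({el for comp in compositions for el in comp})
        return self

    def transform(self, compositions):
        # Elements that were not seen during fit() are silently ignored.
        return [[comp.get(el, 0.0) for el in self.columns_] for comp in compositions]


# Made-up compositions: uranium appears only in the test set.
train = [{"Fe": 0.7, "O": 0.3}, {"Fe": 0.5, "Ni": 0.5}]
test = [{"U": 0.33, "O": 0.67}]

leaky = ElementFractionFeaturizer().fit(train + test)  # fitted on everything
clean = ElementFractionFeaturizer().fit(train)         # fitted on training data only

print(leaky.columns_)         # ['Fe', 'Ni', 'O', 'U']  <- "U" leaked from the test set
print(clean.columns_)         # ['Fe', 'Ni', 'O']
print(clean.transform(test))  # [[0.0, 0.0, 0.67]]
```

The leaky version hands the model a uranium column that it could only know about by looking at the test set.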

So no, it’s probably not a good idea to featurize the entire dataset before training. You should featurize the train/val and test sets separately so they stay completely separate at all stages. Your pipeline would then look like this:

  1. Get training/validation dataset
  2. Generate set of features for this data
  3. Train and validate model on this data, including hyperparameter selection
  4. Get test data
  5. Generate the same set of features as in (2)
  6. Predict using the model from (3) and the features from (5)
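
The six steps above can be sketched for a single fold as follows. This is only an illustration under stated assumptions: `ElementFractions`, `NearestNeighbor` and `run_fold` are hypothetical stand-ins for your own featurizer, model and glue code, and the data is made up.

```python
class ElementFractions:
    """Toy fitted featurizer (hypothetical): columns are the elements seen in fit()."""

    def fit(self, comps):
        self.columns_ = sorted({el for c in comps for el in c})
        return self

    def transform(self, comps):
        return [[c.get(el, 0.0) for el in self.columns_] for c in comps]


class NearestNeighbor:
    """Toy 1-nearest-neighbour regressor standing in for a real model."""

    def fit(self, X, y):
        self.X_, self.y_ = X, list(y)
        return self

    def predict(self, X):
        def dist(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))

        return [
            self.y_[min(range(len(self.X_)), key=lambda i: dist(x, self.X_[i]))]
            for x in X
        ]


def run_fold(train_inputs, train_outputs, test_inputs):
    # Steps 1-2: fit the featurizer on the training/validation data only.
    featurizer = ElementFractions().fit(train_inputs)
    X_train = featurizer.transform(train_inputs)
    # Step 3: train the model (hyperparameter selection would also go here).
    model = NearestNeighbor().fit(X_train, train_outputs)
    # Steps 4-5: generate the same set of features for the test data.
    X_test = featurizer.transform(test_inputs)
    # Step 6: predict with the trained model.
    return model.predict(X_test)


predictions = run_fold(
    train_inputs=[{"Fe": 0.9, "C": 0.1}, {"Fe": 0.6, "Ni": 0.4}],
    train_outputs=[1200.0, 1900.0],
    test_inputs=[{"Fe": 0.65, "Ni": 0.35}],
)
print(predictions)  # [1900.0]
```

In the matbench template, the arguments to `run_fold` would presumably come from `task.get_train_and_val_data(fold)` and `task.get_test_data(fold, include_target=False)`, with the returned predictions passed to `task.record(fold, predictions)`. Check the matbench documentation for the exact signatures.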

If it is really computationally undesirable though, and you have no way around it… (not recommended, but possible)

… you can access the raw data through the MatbenchTask.df attribute once the task is loaded. Here’s an example:

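from matbench.bench import MatbenchBenchmark

# Load the benchmark preset, restricted to the regression tasks
mb = MatbenchBenchmark.from_preset("matbench_v0.1", "regression")
t = mb.matbench_steels
t.load()  # load the dataset for this task

# Raw, unfeaturized data for the whole task (all folds)
print(t.df)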

The output is the raw dataframe:

Out[7]: 
                                                     composition  yield strength
mbid                                                                            
mb-steels-001  Fe0.620C0.000953Mn0.000521Si0.00102Cr0.000110N...          2411.5
mb-steels-002  Fe0.623C0.00854Mn0.000104Si0.000203Cr0.147Ni0....          1123.1
mb-steels-003  Fe0.625Mn0.000102Si0.000200Cr0.0936Ni0.129Mo0....          1736.3
mb-steels-004  Fe0.634C0.000478Mn0.000523Si0.00102Cr0.000111N...          2487.3
mb-steels-005  Fe0.636C0.000474Mn0.000518Si0.00101Cr0.000109N...          2249.6
                                                          ...             ...
mb-steels-308  Fe0.823C0.0176Mn0.00183Si0.000198Cr0.0779Ni0.0...          1722.5
mb-steels-309  Fe0.823Mn0.000618Si0.00101Cr0.0561Ni0.0984Mo0....          1019.0
mb-steels-310  Fe0.825C0.0174Mn0.00175Si0.000201Cr0.0565Ni0.0...          1860.3
mb-steels-311  Fe0.858C0.0191Mn0.00194Si0.000199Cr0.0753Ni0.0...          1812.1
mb-steels-312  Fe0.860C0.0125Mn0.00274Si0.000198Cr0.00439Ni0....          1139.7
[312 rows x 2 columns]
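
If you do go down this route, one way to limit the damage is to precompute only stateless features, where each row’s features depend on that row alone, and look them up by id inside each fold. The sketch below uses plain Python with made-up ids and formulas; `featurize`, `ELEMENTS` and `feature_cache` are hypothetical names, and the small dict stands in for the raw dataframe.

```python
import re

# Fixed, predeclared element list: the feature columns do not depend on the data.
ELEMENTS = ["Fe", "C", "Mn", "Cr", "Ni"]


def featurize(formula):
    """Stateless toy featurizer: fractions of a fixed set of elements."""
    amounts = {
        el: float(x) for el, x in re.findall(r"([A-Z][a-z]?)([0-9.]+)", formula)
    }
    return [amounts.get(el, 0.0) for el in ELEMENTS]


# Stand-in for the raw data; the ids and formulas below are made up.
raw = {
    "id-001": "Fe0.62C0.01Mn0.37",
    "id-002": "Fe0.70Cr0.15Ni0.15",
    "id-003": "Fe0.80C0.02Ni0.18",
}

# Featurize every row once, keyed by id.
feature_cache = {mbid: featurize(formula) for mbid, formula in raw.items()}

# Inside each fold, look up the precomputed rows instead of recomputing them.
train_ids, test_ids = ["id-001", "id-002"], ["id-003"]
X_train = [feature_cache[i] for i in train_ids]
X_test = [feature_cache[i] for i in test_ids]

print(X_test)  # [[0.8, 0.02, 0.0, 0.0, 0.18]]
```

With pandas, the same lookup could be done with `.loc` on the index of the fold inputs, assuming the fold inputs keep the `mbid` index shown above. Anything fitted, such as scalers, learned vocabularies or feature selection, would still need to be fitted on the training rows of each fold.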