Automatminer predicting on unknown data

emmeau · July 9, 2019, 4:49pm

Hi,
I have a model for binary classification from automatminer that I’m pretty happy with - I now want to try to run the “predict” function on some unknown compounds. The predict function still needs a target column - can I just fill the target column randomly with 1s and 0s or will that bias the model somehow?

Thanks!

Emily S

ardunn · July 9, 2019, 5:12pm

Hey Emily,

The target column doesn’t have to exist in the data frame (if it is in the data frame, it is ignored during “predict”). The target arg is included in predict simply for consistency of syntax and as a sanity check (to make sure you are trying to predict the same quantity the pipeline was trained on; this has saved me many times). It also defines the output column name. Regardless, the target should have the same name as the one you fit on!

So the following should work:

pipe.fit(df_containing_target, target)

predictions = pipe.predict(df_not_containing_target, target)
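
To see why placeholder labels cannot bias anything, here is a toy sketch of the behavior described above. It is not automatminer code: ToyPipe, the feature name "x", the "is_metal" target, and the "<target> predicted" output column name are all assumptions made for illustration.

```python
import random


class ToyPipe:
    """Hypothetical stand-in for a fitted pipeline; this is NOT MatPipe."""

    def __init__(self):
        self.fitted_target = None
        self.feature = None
        self.centroids = {}

    def fit(self, rows, target, feature="x"):
        # Nearest-centroid "model": store the mean feature value per class.
        by_class = {}
        for row in rows:
            by_class.setdefault(row[target], []).append(row[feature])
        self.centroids = {c: sum(v) / len(v) for c, v in by_class.items()}
        self.fitted_target = target
        self.feature = feature
        return self

    def predict(self, rows, target):
        # Sanity check: the caller must name the quantity that was fit.
        if target != self.fitted_target:
            raise ValueError(
                f"fit on {self.fitted_target!r}, asked for {target!r}"
            )
        out = []
        for row in rows:
            # An existing target value is dropped, so it cannot affect the result.
            new_row = {k: v for k, v in row.items() if k != target}
            x = row[self.feature]
            # Assumed naming scheme for the output column.
            new_row[f"{target} predicted"] = min(
                self.centroids, key=lambda c: abs(self.centroids[c] - x)
            )
            out.append(new_row)
        return out


train = [
    {"x": 0.1, "is_metal": 0},
    {"x": 0.3, "is_metal": 0},
    {"x": 0.9, "is_metal": 1},
    {"x": 1.1, "is_metal": 1},
]
pipe = ToyPipe().fit(train, "is_metal")

unknown = [{"x": 0.2}, {"x": 1.0}]
# The same compounds with a randomly filled target column.
unknown_random = [dict(row, is_metal=random.choice([0, 1])) for row in unknown]

clean = pipe.predict(unknown, "is_metal")
filled = pipe.predict(unknown_random, "is_metal")
print(clean == filled)  # the placeholder labels make no difference
```

In this sketch, asking for a different target name than the one used in fit raises a ValueError, which is the kind of sanity check mentioned above.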

Thanks,

Alex

–

You received this message because you are subscribed to the Google Groups “matminer” group.

To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].

To view this discussion on the web visit https://groups.google.com/d/msgid/matminer/7b4f2153-2ae1-4221-b44d-3af1fd3085cf%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

emmeau · July 9, 2019, 5:59pm

Hi,
Okay, that makes sense. It would be pretty lame if the target values you input affect your model, I suppose haha
Thanks!

Emily

Anubhav_Jain · July 9, 2019, 6:00pm

I would vote for removing “target” from predict; requiring it seems very confusing.

“simply for consistency of syntax”

Why do fit and predict need the same syntax? I’d rather have consistency of syntax with a normal scikit-learn model, which doesn’t ask you for a target column in predict (since none is needed!)

“to make sure you are trying to predict the same quantity the pipeline was trained on, this has saved me many times”

This seems more a problem with the way you must be organizing your code? If you are having trouble keeping track of which pipeline was trained on what, you could use descriptive variable names like pipe_bandgap, or comments?

I’d like to hear a good argument as to why “target” is needed for predict
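
For comparison, the scikit-learn convention referred to above passes the target only to fit. The sketch below is a toy estimator in plain Python that merely follows that fit(X, y) / predict(X) shape. MeanRegressor and the sample numbers are made up for illustration, and pipe_bandgap simply reuses the variable name suggested above.

```python
class MeanRegressor:
    """Toy estimator following the fit(X, y) / predict(X) shape."""

    def fit(self, X, y):
        # The target values are only ever seen here.
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        # No target argument: one prediction per input row.
        return [self.mean_ for _ in X]


# A descriptive name records what this model was trained on.
pipe_bandgap = MeanRegressor().fit([[0.1], [0.4], [0.7]], [1.0, 2.0, 3.0])
bandgap_predictions = pipe_bandgap.predict([[0.5], [0.6]])
print(bandgap_predictions)
```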

ardunn · July 9, 2019, 9:31pm

I’m certainly not wedded to the idea of having target in predict as a required argument.

The current implementation is this way because the underlying classes of MatPipe (AutoFeaturizer, DataCleaner, FeatureReducer, and all AutoMLAdaptors) use the same .transform operations in MatPipe fit and MatPipe predict, since many of the underlying operations are the same during fitting and prediction. For example, AutoFeaturizer.transform creates descriptors in mostly the same way whether the operand is the df being featurized for fitting or some other df during prediction. I suppose the underlying classes could just read their own .fitted_target to get the target during .transform, at the cost of a bit of added code complexity.
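
One way to picture the .fitted_target idea is sketched below. ToyStage is hypothetical and is not automatminer code; only the attribute name fitted_target comes from the description above, and the "feature_sum" descriptor is invented for the example.

```python
class ToyStage:
    """Hypothetical pipeline stage sharing one transform for fit and predict."""

    def __init__(self):
        self.fitted_target = None

    def fit(self, rows, target):
        # Remember which column was the target at fit time.
        self.fitted_target = target
        return self

    def transform(self, rows, target=None):
        if self.fitted_target is None:
            raise RuntimeError("call fit before transform")
        if target is None:
            # Predict path: fall back to the target remembered from fit.
            target = self.fitted_target
        elif target != self.fitted_target:
            raise ValueError(
                f"fit on {self.fitted_target!r}, asked for {target!r}"
            )
        out = []
        for row in rows:
            new_row = dict(row)
            # Toy descriptor computed from the non-target columns only.
            new_row["feature_sum"] = sum(
                v for k, v in row.items() if k != target
            )
            out.append(new_row)
        return out


train = [{"a": 1, "b": 2, "bandgap": 0.5}]
stage = ToyStage().fit(train, "bandgap")

train_out = stage.transform(train, "bandgap")  # fit path, target given
new_out = stage.transform([{"a": 1, "b": 2}])  # predict path, no target
print(train_out[0]["feature_sum"], new_out[0]["feature_sum"])
```

The extra branch for a missing target is the added complexity in question: each stage has to decide what to do when the argument is absent or disagrees with what it was fit on.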

Thoughts?

Thanks,

Alex

Anubhav_Jain · August 6, 2019, 4:48pm

Note - I have added this as a GitHub issue:

ardunn · August 7, 2019, 11:58pm

Hey Emily,

You can now use MatPipe predict without target. Pull the latest commits if you’d like this capability!

Thanks,

Alex