Using vtreat with Multinomial Classification Problems

Nina Zumel and John Mount November 2019

Note: this is a description of the Python version of vtreat; the same example for the R version of vtreat can be found here.

Preliminaries

Load modules/packages.

import pkg_resources
import pandas
import numpy
import numpy.random
import seaborn
import matplotlib.pyplot as plt
import vtreat
import vtreat.util
import wvpy.util

numpy.random.seed(2019)

Generate example data.

y is a noisy sinusoidal function of the variable x
yc is the multi-class output to be predicted: y's quantized value as 'large', 'liminal', or 'small'.
Input xc is a categorical variable that represents a discretization of y, along with some NaNs
Input x2 is a pure noise variable with no relationship to the output

def make_data(nrows):
    d = pandas.DataFrame({'x': 5*numpy.random.normal(size=nrows)})
    d['y'] = numpy.sin(d['x']) + 0.1*numpy.random.normal(size=nrows)
    d.loc[numpy.arange(3, 10), 'x'] = numpy.nan                           # introduce missing values in x
    d['xc'] = ['level_' + str(5*numpy.round(yi/5, 1)) for yi in d['y']]
    d['x2'] = numpy.random.normal(size=nrows)
    d.loc[d['xc']=='level_-1.0', 'xc'] = numpy.nan  # introduce a nan level
    d['yc'] = numpy.where(d['y']>0.5, 'large', numpy.where(d['y']<-0.5, 'small', 'liminal'))
    return d

d = make_data(500)

d.head()

	x	y	xc	x2	yc
0	-1.088395	-0.956311	NaN	-1.424184	small
1	4.107277	-0.671564	level_-0.5	0.427360	small
2	7.406389	0.906303	level_1.0	0.668849	large
3	NaN	0.222792	level_0.0	-0.015787	liminal
4	NaN	-0.975431	NaN	-0.491017	small

Some quick data exploration

Check how many levels xc has, and their distribution (including NaN)

d['xc'].unique()

array([nan, 'level_-0.5', 'level_1.0', 'level_0.0', 'level_-0.0',
       'level_0.5'], dtype=object)

d['xc'].value_counts(dropna=False)

level_1.0     140
NaN           109
level_-0.5    103
level_0.5      75
level_0.0      37
level_-0.0     36
Name: xc, dtype: int64

Show the distribution of yc

d['yc'].value_counts(dropna=False)

large      175
small      166
liminal    159
Name: yc, dtype: int64

Build a transform appropriate for classification problems.

Now that we have the data, we want to treat it prior to modeling: we want training data where all the input variables are numeric and have no missing values or NaNs.

First create the data treatment transform object, in this case a treatment for a multinomial classification problem.

transform = vtreat.MultinomialOutcomeTreatment(
    outcome_name='yc',    # outcome variable
    cols_to_copy=['y'],   # columns to "carry along" but not treat as input variables
)

Use the training data d to fit the transform and return a treated training set: completely numeric, with no missing values. Note that for the training data d, transform.fit_transform() is not the same as transform.fit().transform(); the second call can lead to nested model bias in some situations, and is not recommended. For other, later data o that was not seen during transform design, transform.transform(o) is an appropriate step.

d_prepared = transform.fit_transform(d, d['yc'])
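
The nested model bias mentioned above can be illustrated without vtreat. The following is a minimal sketch, not vtreat's implementation: it assumes a pure-noise categorical variable with many levels and compares an outcome-rate encoding built and applied on the same rows against one built out-of-fold. The helper names (naive_rate_code, out_of_fold_rate_code) and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def naive_rate_code(cat, y):
    """Encode each level by the outcome rate on the same rows it is applied to."""
    return y.groupby(cat).transform('mean')

def out_of_fold_rate_code(cat, y, k=5, seed=0):
    """Encode each row using outcome rates estimated on the other k-1 folds only."""
    rng = np.random.default_rng(seed)
    fold = rng.permutation(len(y)) % k          # random, near-equal fold assignment
    coded = pd.Series(np.nan, index=y.index)
    for f in range(k):
        held = fold == f
        rates = y[~held].groupby(cat[~held]).mean()
        # levels unseen in the other folds fall back to those folds' overall rate
        coded.loc[held] = cat[held].map(rates).fillna(y[~held].mean())
    return coded

# pure-noise example: a many-level categorical that is unrelated to the outcome
rng = np.random.default_rng(2019)
n = 500
noise_cat = pd.Series(rng.integers(0, 200, size=n)).map(lambda v: f'level_{v}')
y_noise = pd.Series(rng.integers(0, 2, size=n).astype(float))

corr_naive = np.corrcoef(y_noise, naive_rate_code(noise_cat, y_noise))[0, 1]
corr_oof = np.corrcoef(y_noise, out_of_fold_rate_code(noise_cat, y_noise))[0, 1]
print(f'naive: {corr_naive:.2f}, out-of-fold: {corr_oof:.2f}')
```

In this sketch the naive encoding looks strongly related to the outcome even though the variable is noise, while the out-of-fold encoding does not. That is the kind of effect a cross-validated fit_transform() is meant to avoid on training data.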

Now examine the score frame, which gives information about each new variable, including its type, which original variable it is derived from, its (cross-validated) correlation with the outcome, and its (cross-validated) significance as a one-variable linear model for the outcome.

transform.score_frame_

	variable	orig_variable	treatment	y_aware	has_range	PearsonR	R2	significance	vcount	default_threshold	recommended	outcome_target
0	x_is_bad	x	missing_indicator	False	True	-0.051749	0.002388	2.137073e-01	2.0	0.100000	False	large
1	xc_is_bad	xc	missing_indicator	False	True	-0.387438	0.169451	0.000000e+00	2.0	0.100000	True	large
2	x	x	clean_copy	False	True	0.052826	0.002158	2.371412e-01	2.0	0.100000	False	large
3	x2	x2	clean_copy	False	True	0.069126	0.003709	1.212047e-01	2.0	0.100000	False	large
4	xc_logit_code_liminal	xc	logit_code	True	True	-0.447451	0.163299	0.000000e+00	3.0	0.066667	False	large
5	xc_logit_code_large	xc	logit_code	True	True	0.867882	0.833866	0.000000e+00	3.0	0.066667	True	large
6	xc_logit_code_small	xc	logit_code	True	True	-0.631421	0.825622	0.000000e+00	3.0	0.066667	False	large
7	xc_prevalence_code	xc	prevalence_code	False	True	0.567968	0.341474	0.000000e+00	1.0	0.200000	True	large
8	xc_lev_level_1_0	xc	indicator_code	False	True	0.849837	0.645319	0.000000e+00	4.0	0.050000	True	large
9	xc_lev__NA_	xc	indicator_code	False	True	-0.387438	0.169451	0.000000e+00	4.0	0.050000	True	large
10	xc_lev_level_-0_5	xc	indicator_code	False	True	-0.373767	0.158566	0.000000e+00	4.0	0.050000	True	large
11	xc_lev_level_0_5	xc	indicator_code	False	True	0.102752	0.007894	2.377379e-02	4.0	0.050000	True	large
12	x_is_bad	x	missing_indicator	False	True	0.028292	0.000609	5.371098e-01	2.0	0.100000	False	liminal
13	xc_is_bad	xc	missing_indicator	False	True	-0.360534	0.155143	0.000000e+00	2.0	0.100000	True	liminal
14	x	x	clean_copy	False	True	-0.060936	0.002973	1.726990e-01	2.0	0.100000	False	liminal
15	x2	x2	clean_copy	False	True	0.001912	0.000003	9.658994e-01	2.0	0.100000	False	liminal
16	xc_logit_code_liminal	xc	logit_code	True	True	0.705192	0.592554	0.000000e+00	3.0	0.066667	True	liminal
17	xc_logit_code_large	xc	logit_code	True	True	-0.268632	0.060381	8.003610e-10	3.0	0.066667	False	liminal
18	xc_logit_code_small	xc	logit_code	True	True	-0.201252	0.033368	4.923164e-06	3.0	0.066667	False	liminal
19	xc_prevalence_code	xc	prevalence_code	False	True	-0.711666	0.482260	0.000000e+00	1.0	0.200000	True	liminal
20	xc_lev_level_1_0	xc	indicator_code	False	True	-0.425828	0.209794	0.000000e+00	4.0	0.050000	True	liminal
21	xc_lev__NA_	xc	indicator_code	False	True	-0.360534	0.155143	0.000000e+00	4.0	0.050000	True	liminal
22	xc_lev_level_-0_5	xc	indicator_code	False	True	0.140658	0.015196	2.051361e-03	4.0	0.050000	True	liminal
23	xc_lev_level_0_5	xc	indicator_code	False	True	0.194241	0.028310	2.580974e-05	4.0	0.050000	True	liminal
24	x_is_bad	x	missing_indicator	False	True	0.024435	0.000451	5.921961e-01	2.0	0.100000	False	small
25	xc_is_bad	xc	missing_indicator	False	True	0.748935	0.489005	0.000000e+00	2.0	0.100000	True	small
26	x	x	clean_copy	False	True	0.006755	0.000036	8.799326e-01	2.0	0.100000	False	small
27	x2	x2	clean_copy	False	True	-0.071903	0.004068	1.078311e-01	2.0	0.100000	False	small
28	xc_logit_code_liminal	xc	logit_code	True	True	-0.244170	0.047673	3.700307e-08	3.0	0.066667	False	small
29	xc_logit_code_large	xc	logit_code	True	True	-0.613362	0.707546	0.000000e+00	3.0	0.066667	False	small
30	xc_logit_code_small	xc	logit_code	True	True	0.838533	0.771629	0.000000e+00	3.0	0.066667	True	small
31	xc_prevalence_code	xc	prevalence_code	False	True	0.128509	0.013351	3.580040e-03	1.0	0.200000	True	small
32	xc_lev_level_1_0	xc	indicator_code	False	True	-0.439636	0.218219	0.000000e+00	4.0	0.050000	True	small
33	xc_lev__NA_	xc	indicator_code	False	True	0.748935	0.489005	0.000000e+00	4.0	0.050000	True	small
34	xc_lev_level_-0_5	xc	indicator_code	False	True	0.239464	0.042966	1.734635e-07	4.0	0.050000	True	small
35	xc_lev_level_0_5	xc	indicator_code	False	True	-0.296154	0.105297	3.330669e-16	4.0	0.050000	True	small

Note that the variable xc has been converted to multiple variables:

an indicator variable for each possible level (xc_lev_level_*)
the value of a (cross-validated) one-variable "one versus rest" model for yc as a function of xc; one per possible outcome class (xc_logit_code_*)
a variable that returns how prevalent this particular value of xc is in the training data (xc_prevalence_code)
a variable indicating when xc was NaN in the original data (xc_is_bad)

Any or all of these new variables are available for downstream modeling.
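
To make the non-y-aware variable types concrete, here is a small pandas sketch of what a missing indicator, a prevalence code, and indicator codes look like for a toy categorical column. It is an illustration under assumed data, not vtreat's code; the cutoff min_fraction is a hypothetical stand-in for the role played by indicator_min_fraction.

```python
import pandas as pd

xc = pd.Series(['a', 'b', 'a', None, 'a', 'c', None, 'b', 'a', 'a'])
filled = xc.fillna('_NA_')                      # treat missing as its own level
level_fraction = filled.value_counts(normalize=True)

# missing indicator: 1.0 where the original value was missing
xc_is_bad = xc.isna().astype(float)

# prevalence code: how common each row's level is in the training data
xc_prevalence_code = filled.map(level_fraction)

# indicator codes: one 0/1 column per sufficiently common level
min_fraction = 0.15   # hypothetical cutoff for "common" levels
common_levels = level_fraction[level_fraction >= min_fraction].index
indicators = pd.DataFrame(
    {f'xc_lev_{lev}': (filled == lev).astype(float) for lev in common_levels}
)

toy_prepared = indicators.assign(
    xc_is_bad=xc_is_bad,
    xc_prevalence_code=xc_prevalence_code,
)
print(toy_prepared)
```

The rare level 'c' gets no indicator column of its own in this sketch, but it is still represented through the prevalence code.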

Variables of type logit_code_* are useful when dealing with categorical variables with a very large number of possible levels. For example, a categorical variable with 10,000 possible values potentially converts to 10,000 indicator variables, which may be unwieldy for some modeling methods. Using one numerical variable of type logit_code_* per outcome target may be a preferable alternative.

Unlike the other vtreat treatments (Numeric, Binomial, Unsupervised), the score frame here has more rows than created variables, because the significance of each variable is evaluated against each possible outcome target.

The recommended column indicates which variables are non constant (has_range == True) and have a significance value smaller than default_threshold with respect to a particular outcome target. See the section Deriving the Default Thresholds below for the reasoning behind the default thresholds. Recommended columns are intended as advice about which variables appear to be most likely to be useful in a downstream model. This advice attempts to be conservative, to reduce the possibility of mistakenly eliminating variables that may in fact be useful (although, obviously, it can still mistakenly eliminate variables that have a real but non-linear relationship to the output, as is the case with x, in our example). Since each variable has multiple recommendations, one can consider a variable to be recommended if it is recommended for any of the outcome targets: an OR of all the recommendations.

Examining variables

To select variables, we can make our selection either in terms of new variables, as follows.

score_frame = transform.score_frame_
good_new_variables = score_frame.variable[score_frame.recommended].unique()
good_new_variables

array(['xc_is_bad', 'xc_logit_code_large', 'xc_prevalence_code',
       'xc_lev_level_1_0', 'xc_lev__NA_', 'xc_lev_level_-0_5',
       'xc_lev_level_0_5', 'xc_logit_code_liminal', 'xc_logit_code_small'],
      dtype=object)

Or in terms of original variables as follows.

good_original_variables = score_frame.orig_variable[score_frame.recommended].unique()
good_original_variables

array(['xc'], dtype=object)

Notice that in each case we must call unique(), as each variable (derived or original) is potentially qualified against each possible outcome.
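
The "OR of all the recommendations" can also be written as a group-wise reduction, which additionally shows which outcome targets recommend each variable. The sketch below uses a toy score frame with hypothetical values; only the column names are taken from the score frame shown earlier.

```python
import pandas as pd

# toy score frame with hypothetical values: one row per (variable, outcome_target)
toy_score_frame = pd.DataFrame({
    'variable': ['x2', 'xc_is_bad', 'x2', 'xc_is_bad', 'x2', 'xc_is_bad'],
    'orig_variable': ['x2', 'xc', 'x2', 'xc', 'x2', 'xc'],
    'outcome_target': ['large', 'large', 'liminal', 'liminal', 'small', 'small'],
    'recommended': [False, True, False, False, False, True],
})

# a variable is kept if it is recommended for any outcome target (logical OR)
recommended_any = toy_score_frame.groupby('variable')['recommended'].any()

# the outcome targets for which each variable is recommended
recommending_targets = (
    toy_score_frame[toy_score_frame['recommended']]
    .groupby('variable')['outcome_target']
    .apply(list)
)
print(recommended_any)
print(recommending_targets)
```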

Notice that, by default, d_prepared only includes recommended variables (along with y and yc):

d_prepared.head()

	y	yc	xc_is_bad	xc_logit_code_liminal	xc_logit_code_large	xc_logit_code_small	xc_prevalence_code	xc_lev_level_1_0	xc_lev__NA_	xc_lev_level_-0_5
0	-0.956311	small	1.0	-5.745320	-5.837138	1.099069	0.218	0.0	1.0	0.0
1	-0.671564	small	0.0	0.315510	-5.835271	0.517553	0.206	0.0	0.0	1.0
2	0.906303	large	0.0	-5.749526	1.047935	-5.793186	0.280	1.0	0.0	0.0
3	0.222792	liminal	0.0	1.137167	-5.776219	-5.726406	0.074	0.0	0.0	0.0
4	-0.975431	small	1.0	-5.745742	-5.837590	1.099070	0.218	0.0	1.0	0.0

This is vtreat's default behavior; to include all variables in the prepared data, set the parameter filter_to_recommended to False, as we show later, in the Parameters for MultinomialOutcomeTreatment section below.

Using the Prepared Data in a Model

Of course, what we really want to do with the prepared training data is to fit a model jointly with all the (recommended) variables. Let's try fitting a logistic regression model to d_prepared.

import sklearn.linear_model

not_variables = ['y', 'yc', 'prediction', 'prob_on_predicted_class', 'predict', 'large', 'liminal', 'small', 'prob_on_correct_class']
model_vars = [v for v in d_prepared.columns if v not in set(not_variables)]

fitter = sklearn.linear_model.LogisticRegression(
    solver = 'saga',
    penalty = 'l2',
    C = 1,
    max_iter = 1000,
    multi_class = 'multinomial')
fitter.fit(d_prepared[model_vars], d_prepared['yc'])

LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='multinomial', n_jobs=None, penalty='l2',
                   random_state=None, solver='saga', tol=0.0001, verbose=0,
                   warm_start=False)

# convenience functions for predicting and adding predictions to original data frame

def add_predictions(d_prepared, model_vars, fitter):
    pred = fitter.predict_proba(d_prepared[model_vars])
    classes = fitter.classes_
    d_prepared['prob_on_predicted_class'] = 0.0  # float column, so probabilities can be assigned without a dtype change
    d_prepared['predict'] = None
    for i in range(len(classes)):
        cl = classes[i]
        d_prepared[cl] = pred[:, i]
        improved = d_prepared[cl] > d_prepared['prob_on_predicted_class']
        d_prepared.loc[improved, 'predict'] = cl
        d_prepared.loc[improved, 'prob_on_predicted_class'] = d_prepared.loc[improved, cl]
    return d_prepared

def add_value_by_column(d_prepared, name_column, new_column):
    vals = d_prepared[name_column].unique()
    d_prepared[new_column] = None
    for v in vals:
        matches = d_prepared[name_column]==v
        d_prepared.loc[matches, new_column] = d_prepared.loc[matches, v]
    return d_prepared

# now predict
d_prepared = add_predictions(d_prepared, model_vars, fitter)
d_prepared = add_value_by_column(d_prepared, 'yc', 'prob_on_correct_class')
to_print=['yc', 'predict','large','liminal','small', 'prob_on_predicted_class','prob_on_correct_class']
d_prepared[to_print].head()

	yc	predict	large	liminal	small	prob_on_predicted_class	prob_on_correct_class
0	small	small	0.000344	0.000630	0.999026	0.999026	0.999026
1	small	small	0.000370	0.437370	0.562260	0.562260	0.56226
2	large	large	0.999188	0.000550	0.000261	0.999188	0.999188
3	liminal	liminal	0.000794	0.998388	0.000818	0.998388	0.998388
4	small	small	0.000344	0.000630	0.999026	0.999026	0.999026

Here, the columns large, liminal and small give the predicted probability of each target outcome and predict gives the predicted (most probable) class. The column prob_on_predicted_class returns the predicted probability of the predicted class, and prob_on_correct_class returns the predicted probability of the actual class.
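
The same three quantities can be computed directly from the probability matrix with numpy indexing. This is an alternative sketch on hypothetical probabilities, not a replacement for the helper functions above; the arrays classes, pred, and actual stand in for fitter.classes_, the output of predict_proba(), and the true labels.

```python
import numpy as np

classes = np.array(['large', 'liminal', 'small'])   # stands in for fitter.classes_
pred = np.array([[0.7, 0.2, 0.1],                   # hypothetical predict_proba() output
                 [0.1, 0.3, 0.6],
                 [0.2, 0.5, 0.3]])
actual = np.array(['large', 'liminal', 'liminal'])  # hypothetical true classes

rows = np.arange(len(pred))
best = pred.argmax(axis=1)                          # column of the most probable class
predict = classes[best]
prob_on_predicted_class = pred[rows, best]

# look up the column of each row's true class, then read off its probability
column_of = {cl: i for i, cl in enumerate(classes)}
actual_col = np.array([column_of[a] for a in actual])
prob_on_correct_class = pred[rows, actual_col]

print(predict, prob_on_predicted_class, prob_on_correct_class)
```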

We can compare the predictions to actual outcomes with a confusion matrix:

import sklearn.metrics

print(fitter.classes_)    
sklearn.metrics.confusion_matrix(d_prepared.yc, d_prepared.predict, labels=fitter.classes_)

['large' 'liminal' 'small']

array([[140,  35,   0],
       [  0, 113,  46],
       [  0,   0, 166]])

In the above confusion matrix, the entry [row, column] gives the number of true items of class[row] that were predicted to be class[column]. For example, the entry in the first row and second column ([0, 1] with zero-based indexing) gives the number of 'large' items predicted to be 'liminal'.
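
A labeled version of the same kind of table can make the row/column convention easier to read. The sketch below uses pandas.crosstab on a few hypothetical labels rather than the model output above; rows are actual classes and columns are predicted classes.

```python
import pandas as pd

labels = ['large', 'liminal', 'small']
actual = pd.Series(['large', 'large', 'liminal', 'small', 'liminal', 'small'],
                   name='actual')
predicted = pd.Series(['large', 'liminal', 'liminal', 'small', 'small', 'small'],
                      name='predicted')

# rows: actual class, columns: predicted class, in a fixed label order
confusion = pd.crosstab(actual, predicted).reindex(
    index=labels, columns=labels, fill_value=0)
print(confusion)

# number of 'large' items predicted to be 'liminal', by label and by position
by_label = confusion.loc['large', 'liminal']
by_position = confusion.iloc[0, 1]
```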

Now apply the model to new data.

# create the new data
dtest = make_data(450)

# prepare the new data with vtreat
dtest_prepared = transform.transform(dtest)

# apply the model to the prepared data
dtest_prepared = add_predictions(dtest_prepared, model_vars, fitter)
dtest_prepared = add_value_by_column(dtest_prepared, 'yc', 'prob_on_correct_class')

dtest_prepared[to_print].head()

	yc	predict	large	liminal	small	prob_on_predicted_class	prob_on_correct_class
0	large	large	0.999192	0.000548	0.000261	0.999192	0.999192
1	liminal	liminal	0.465065	0.534503	0.000432	0.534503	0.534503
2	large	liminal	0.465065	0.534503	0.000432	0.534503	0.465065
3	large	large	0.999192	0.000548	0.000261	0.999192	0.999192
4	liminal	small	0.000367	0.445570	0.554063	0.554063	0.44557

print(fitter.classes_)    
sklearn.metrics.confusion_matrix(dtest_prepared.yc, dtest_prepared.predict, labels=fitter.classes_)

['large' 'liminal' 'small']

array([[ 90,  52,   0],
       [  0, 112,  41],
       [  0,   0, 155]])

Parameters for `MultinomialOutcomeTreatment`

We've tried to set the defaults for all parameters so that vtreat is usable out of the box for most applications.

vtreat.vtreat_parameters()

{'use_hierarchical_estimate': True,
 'coders': {'clean_copy',
  'deviation_code',
  'impact_code',
  'indicator_code',
  'logit_code',
  'missing_indicator',
  'prevalence_code'},
 'filter_to_recommended': True,
 'indicator_min_fraction': 0.1,
 'cross_validation_plan': vtreat.cross_plan.KWayCrossPlanYStratified(),
 'cross_validation_k': 5,
 'user_transforms': [],
 'sparse_indicators': True,
 'missingness_imputation': <function numpy.mean(a, axis=None, dtype=None, out=None, keepdims=<no value>)>,
 'check_for_duplicate_frames': True,
 'retain_cross_plan': False}

use_hierarchical_estimate: When True, uses hierarchical smoothing when estimating logit_code variables; when False, uses unsmoothed logistic regression.

coders: The types of synthetic variables that vtreat will (potentially) produce. See Types of prepared variables below.

filter_to_recommended: When True, prepared data only includes variables marked as "recommended" in the score frame. When False, prepared data includes all variables. See the Example below.

indicator_min_fraction: For categorical variables, indicator variables (type indicator_code) are only produced for levels that are present at least indicator_min_fraction of the time. A consequence of this is that 1/indicator_min_fraction is the maximum number of indicators that will be produced for a given categorical variable. To make sure that all possible indicator variables are produced, set indicator_min_fraction = 0.

cross_validation_plan: The cross validation method used by vtreat. Most people won't have to change this.

cross_validation_k: The number of folds to use for cross-validation.

user_transforms: For passing in user-defined transforms for custom data preparation. Won't be needed in most situations, but see here for an example of applying a GAM transform to input variables.

sparse_indicators: When True, use a (Pandas) sparse representation for indicator variables. This representation is compatible with sklearn; however, it may not be compatible with other modeling packages. When False, use a dense representation.

missingness_imputation: The function or value that vtreat uses to impute or "fill in" missing numerical values. The default is numpy.mean(). To change the imputation function or use different functions/values for different columns, see the Imputation example.
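
As a rough picture of what mean imputation does to a numeric column, the pandas sketch below fills missing values with the mean of the observed training values and records where the fills happened. It is an illustration on hypothetical data, not vtreat's code; the assumption that the training-time fill value is reused on later data is ours.

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, np.nan, 3.0, np.nan, 8.0])   # hypothetical training column

x_is_bad = x.isna().astype(float)                # missing indicator
fill_value = np.mean(x.dropna())                 # statistic of the observed values
x_clean = x.fillna(fill_value)                   # clean numeric column, no NaNs

# assumption: later data is filled with the value learned from the training data
x_new = pd.Series([np.nan, 2.0])
x_new_clean = x_new.fillna(fill_value)

print(x_clean.tolist(), x_is_bad.tolist(), x_new_clean.tolist())
```

Keeping the missing indicator alongside the filled column lets a downstream model distinguish a genuinely average value from an imputed one.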

Example: Use all variables to model, not just recommended

transform_all = vtreat.MultinomialOutcomeTreatment(
    outcome_name='yc',    # outcome variable
    cols_to_copy=['y'],   # columns to "carry along" but not treat as input variables
    params = vtreat.vtreat_parameters({
        'filter_to_recommended': False
    })
)  

# the variable columns in the transformed data
omit = ['x', 'y','yc']
columns = transform_all.fit_transform(d, d['yc']).columns
the_vars = list(set(columns)-set(omit))
the_vars.sort()
the_vars

['x2',
 'x_is_bad',
 'xc_is_bad',
 'xc_lev__NA_',
 'xc_lev_level_-0_5',
 'xc_lev_level_0_5',
 'xc_lev_level_1_0',
 'xc_logit_code_large',
 'xc_logit_code_liminal',
 'xc_logit_code_small',
 'xc_prevalence_code']

# the variables marked "recommended" by the transform
score_frame = transform_all.score_frame_
recommended = list(score_frame.variable[score_frame.recommended].unique())
recommended.sort()
recommended

['xc_is_bad',
 'xc_lev__NA_',
 'xc_lev_level_-0_5',
 'xc_lev_level_0_5',
 'xc_lev_level_1_0',
 'xc_logit_code_large',
 'xc_logit_code_liminal',
 'xc_logit_code_small',
 'xc_prevalence_code']

Note that the prepared data produced by fit_transform() includes all the variables, including those that were not marked as "recommended" (if any).

Types of prepared variables

clean_copy: Produced from numerical variables: a clean numerical variable with no NaNs or missing values

indicator_code: Produced from categorical variables, one for each (common) level: for each level of the variable, indicates if that level was "on"

prevalence_code: Produced from categorical variables: indicates how often each level of the variable was "on"

logit_code: Produced from categorical variables: score from a one-dimensional "one versus rest" model of the centered output as a function of the variable. One logit_code variable is produced for each target class.

missing_indicator: Produced for both numerical and categorical variables: an indicator variable that marks when the original variable was missing or NaN

deviation_code: not used by MultinomialOutcomeTreatment

impact_code: not used by MultinomialOutcomeTreatment
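
For intuition about logit_code, the sketch below computes a simple in-sample log-ratio of each level's rate of a target class to that class's overall rate. This is an assumption-laden illustration only: the function naive_logit_code, the smoothing constant eps, and the toy data are hypothetical, and vtreat's actual estimate is cross-validated and may use hierarchical smoothing.

```python
import numpy as np
import pandas as pd

def naive_logit_code(cat, outcome, target, eps=1.0e-3):
    """Log-ratio of the per-level rate of `target` to its overall rate (in-sample)."""
    is_target = (outcome == target).astype(float)
    levels = cat.fillna('_NA_')                  # missing values form their own level
    per_level_rate = is_target.groupby(levels).transform('mean')
    overall_rate = is_target.mean()
    # eps keeps the logarithm finite when a level never hits the target
    return np.log((per_level_rate + eps) / (overall_rate + eps))

xc_toy = pd.Series(['a', 'a', 'b', 'b', None, None])
yc_toy = pd.Series(['large', 'large', 'small', 'liminal', 'small', 'small'])

# one code column per target class ("one versus rest")
codes = pd.DataFrame({
    f'xc_logit_code_{cl}': naive_logit_code(xc_toy, yc_toy, cl)
    for cl in ['large', 'liminal', 'small']
})
print(codes.round(3))
```

Positive values mark levels where the target class is more common than average, negative values mark levels where it is rarer, and a single numeric column per class replaces what could otherwise be many indicator columns.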

Example: Produce only a subset of variable types

In this example, suppose you only want to use indicators and continuous variables in your model; in other words, you only want to use variables of types (clean_copy, missing_indicator, and indicator_code), and no logit_code or prevalence_code variables.

transform_thin = vtreat.MultinomialOutcomeTreatment(
    outcome_name='yc',    # outcome variable
    cols_to_copy=['y'],   # columns to "carry along" but not treat as input variables
    params = vtreat.vtreat_parameters({
        'filter_to_recommended': False,
        'coders': {'clean_copy',
                   'missing_indicator',
                   'indicator_code',
                  }
    })
)

transform_thin.fit_transform(d, d['yc']).head()

	y	yc	x_is_bad	xc_is_bad	x	x2	xc_lev_level_1_0	xc_lev__NA_	xc_lev_level_-0_5
0	-0.956311	small	0.0	1.0	-1.088395	-1.424184	0.0	1.0	0.0
1	-0.671564	small	0.0	0.0	4.107277	0.427360	0.0	0.0	1.0
2	0.906303	large	0.0	0.0	7.406389	0.668849	1.0	0.0	0.0
3	0.222792	liminal	1.0	0.0	-0.057044	-0.015787	0.0	0.0	0.0
4	-0.975431	small	1.0	1.0	-0.057044	-0.491017	0.0	1.0	0.0

transform_thin.score_frame_

	variable	orig_variable	treatment	y_aware	has_range	PearsonR	R2	significance	vcount	default_threshold	recommended	outcome_target
0	x_is_bad	x	missing_indicator	False	True	-0.051749	0.002388	2.137073e-01	2.0	0.166667	False	large
1	xc_is_bad	xc	missing_indicator	False	True	-0.387438	0.169451	0.000000e+00	2.0	0.166667	True	large
2	x	x	clean_copy	False	True	0.052826	0.002158	2.371412e-01	2.0	0.166667	False	large
3	x2	x2	clean_copy	False	True	0.069126	0.003709	1.212047e-01	2.0	0.166667	True	large
4	xc_lev_level_1_0	xc	indicator_code	False	True	0.849837	0.645319	0.000000e+00	4.0	0.083333	True	large
5	xc_lev__NA_	xc	indicator_code	False	True	-0.387438	0.169451	0.000000e+00	4.0	0.083333	True	large
6	xc_lev_level_-0_5	xc	indicator_code	False	True	-0.373767	0.158566	0.000000e+00	4.0	0.083333	True	large
7	xc_lev_level_0_5	xc	indicator_code	False	True	0.102752	0.007894	2.377379e-02	4.0	0.083333	True	large
8	x_is_bad	x	missing_indicator	False	True	0.028292	0.000609	5.371098e-01	2.0	0.166667	False	liminal
9	xc_is_bad	xc	missing_indicator	False	True	-0.360534	0.155143	0.000000e+00	2.0	0.166667	True	liminal
10	x	x	clean_copy	False	True	-0.060936	0.002973	1.726990e-01	2.0	0.166667	False	liminal
11	x2	x2	clean_copy	False	True	0.001912	0.000003	9.658994e-01	2.0	0.166667	False	liminal
12	xc_lev_level_1_0	xc	indicator_code	False	True	-0.425828	0.209794	0.000000e+00	4.0	0.083333	True	liminal
13	xc_lev__NA_	xc	indicator_code	False	True	-0.360534	0.155143	0.000000e+00	4.0	0.083333	True	liminal
14	xc_lev_level_-0_5	xc	indicator_code	False	True	0.140658	0.015196	2.051361e-03	4.0	0.083333	True	liminal
15	xc_lev_level_0_5	xc	indicator_code	False	True	0.194241	0.028310	2.580974e-05	4.0	0.083333	True	liminal
16	x_is_bad	x	missing_indicator	False	True	0.024435	0.000451	5.921961e-01	2.0	0.166667	False	small
17	xc_is_bad	xc	missing_indicator	False	True	0.748935	0.489005	0.000000e+00	2.0	0.166667	True	small
18	x	x	clean_copy	False	True	0.006755	0.000036	8.799326e-01	2.0	0.166667	False	small
19	x2	x2	clean_copy	False	True	-0.071903	0.004068	1.078311e-01	2.0	0.166667	True	small
20	xc_lev_level_1_0	xc	indicator_code	False	True	-0.439636	0.218219	0.000000e+00	4.0	0.083333	True	small
21	xc_lev__NA_	xc	indicator_code	False	True	0.748935	0.489005	0.000000e+00	4.0	0.083333	True	small
22	xc_lev_level_-0_5	xc	indicator_code	False	True	0.239464	0.042966	1.734635e-07	4.0	0.083333	True	small
23	xc_lev_level_0_5	xc	indicator_code	False	True	-0.296154	0.105297	3.330669e-16	4.0	0.083333	True	small

Deriving the Default Thresholds

While machine learning algorithms are generally tolerant to a reasonable number of irrelevant or noise variables, too many irrelevant variables can lead to serious overfit; see this article for an extreme example, one we call "Bad Bayes". The default threshold is an attempt to eliminate obviously irrelevant variables early.

Imagine that you have a pure noise dataset, where none of the n inputs are related to the output. If you treat each variable as a one-variable model for the output, and look at the significances of each model, these significance-values will be uniformly distributed in the range [0, 1]. You want to pick the weakest possible significance threshold that still eliminates as many noise variables as possible. A moment's thought should convince you that a threshold of 1/n allows only one variable through, in expectation.

This leads to the general-case heuristic that a significance threshold of 1/n on your variables should allow only one irrelevant variable through, in expectation (along with all the relevant variables). Hence, 1/n used to be our recommended threshold, when we developed the R version of vtreat.
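
The "one variable through, in expectation" claim can be checked with a short simulation. The sketch below takes the uniform distribution of noise significances as given (it does not fit any models) and counts how many of n noise variables fall below a 1/n threshold, averaged over many simulated data sets. The sizes n_vars and n_trials are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vars = 50          # number of pure-noise variables
n_trials = 20000     # number of simulated data sets

# assumption: under pure noise, each significance is uniform on [0, 1]
p_values = rng.uniform(size=(n_trials, n_vars))

# count the noise variables that pass a 1/n threshold in each data set
passed = (p_values < 1.0 / n_vars).sum(axis=1)
mean_passed = passed.mean()
print(mean_passed)
```

The average count should be close to one, although any single data set may let through zero, one, or several noise variables.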

We noticed, however, that this biases the filtering against numerical variables, since there are at most two derived variables (of types clean_copy and missing_indicator) for every numerical variable in the original data. Categorical variables, on the other hand, are expanded to many derived variables: several indicators (one for every common level), plus a logit_code and a prevalence_code. So we now reweight the thresholds.

Suppose you have a (treated) data set with ntreat different types of vtreat variables (clean_copy, indicator_code, etc.). There are nT variables of type T. Then the default threshold for all the variables of type T is 1/(ntreat * nT). This reweighting helps to reduce the bias against any particular type of variable. The heuristic is still that the set of recommended variables will admit roughly one noise variable, in expectation, into the set of candidate variables.
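
The reweighted rule is easy to evaluate by hand. The sketch below applies it to the derived variables listed in the first score frame of this example; the helper default_thresholds is a hypothetical name, written only to show the arithmetic.

```python
from collections import Counter

def default_thresholds(treatment_of):
    """Threshold 1/(ntreat * nT) for each variable, given its treatment type."""
    type_counts = Counter(treatment_of.values())   # nT for each type T
    n_types = len(type_counts)                     # ntreat
    return {var: 1.0 / (n_types * type_counts[t])
            for var, t in treatment_of.items()}

# derived variables and their treatment types, as listed in the score frame
treatment_of = {
    'x_is_bad': 'missing_indicator', 'xc_is_bad': 'missing_indicator',
    'x': 'clean_copy', 'x2': 'clean_copy',
    'xc_logit_code_liminal': 'logit_code', 'xc_logit_code_large': 'logit_code',
    'xc_logit_code_small': 'logit_code',
    'xc_prevalence_code': 'prevalence_code',
    'xc_lev_level_1_0': 'indicator_code', 'xc_lev__NA_': 'indicator_code',
    'xc_lev_level_-0_5': 'indicator_code', 'xc_lev_level_0_5': 'indicator_code',
}

thresholds = default_thresholds(treatment_of)
for var, th in thresholds.items():
    print(f'{var}: {th:.6f}')
```

With five treatment types, this gives 0.1 for the two clean_copy variables, about 0.0667 for the three logit_code variables, 0.2 for the single prevalence_code, and 0.05 for the four indicators, matching the default_threshold column of that score frame.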

As noted above, because vtreat estimates variable significances using linear methods by default, some variables with a non-linear relationship to the output may fail to pass the threshold. Setting the filter_to_recommended parameter to False will keep all derived variables in the treated frame, for the data scientist to filter (or not) as they will.

Conclusion

In all cases (classification, regression, unsupervised, and multinomial classification) the intent is that vtreat transforms are essentially one liners.

The preparation commands are organized as follows:

Regression: Python regression example, R regression example, fit/prepare interface, R regression example, design/prepare/experiment interface.
Classification: Python classification example, R classification example, fit/prepare interface, R classification example, design/prepare/experiment interface.
Unsupervised tasks: Python unsupervised example, R unsupervised example, fit/prepare interface, R unsupervised example, design/prepare/experiment interface.
Multinomial classification: Python multinomial classification example, R multinomial classification example, fit/prepare interface, R multinomial classification example, design/prepare/experiment interface.

Some vtreat common capabilities are documented here:

Score Frame score_frame_, using the score_frame_ information.
Cross Validation Customized Cross Plans, controlling the cross validation plan.

These current revisions of the examples are designed to be small, yet complete. So as a set they have some overlap, but the user can rely mostly on a single example for a single task type.
