- Step 1: Read the data
import pandas as pd

X_full = pd.read_csv('../input/train.csv', index_col='Id')
X_test_full = pd.read_csv('../input/test.csv', index_col='Id')
# Obtain target and predictors
y = X_full["target"]
X = X_full.drop(columns=["target"])  # X excludes the "target" column
X_test = X_test_full.copy()
- Step 2: Break off validation set from training data
from sklearn.model_selection import train_test_split
# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)
- Step 3: Comparing different models
models = [model_1, model_2, model_3, model_4, model_5]
from sklearn.metrics import mean_absolute_error

# Function for comparing different models
def score_model(model, X_t=X_train, X_v=X_valid, y_t=y_train, y_v=y_valid):
    model.fit(X_t, y_t)
    preds = model.predict(X_v)
    return mean_absolute_error(y_v, preds)

for i, model in enumerate(models, start=1):
    mae = score_model(model)
    print("Model %d MAE: %d" % (i, mae))
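The notes do not show how model_1 to model_5 were defined. A minimal sketch of the same comparison, assuming synthetic data and three hypothetical random forest candidates (none of these names or settings come from the original):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the competition data (assumption, not the real dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, train_size=0.8, random_state=0)

# Hypothetical candidates; the settings are illustrative only
candidates = {
    "rf_10_trees": RandomForestRegressor(n_estimators=10, random_state=0),
    "rf_50_trees": RandomForestRegressor(n_estimators=50, random_state=0),
    "rf_50_trees_depth_5": RandomForestRegressor(n_estimators=50, max_depth=5, random_state=0),
}

def score(model):
    # Fit on the training split, report MAE on the validation split
    model.fit(X_tr, y_tr)
    return mean_absolute_error(y_va, model.predict(X_va))

maes = {name: score(model) for name, model in candidates.items()}
for name, mae in maes.items():
    print(f"{name} MAE: {mae:.2f}")

# The candidate with the lowest validation MAE
best_name = min(maes, key=maes.get)
```

Keeping the candidates in a dict means the printed name identifies each model directly instead of relying on its position in a list.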
# Get names of columns with missing values
cols_with_missing = [col for col in X_train.columns
if X_train[col].isnull().any()]
# Number of missing values in each column of training data
missing_val_count_by_column = (X_train.isnull().sum())
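To see what these two expressions return, here is a small sketch on a made-up DataFrame (the column names and values are hypothetical, chosen only for illustration):

```python
import numpy as np
import pandas as pd

# Toy data with missing values in two of the three columns
demo = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "GarageYrBlt": [2003.0, 1976.0, np.nan, 1998.0],
})

# Names of columns that contain at least one missing value
demo_cols_with_missing = [col for col in demo.columns if demo[col].isnull().any()]

# Number of missing values per column
demo_missing_counts = demo.isnull().sum()

print(demo_cols_with_missing)
print(demo_missing_counts[demo_missing_counts > 0])
```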
- Method 1: Drop Columns with Missing Values
- Method 2: Imputation
- Method 3: Extension To Imputation
# Drop columns in training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)
Imputation fills in the missing values with some number.
- strategy = "mean" or "median" for numerical columns
- strategy = "most_frequent" for object (categorical) columns
from sklearn.impute import SimpleImputer
# Imputation
my_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
#Only fit on training data
my_imputer.fit(X_train)
imputed_X_train = pd.DataFrame(my_imputer.transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))
# Imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns
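A mean imputer only works on numeric columns, so a mixed dataset needs one imputer per column type. A minimal sketch, assuming a toy frame with one numeric and one categorical column (the column names and values are made up):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "city": ["Hanoi", "Hanoi", np.nan],
})

# "mean" for the numerical column, "most_frequent" for the categorical one
num_imputer = SimpleImputer(strategy="mean")
cat_imputer = SimpleImputer(strategy="most_frequent")

filled = toy.copy()
filled["age"] = num_imputer.fit_transform(toy[["age"]]).ravel()
filled["city"] = cat_imputer.fit_transform(toy[["city"]]).ravel()
print(filled)
```

On real data the imputers would be fit on the training split only and then applied to the validation split with transform, as in the code above.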
- Imputation is the standard approach, and it usually works well.
- However, imputed values may be systematically above or below their actual values (which weren't collected in the dataset). Or rows with missing values may be unique in some other way.
- In that case, your model would make better predictions by considering which values were originally missing.
- Note: In some cases, this will meaningfully improve results. In other cases, it doesn't help at all.
# Make copy to avoid changing original data (when imputing)
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()
# Make new columns indicating what will be imputed
for col in cols_with_missing:
    X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
    X_valid_plus[col + '_was_missing'] = X_valid_plus[col].isnull()
# Imputation
my_imputer = SimpleImputer()
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))
# Imputation removed column names; put them back
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns
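As an alternative to building the _was_missing columns by hand, SimpleImputer has an add_indicator option that appends indicator columns itself. This is a different route from the one in the notes, sketched here on a small made-up array:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Column 0 has one missing value, column 1 has none
X_demo = np.array([[1.0, 10.0],
                   [np.nan, 20.0],
                   [3.0, 30.0]])

# add_indicator=True appends one 0/1 column per feature that had missing values during fit
imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_out = imputer.fit_transform(X_demo)
print(X_out)
```

The output has three columns: the two imputed features followed by the indicator for column 0. Column names still have to be restored manually when the result is wrapped in a DataFrame.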
- There are 4 types of categorical variables:
- Nominal: unordered values like "Honda", "Toyota", and "Ford"
- Ordinal: the order is important
  - For tree-based models (like decision trees and random forests), you can expect ordinal encoding to work well with ordinal variables
  - Label Encoder → maps values to 1, 2, 3, 4, etc. → use with tree-based models: Random Forest, GBM, XGBoost
  - Binary Encoder → binary-representation vectors of the 1, 2, 3, 4, etc. values → use with Logistic and Linear Regression, SVM
- Binary: only 2 values (Female, Male)
- Cyclic: Monday, Tuesday, Wednesday, Thursday
- Determine Categorical Columns:
# Categorical columns in the training data
object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]
- Filter good and problematic categorical columns, since they affect the encoding procedure:
- For example, the unique values in the training data may differ from the unique values in the validation data → Solution: ensure the values in the validation data are a subset of the values in the training data
# Categorical columns in the training data
object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]
# Columns that can be safely ordinal encoded
good_label_cols = [col for col in object_cols if
set(X_valid[col]).issubset(set(X_train[col]))]
# Problematic columns that will be dropped from the dataset
bad_label_cols = list(set(object_cols)-set(good_label_cols))
print('Categorical columns that will be ordinal encoded:', good_label_cols)
print('\nCategorical columns that will be dropped from the dataset:', bad_label_cols)
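A small sketch of the subset check on made-up data, assuming two hypothetical categorical columns where one validation value never appears in training:

```python
import pandas as pd

train_demo = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"],
                           "Roof": ["Gable", "Hip", "Gable"]})
valid_demo = pd.DataFrame({"Street": ["Pave", "Pave"],
                           "Roof": ["Gable", "Flat"]})  # "Flat" never appears in training

demo_object_cols = list(train_demo.columns)

# A column is safe when every validation value was also seen in training
demo_good = [c for c in demo_object_cols
             if set(valid_demo[c]).issubset(set(train_demo[c]))]
demo_bad = sorted(set(demo_object_cols) - set(demo_good))

print("Safe to encode:", demo_good)
print("Problematic:", demo_bad)
```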
- The simplest approach, however, is to drop the problematic categorical columns.
# Drop categorical columns that will not be encoded
label_X_train = X_train.drop(bad_label_cols, axis=1)
label_X_valid = X_valid.drop(bad_label_cols, axis=1)
There are 5 methods to encode Categorical variables
- Method 1: Drop Categorical Variables
- Method 2: Ordinal Encoding
- Method 3: Label Encoding (same as Ordinal Encoding, but does NOT take the order into account)
- Method 4: One-Hot Encoding
- Method 5: Entity Embedding (Need to learn from Video: https://youtu.be/EATAM3BOD_E)
- This approach will only work well if the columns did not contain useful information.
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])
- This approach assumes an ordering of the categories: "Never" (0) < "Rarely" (1) < "Most days" (2) < "Every day" (3).
from sklearn.preprocessing import OrdinalEncoder
# Apply ordinal encoder
ordinal_encoder = OrdinalEncoder()
ordinal_encoder.fit(label_X_train[good_label_cols])
label_X_train[good_label_cols] = ordinal_encoder.transform(label_X_train[good_label_cols])
label_X_valid[good_label_cols] = ordinal_encoder.transform(label_X_valid[good_label_cols])
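By default OrdinalEncoder assigns codes in sorted order of the values, which does not match an ordering such as "Never" < "Rarely" < "Most days" < "Every day". A minimal sketch of passing the order explicitly, assuming a hypothetical survey column and scikit-learn 0.24 or later for the unknown-value handling:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Intended order, from lowest to highest
freq_order = ["Never", "Rarely", "Most days", "Every day"]

survey_train = pd.DataFrame({"breakfast": ["Rarely", "Every day", "Never", "Most days"]})
survey_valid = pd.DataFrame({"breakfast": ["Most days", "Sometimes"]})  # "Sometimes" is unseen

# categories fixes the code for each value; unseen values are mapped to -1
encoder = OrdinalEncoder(categories=[freq_order],
                         handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(survey_train[["breakfast"]]).ravel()
valid_codes = encoder.transform(survey_valid[["breakfast"]]).ravel()

print(train_codes)
print(valid_codes)
```

Mapping unseen values to -1 is another way to handle problematic columns instead of dropping them; whether that helps depends on the model and the data.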
- Same as the Ordinal Encoder, but it does NOT take a meaningful order into account: the codes follow the alphabetical order of the values
- Label Encoder needs to be fit on each column separately
from sklearn.preprocessing import LabelEncoder
# Drop categorical columns that will not be encoded
label_X_train = X_train.drop(bad_label_cols, axis=1)
label_X_valid = X_valid.drop(bad_label_cols, axis=1)
for c in good_label_cols:
    label_encoder = LabelEncoder()
    label_encoder.fit(label_X_train[c])
    label_X_train[c] = label_encoder.transform(label_X_train[c])
    label_X_valid[c] = label_encoder.transform(label_X_valid[c])
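A short sketch showing that LabelEncoder assigns codes by sorted (alphabetical) order rather than by any meaningful ranking, using made-up values. Note that scikit-learn documents LabelEncoder as intended for target labels, which is why it handles one column at a time:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["Rarely", "Every day", "Never", "Most days"])

# classes_ holds the values in sorted order; each value's code is its position there
print(list(le.classes_))
print(codes.tolist())
```

Here "Every day" gets code 0 and "Never" gets code 2, so the codes do not reflect frequency.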
Cardinality: the number of unique entries of a categorical variable
- For instance, the Street column in the training data has two unique values, Grvl and Pave, so the Street column has cardinality 2.
- For large datasets with many rows, one-hot encoding can greatly expand the size of the dataset.
- Hence, we typically one-hot encode only columns with relatively low cardinality. High-cardinality columns can either be dropped from the dataset or ordinal encoded.
# Columns that will be one-hot encoded
low_cardinality_cols = [col for col in object_cols if X_train[col].nunique() < 10]
# Columns that will be dropped from the dataset
high_cardinality_cols = list(set(object_cols)-set(low_cardinality_cols))
print('Categorical columns that will be one-hot encoded:', low_cardinality_cols)
print('\nCategorical columns that will be dropped from the dataset:', high_cardinality_cols)
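A quick way to inspect cardinality before choosing a threshold is nunique on the whole frame. A sketch on made-up data (the threshold of 3 is only for this toy example, not a recommendation):

```python
import pandas as pd

homes = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave", "Pave"],
    "Neighborhood": ["A", "B", "C", "D"],
})

# Number of unique entries per column, lowest first
cardinality = homes.nunique().sort_values()
print(cardinality)

# Columns below the (toy) threshold would be one-hot encoded
low = [c for c in homes.columns if homes[c].nunique() < 3]
```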
- One-hot encoding generally does NOT perform well if the categorical variable has cardinality >= 15, as the one-hot encoder expands the training data by one column per unique value.
- Set handle_unknown='ignore' to avoid errors when the validation data contains classes that aren't represented in the training data.
- Set sparse=False to ensure that the encoded columns are returned as a numpy array (instead of a sparse matrix). In scikit-learn 1.2 and later this parameter is named sparse_output.
from sklearn.preprocessing import OneHotEncoder
# On scikit-learn >= 1.2, pass sparse_output=False instead of sparse=False
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_encoder.fit(X_train[low_cardinality_cols])
OH_cols_train = pd.DataFrame(OH_encoder.transform(X_train[low_cardinality_cols])) #Convert back to Pandas DataFrame from Numpy Array
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[low_cardinality_cols]))
# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index
# Remove categorical columns in the original datasets (will replace with one-hot encoding columns)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)
# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)
- Pipelines are a simple way to keep your data preprocessing and modeling code organized.
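A minimal sketch of a pipeline that bundles imputation, one-hot encoding, and a model, assuming a tiny made-up dataset with one numeric and one categorical column (all names and values are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "area": [50.0, 80.0, np.nan, 120.0, 65.0, 95.0],
    "street": ["Pave", "Grvl", "Pave", np.nan, "Pave", "Grvl"],
})
price = pd.Series([100, 150, 130, 220, 120, 170])

# Preprocessing for numerical data: fill gaps with the median
numeric = SimpleImputer(strategy="median")

# Preprocessing for categorical data: fill gaps, then one-hot encode
categorical = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Route each column group to its own preprocessing
preprocess = ColumnTransformer(transformers=[
    ("num", numeric, ["area"]),
    ("cat", categorical, ["street"]),
])

# Preprocessing and model in one object
pipe = Pipeline(steps=[
    ("preprocess", preprocess),
    ("model", RandomForestRegressor(n_estimators=20, random_state=0)),
])

pipe.fit(df, price)
pipe_preds = pipe.predict(df)
```

Because the preprocessing is fit inside the pipeline, calling pipe.predict on validation or test data applies the same imputation and encoding without separate transform calls.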
# Get current yticks: An array of the values displayed on the y-axis (150, 175, 200, etc.)
ticks = ax.get_yticks()
# Format those values into strings beginning with dollar sign
new_labels = [f"${int(tick)}" for tick in ticks]
# Set the new labels
ax.set_yticklabels(new_labels)
Models can suffer from either:
- Overfitting: capturing spurious patterns that won't recur in the future, leading to less accurate predictions
- Where a model matches the training data almost perfectly, but does poorly in validation and other new data.
- Underfitting: failing to capture relevant patterns, again leading to less accurate predictions.
- When a model fails to capture important distinctions and patterns in the data, so it performs poorly even in training data
- The max_leaf_nodes argument provides a very sensible way to control overfitting vs underfitting. The more leaves we allow the model to make, the more we move from the underfitting area to the overfitting area.
- We can use a utility function to help compare MAE scores from different values for max_leaf_nodes:
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae
- Call the get_mae function on each value of max_leaf_nodes. Store the output in some way that allows you to select the value of max_leaf_nodes that gives the most accurate model on your data.
scores = {leaf_size: get_mae(leaf_size, train_X, val_X, train_y, val_y) for leaf_size in candidate_max_leaf_nodes}
best_tree_size = min(scores.keys(), key=(lambda k: scores[k]))
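The snippet above relies on candidate_max_leaf_nodes and the data splits being defined elsewhere. A self-contained sketch of the same search, assuming synthetic data and a hypothetical list of candidate sizes:

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption, not the course dataset)
X_syn, y_syn = make_regression(n_samples=300, n_features=4, noise=15.0, random_state=0)
tr_X, va_X, tr_y, va_y = train_test_split(X_syn, y_syn, random_state=0)

leaf_candidates = [5, 25, 50, 100, 250]  # hypothetical candidate sizes

# Validation MAE for each tree size
leaf_scores = {}
for n_leaves in leaf_candidates:
    tree = DecisionTreeRegressor(max_leaf_nodes=n_leaves, random_state=0)
    tree.fit(tr_X, tr_y)
    leaf_scores[n_leaves] = mean_absolute_error(va_y, tree.predict(va_X))

# Tree size with the lowest validation MAE
best_leaves = min(leaf_scores, key=leaf_scores.get)

# Once the size is chosen, refit on all available data
final_tree = DecisionTreeRegressor(max_leaf_nodes=best_leaves, random_state=0)
final_tree.fit(X_syn, y_syn)
```

Refitting on all the data after the size is chosen is a common last step, since the validation split is no longer needed for model selection.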
- The evaluation metric for a competition is usually specified under Kaggle Competition > Evaluation
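# Mean absolute error (MAE); scikit-learn's argument order is (y_true, y_pred)
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_pred)
# Root mean squared error (RMSE)
import numpy as np
from sklearn.metrics import mean_squared_error
np.sqrt(mean_squared_error(y_test, y_pred))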
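To make the two metrics concrete, here is a small worked example that computes them by hand with numpy and compares against scikit-learn (the numbers are made up):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.0, 5.0, 4.0, 8.0])

errors = y_true - y_hat  # [1, 0, -2, -1]

# MAE: mean of absolute errors, (1 + 0 + 2 + 1) / 4 = 1.0
mae_manual = np.mean(np.abs(errors))

# RMSE: square root of the mean squared error, sqrt((1 + 0 + 4 + 1) / 4) = sqrt(1.5)
rmse_manual = np.sqrt(np.mean(errors ** 2))

print(mae_manual, mean_absolute_error(y_true, y_hat))
print(rmse_manual, np.sqrt(mean_squared_error(y_true, y_hat)))
```

RMSE is larger than MAE here because squaring gives the error of 2 more weight than the errors of 1.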
- The goal of ensemble methods is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability / robustness over a single estimator (for classification, regression and anomaly detection).
- Two families of ensemble methods:
- In averaging methods, the driving principle is to build several estimators independently and then to average their predictions. On average, the combined estimator is usually better than any single base estimator because its variance is reduced.
- Examples: Bagging methods, Forests of randomized trees, etc.
- In boosting methods, base estimators are built sequentially and one tries to reduce the bias of the combined estimator. The motivation is to combine several weak models to produce a powerful ensemble.
- Examples: AdaBoost, Gradient Tree Boosting, etc.
- Decision trees leave you with a difficult decision.
- A deep tree with lots of leaves will overfit because each prediction comes from only the few historical data points at its leaf.
- But a shallow tree with few leaves will perform poorly because it fails to capture as many distinctions in the raw data.
- The random forest uses many trees, and it makes a prediction by averaging the predictions of each component tree.
- It generally has much better predictive accuracy than a single decision tree and it works well with default parameters.
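A minimal sketch of a random forest next to a single decision tree, assuming synthetic data. It also checks that the forest's prediction is the average of its component trees' predictions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption, not the course dataset)
X_syn, y_syn = make_regression(n_samples=300, n_features=6, noise=20.0, random_state=1)
tr_X, va_X, tr_y, va_y = train_test_split(X_syn, y_syn, random_state=1)

single_tree = DecisionTreeRegressor(random_state=1).fit(tr_X, tr_y)
forest = RandomForestRegressor(n_estimators=100, random_state=1).fit(tr_X, tr_y)

tree_mae = mean_absolute_error(va_y, single_tree.predict(va_X))
forest_preds = forest.predict(va_X)
forest_mae = mean_absolute_error(va_y, forest_preds)
print(f"Single tree MAE: {tree_mae:.2f}")
print(f"Random forest MAE: {forest_mae:.2f}")

# Average the individual trees by hand; this matches forest.predict
manual_average = np.mean([t.predict(va_X) for t in forest.estimators_], axis=0)
```

How the two MAE values compare depends on the data, so the sketch prints them rather than assuming a result.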
# Example of Gradient Boosting - Regressor
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
gbm_model = GradientBoostingRegressor(random_state=1, n_estimators=500)
gbm_model.fit(train_X, train_y)
gbm_val_predictions = gbm_model.predict(val_X)
gbm_val_rmse = np.sqrt(mean_squared_error(val_y, gbm_val_predictions))
predictions = model.predict(X_test)
output = pd.DataFrame({'id': test_data.id, 'Survived': predictions})
output.to_csv('my_submission.csv', index=False)