#Project process#
##Use R as the tool for this project##
- Load the dataset with the `readr` package.
- `fread` from `data.table` can also be used.
- `read.csv` also works, but it is slow and only suitable for small datasets.
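The three loading options can be sketched as follows. This is a minimal sketch, not the project's actual loading code: the file is a throwaway CSV with made-up columns (`ID`, `v1`, `v3`), and `read_any` is a hypothetical helper that uses whichever reader is installed.

```R
# Write a tiny CSV so the sketch is self-contained (columns are made up).
path <- tempfile(fileext = ".csv")
write.csv(data.frame(ID = 1:3, v1 = c(1.5, NA, 2), v3 = c("A", "B", "C")),
          path, row.names = FALSE)

# Hypothetical helper: prefer readr, then data.table::fread, then base R.
read_any <- function(path) {
  if (requireNamespace("readr", quietly = TRUE)) {
    as.data.frame(suppressMessages(readr::read_csv(path)))
  } else if (requireNamespace("data.table", quietly = TRUE)) {
    as.data.frame(data.table::fread(path))
  } else {
    read.csv(path, stringsAsFactors = FALSE)
  }
}

train <- read_any(path)
str(train)
```

Wrapping the result in `as.data.frame()` keeps the later steps independent of which reader was used.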
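- Encode the categorical (character) columns, mainly so that NAs can be visualized (optional for XGBoost). The R code below turns each character column into a factor whose levels are shared by `train` and `test`; wrap the result in `as.integer()` when numeric codes are needed:

```R
# Turn every character column into a factor with levels shared by train and test
for (f in names(train)) {
  if (is.character(train[[f]])) {
    levels <- unique(c(train[[f]], test[[f]]))
    train[[f]] <- factor(train[[f]], levels = levels)
    test[[f]]  <- factor(test[[f]], levels = levels)
  }
}
```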
- Visualize NAs with the `VIM` package to explore the structure of the missing values (they appear to be Missing Not at Random).
- The source code comes from a Kaggle BNP script.
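As a complement to the `VIM` plots (for example `aggr()`), the missingness structure can also be summarized numerically. The sketch below is not the Kaggle script mentioned above. It uses base R only, and the toy data frame and its columns `v1` to `v3` are made up for illustration.

```R
# Toy data with made-up columns; NAs are placed by hand.
df <- data.frame(v1 = c(1, NA, 3, NA),
                 v2 = c(NA, NA, 1, 2),
                 v3 = c("a", "b", NA, "d"),
                 stringsAsFactors = FALSE)

# Share of missing values per column.
na_share <- colMeans(is.na(df))

# Missingness pattern per row: one digit per column, 1 = missing.
pattern <- apply(is.na(df), 1, function(r) paste(as.integer(r), collapse = ""))
pattern_counts <- sort(table(pattern), decreasing = TRUE)

# Correlation between missingness indicators: do columns go missing together?
na_cor <- cor(is.na(df) * 1)

print(na_share)
print(pattern_counts)
```

Strong correlations in `na_cor`, or a few dominant row patterns, suggest that values are missing in blocks rather than at random.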
- Do some exploratory data analysis.
- Analyze duplicate variables. However, checking them manually is not realistic when there are thousands of variables.
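One way to avoid the manual check is to let R compare whole columns. This is a minimal sketch on a made-up data frame, and it only catches columns that are exactly identical (same type and same values), not near-duplicates.

```R
# Toy data: column b is an exact copy of column a.
df <- data.frame(a = c(1, 2, 3),
                 b = c(1, 2, 3),
                 c = c("x", "y", "z"),
                 d = c(3, 2, 1),
                 stringsAsFactors = FALSE)

# duplicated() on a list flags each element identical to an earlier one.
dup <- duplicated(as.list(df))
dup_names <- names(df)[dup]

# Keep the first occurrence of every distinct column.
df_unique <- df[, !dup, drop = FALSE]

print(dup_names)
```

Columns that are duplicates only up to scaling or recoding are not flagged here. The correlation filter is one way to catch those among numeric variables.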
- Imputation and feature engineering
- Find and remove redundant variables. Use a correlation filter to find highly correlated variables. The correlation filter applies only to numeric variables.
```R
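library(corrplot)
library(caret)
# Drop the first two columns (assumed to be non-feature columns such as the ID and target)
temp <- train.num[, -(1:2)]
# Pairwise-complete observations let cor() cope with NA values
corr.Matrix <- cor(temp, use = "pairwise.complete.obs")
# findCorrelation() returns the indices of the columns to REMOVE
corr.75 <- findCorrelation(corr.Matrix, cutoff = 0.75)  # also try thresholds 0.85 and 0.9
# Keep only the columns that were not flagged as highly correlated
train.num.75 <- temp[, setdiff(seq_len(ncol(temp)), corr.75)]
corrplot(corr.Matrix, order = "hclust")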
```
- Try various imputation methods:
  * Impute the default value -1. This could be the baseline method.
  * Try kNN imputation (`knnImpute`).
  * Imputation for categorical variables: how to do this in R?
  * Optionally, try Amelia and multiple imputation. Do some research on a multiple imputation course.
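For the open question about categorical variables, two simple options are sketched below in base R: fill NAs with the most frequent level, or treat missingness as its own level. Both helpers (`impute_mode`, `impute_level`) and the `"MISSING"` label are hypothetical names chosen for illustration, not part of any package.

```R
# Option 1: replace NAs with the most frequent value (the mode).
impute_mode <- function(x) {
  x <- as.character(x)
  if (all(is.na(x))) return(x)               # nothing to learn the mode from
  mode_value <- names(which.max(table(x)))   # table() ignores NAs by default
  x[is.na(x)] <- mode_value
  x
}

# Option 2: keep the information that the value was missing as its own level.
impute_level <- function(x, label = "MISSING") {
  x <- as.character(x)
  x[is.na(x)] <- label
  x
}

v <- c("A", "B", NA, "B", NA)
print(impute_mode(v))
print(impute_level(v))
```

The second option may suit data where missingness itself is informative, which is plausible when values are Missing Not at Random.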
- Use entropy-based methods to select the variables most related to the target variable. This takes a long time because heap memory is limited in R.
  * `information.gain`
  * `gain.ratio`
  * `symmetrical.uncertainty`
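To make the idea behind these scores concrete, the sketch below computes information gain for a categorical feature in base R, as the target entropy minus the conditional entropy given the feature. It is an illustration only, not the implementation used by the functions listed above. `entropy` and `info_gain` are hypothetical helpers, and a numeric feature would first have to be discretized.

```R
# Shannon entropy (in bits) of a discrete vector.
entropy <- function(x) {
  p <- table(x) / length(x)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain of a categorical feature with respect to the target:
# H(target) - sum over feature values of P(value) * H(target | value).
info_gain <- function(feature, target) {
  keep <- !is.na(feature) & !is.na(target)
  feature <- feature[keep]
  target <- target[keep]
  groups <- split(target, feature, drop = TRUE)
  conditional <- sum(sapply(groups, function(g) length(g) / length(target) * entropy(g)))
  entropy(target) - conditional
}

target <- c(0, 0, 1, 1)
informative <- c("a", "a", "b", "b")    # determines the target completely
uninformative <- c("a", "b", "a", "b")  # carries no information about it
print(info_gain(informative, target))
print(info_gain(uninformative, target))
```

Ranking features by this score and keeping the top ones is the filter-style selection described above. Gain ratio and symmetrical uncertainty normalize the same quantity by entropy terms.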
- Read this paper for a deeper understanding of feature selection: "An Introduction to Variable and Feature Selection".
- At this point there are several versions of the training dataset, produced by data cleaning, imputation and feature selection.
- The baseline preprocessing method uses all the variables and imputes -1 for NAs.
- Use the missing-value count per observation as a predictor.
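The baseline and the missing-count predictor can be combined as in the sketch below. It is a minimal sketch on a made-up data frame: the column names (`target`, `v1`, `v2`) and the new column name `na_count` are assumptions, and all feature columns are assumed to be numeric.

```R
# Toy training data with made-up columns.
train <- data.frame(target = c(1, 0, 1),
                    v1 = c(0.5, NA, 1.2),
                    v2 = c(NA, NA, 3))

feature_cols <- setdiff(names(train), "target")

# Count NAs per row BEFORE imputing; afterwards the information is gone.
train$na_count <- rowSums(is.na(train[, feature_cols, drop = FALSE]))

# Baseline imputation: replace every remaining NA with -1.
train[is.na(train)] <- -1

print(train)
```

The order matters: computing `na_count` after the imputation would give zero for every row.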