
# Project process

## R as the tool for this project

  1. Load the dataset with the package "readr".

  2. fread (from the "data.table" package) can also be used.

  3. read.csv also works, but it is slow and only suitable for small datasets (see the loading sketch below).
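
A minimal loading sketch, assuming the competition files are named `train.csv` and `test.csv` in the working directory:

```R
library(readr)
library(data.table)

# readr::read_csv: fast and convenient
train <- read_csv("train.csv")
test  <- read_csv("test.csv")

# data.table::fread: usually the fastest option for large files
train <- as.data.frame(fread("train.csv"))
test  <- as.data.frame(fread("test.csv"))
```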

  4. Convert the categorical columns to numeric values; this is mainly for visualizing NAs (optional for XGBoost). The R code is below:

    # Build a shared level set from train and test so both datasets
    # get a consistent encoding; the factors can later be coerced to
    # integers if a fully numeric matrix is needed
    for (f in names(train)) {
      if (is.character(train[[f]])) {
        levels <- unique(c(train[[f]], test[[f]]))
        train[[f]] <- factor(train[[f]], levels = levels)
        test[[f]]  <- factor(test[[f]],  levels = levels)
      }
    }
    
  5. Visualize the NAs with the package "VIM" and explore the structure of the missing values (Missing Not At Random). A sketch follows.
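
A minimal VIM sketch; `aggr` plots the share of NAs per variable and the most common missingness patterns:

```R
library(VIM)

# Proportion of missing values per variable, plus the combinations
# of variables that tend to be missing together
aggr(train, numbers = TRUE, sortVars = TRUE)
```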

  6. Do some exploratory data analysis.

  7. Analyze duplicate variables. Manual inspection, however, is not realistic when there are thousands of variables, so an automated check helps (see the sketch below).
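
One automated alternative is a sketch like the following, which flags columns that exactly duplicate an earlier one:

```R
# duplicated() on the column list compares whole columns at once
dup.cols <- names(train)[duplicated(as.list(train))]
train <- train[, !(names(train) %in% dup.cols)]
```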

  8. Imputation and feature engineering.

  9. Find and remove redundant variables. Use a correlation filter to find highly correlated variables; the correlation method applies to numeric values only.

```R
library(corrplot)
library(caret)

temp <- train.num[, -1:-2]  # drop the first two columns (ID and target)
corr.Matrix <- cor(temp, use = "pairwise.complete.obs")  # pairwise handles the NA values
corrplot(corr.Matrix, order = "hclust")

# findCorrelation returns the indices of the columns to REMOVE
corr.75 <- findCorrelation(corr.Matrix, cutoff = 0.75)   # try thresholds 0.85 and 0.9 too
train.num.75 <- temp[, -corr.75]
```
  10. Try various imputation methods (a sketch of the first two follows this item):
     * Impute the default value -1; this can serve as the baseline method.
     * Try KNN imputation.
     * Imputation for categorical variables: how to do this in R?
     * Optionally, Amelia and Multiple Imputation; do some research on a multiple imputation course.
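
A sketch of the first two options, assuming `train.num` holds the numeric columns as in the correlation step; `DMwR::knnImputation` is one possible KNN implementation:

```R
library(DMwR)

# Baseline: replace every NA with the default value -1
train.base <- train.num
train.base[is.na(train.base)] <- -1

# KNN: fill each NA from the 10 nearest complete neighbours
train.knn <- knnImputation(train.num, k = 10)
```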

  11. Use an entropy-based method to choose variables related to the target variable. This can take a long time because of the heap memory limit in R:
     * information.gain
     * gain.ratio
     * symmetrical.uncertainty
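
These three measures are provided by the FSelector package. A sketch, assuming the target column is named `target`:

```R
library(FSelector)

# Entropy-based relevance of each predictor to the target
ig <- information.gain(target ~ ., data = train)
gr <- gain.ratio(target ~ ., data = train)
su <- symmetrical.uncertainty(target ~ ., data = train)

# Keep the k highest-scoring variables, e.g. the top 50
top.vars <- cutoff.k(ig, 50)
```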

  12. Read this paper to get a deep understanding of feature selection: "An Introduction to Variable and Feature Selection".

  13. At this point there are several versions of the training dataset, produced by different combinations of data cleaning, imputation, and feature selection.

  14. The baseline preprocessing method uses all the variables and imputes -1 for the NAs.

## Kaggle Forum

  1. Use the missing-value count per observation as a predictor (one-line sketch below).
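
A sketch of that forum idea; the feature name `na.count` here is arbitrary:

```R
# Count of missing values in each row, added as a new feature
train$na.count <- rowSums(is.na(train))
test$na.count  <- rowSums(is.na(test))
```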
