
# Project process

## R as the tool for this project

  1. Load the dataset with the package "readr".

  2. fread (from the "data.table" package) can also be used.

  3. read.csv also works, but it is slow and only suitable for small datasets (see the loading sketch below).
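
A minimal loading sketch, assuming the competition files are named `train.csv` and `test.csv` in the working directory:

```R
library(readr)
library(data.table)

# readr::read_csv: fast and convenient
train <- read_csv("train.csv")
test  <- read_csv("test.csv")

# data.table::fread: usually the fastest option for large files
train <- as.data.frame(fread("train.csv"))
test  <- as.data.frame(fread("test.csv"))
```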

  4. Convert the categorical columns to numeric values; this is mainly for visualizing NAs (optional for XGBoost). The R code is below:

    # Build a shared level set from train and test so both datasets
    # get a consistent encoding; the factors can later be coerced to
    # integers if a fully numeric matrix is needed
    for (f in names(train)) {
      if (is.character(train[[f]])) {
        levels <- unique(c(train[[f]], test[[f]]))
        train[[f]] <- factor(train[[f]], levels = levels)
        test[[f]]  <- factor(test[[f]],  levels = levels)
      }
    }
    
  5. Visualize the NAs with the package "VIM" and explore the structure of the missing values (Missing Not At Random). A sketch follows.
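
A minimal VIM sketch; `aggr` plots the share of NAs per variable and the most common missingness patterns:

```R
library(VIM)

# Proportion of missing values per variable, plus the combinations
# of variables that tend to be missing together
aggr(train, numbers = TRUE, sortVars = TRUE)
```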

  6. Do some exploratory data analysis.

  7. Analyze duplicate variables. Manual inspection, however, is not realistic when there are thousands of variables, so an automated check helps (see the sketch below).
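
One automated alternative is a sketch like the following, which flags columns that exactly duplicate an earlier one:

```R
# duplicated() on the column list compares whole columns at once
dup.cols <- names(train)[duplicated(as.list(train))]
train <- train[, !(names(train) %in% dup.cols)]
```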

  8. Imputation and feature engineering.

  9. Find and remove redundant variables. Use a correlation filter to find highly correlated variables; the correlation method applies to numeric values only.

```R
library(corrplot)
library(caret)

temp <- train.num[, -1:-2]  # drop the first two columns (ID and target)
corr.Matrix <- cor(temp, use = "pairwise.complete.obs")  # pairwise handles the NA values
corrplot(corr.Matrix, order = "hclust")

# findCorrelation returns the indices of the columns to REMOVE
corr.75 <- findCorrelation(corr.Matrix, cutoff = 0.75)   # try thresholds 0.85 and 0.9 too
train.num.75 <- temp[, -corr.75]
```
  10. Try various imputation methods (a sketch of the first two follows this item):
     * Impute the default value -1; this can serve as the baseline method.
     * Try KNN imputation.
     * Imputation for categorical variables: how to do this in R?
     * Optionally, Amelia and Multiple Imputation; do some research on a multiple imputation course.
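
A sketch of the first two options, assuming `train.num` holds the numeric columns as in the correlation step; `DMwR::knnImputation` is one possible KNN implementation:

```R
library(DMwR)

# Baseline: replace every NA with the default value -1
train.base <- train.num
train.base[is.na(train.base)] <- -1

# KNN: fill each NA from the 10 nearest complete neighbours
train.knn <- knnImputation(train.num, k = 10)
```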

  11. Use an entropy-based method to choose variables related to the target variable. This can take a long time because of the heap memory limit in R:
     * information.gain
     * gain.ratio
     * symmetrical.uncertainty
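
These three measures are provided by the FSelector package. A sketch, assuming the target column is named `target`:

```R
library(FSelector)

# Entropy-based relevance of each predictor to the target
ig <- information.gain(target ~ ., data = train)
gr <- gain.ratio(target ~ ., data = train)
su <- symmetrical.uncertainty(target ~ ., data = train)

# Keep the k highest-scoring variables, e.g. the top 50
top.vars <- cutoff.k(ig, 50)
```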

  12. Read this paper to get a deep understanding of feature selection: "An Introduction to Variable and Feature Selection".

  13. At this point there are several versions of the training dataset, produced by different combinations of data cleaning, imputation, and feature selection.

  14. The baseline preprocessing method uses all the variables and imputes -1 for the NAs.

## Kaggle Forum

  1. Use the missing-value count per observation as a predictor (one-line sketch below).
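
A sketch of that forum idea; the feature name `na.count` here is arbitrary:

```R
# Count of missing values in each row, added as a new feature
train$na.count <- rowSums(is.na(train))
test$na.count  <- rowSums(is.na(test))
```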
