Analysis of credit card holders' backgrounds using machine learning on both quantitative and qualitative responses. This dataset contains information on default payments, demographic factors, credit data, payment history, and bill statements of credit card clients in Taiwan from April 2005 to September 2005.
The goal is to find the best-fitting algorithm for predicting the amount of given credit (in NT dollars) from the other factors, using the Bayesian information criterion to identify the most important variables.
To generate insights from the "Default of Credit Card Clients Dataset", the first step is exploratory data analysis. I performed initial investigations on the data and produced summary statistics and graphical representations using ggplot2 in R.
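As an illustration of this kind of exploratory summary, the sketch below computes the default rate per credit-limit bin and plots it with ggplot2. It runs on synthetic data, and the column names (`limit_bal`, `default`) and bin boundaries are assumptions chosen for the example, not the ones used in the actual analysis.

```r
library(ggplot2)

# Synthetic stand-in for the dataset; column names are illustrative assumptions.
set.seed(42)
n <- 2000
cards <- data.frame(limit_bal = round(runif(n, 10000, 800000), -4))
# Make default less likely at higher limits, purely for illustration.
cards$default <- rbinom(n, size = 1,
                        prob = plogis(-0.5 - cards$limit_bal / 400000))

# Bin the credit limit and compute the share of defaults in each bin.
cards$limit_bin <- cut(cards$limit_bal,
                       breaks = c(0, 50000, 100000, 200000, 500000, Inf),
                       labels = c("<=50k", "50k-100k", "100k-200k",
                                  "200k-500k", ">500k"))
rates <- aggregate(default ~ limit_bin, data = cards, FUN = mean)
names(rates)[2] <- "default_rate"

# Bar chart of default rate by credit-limit bin.
p <- ggplot(rates, aes(x = limit_bin, y = default_rate)) +
  geom_col() +
  labs(x = "Credit limit (NTD)", y = "Default rate")
print(p)
```

The same pattern (bin or group, then average the 0/1 default indicator) applies to education level or age group.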
Machine learning on amount of given credit in NT dollars against other factors.
There are 24 other factors that can be used to predict the amount of given credit. To avoid overfitting, I selected the most important factors using forward stepwise selection and used the Bayesian Information Criterion (BIC) as an indirect estimate of out-of-sample prediction error when choosing among the candidate models. For all but very small samples, BIC penalizes each additional variable more heavily than criteria such as AIC, so it is more effective at avoiding overfitting.
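A minimal sketch of forward stepwise selection under BIC, using base R's `step()` on synthetic data. Setting `k = log(n)` makes `step()` apply the BIC penalty instead of its default AIC penalty. The data and variable names (`x1` to `x4`, `y`) are made up for the example, and the original analysis may have used a different function for the selection.

```r
# Synthetic data: only x1 and x2 actually influence y.
set.seed(1)
n <- 500
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
df$y <- 3 * df$x1 - 2 * df$x2 + rnorm(n)

# Start from the intercept-only model and add one variable at a time.
null_model <- lm(y ~ 1, data = df)
selected <- step(null_model,
                 scope = list(lower = ~ 1, upper = ~ x1 + x2 + x3 + x4),
                 direction = "forward",
                 k = log(n),   # log(n) penalty per parameter gives BIC
                 trace = 0)    # suppress the step-by-step log

print(formula(selected))
print(BIC(selected))
```

At each step the variable that lowers the criterion the most is added, and the search stops when no addition improves it.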
Five algorithms and the libraries used:
Algorithms | Libraries |
---|---|
Linear regression | lm |
Regularized generalized linear models | glmnet |
Regression tree | tree |
Bagging Approach - Bootstrap aggregating | randomForest |
Boosting | gbm |
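The table lists `randomForest` as the library for bagging. That works because bagging is the special case of a random forest in which every predictor is considered at each split (`mtry` equal to the number of predictors). The sketch below shows this on synthetic data and compares it with linear regression by test-set mean squared error. The comparison metric is an assumption, since the text does not state how the models were ranked.

```r
library(randomForest)

# Synthetic data with a nonlinear term that a linear model cannot capture.
set.seed(1)
n <- 400
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
df$y <- 2 * df$x1 + df$x2^2 + rnorm(n)

# Hold out 30% of the rows for testing.
train_idx <- sample(n, size = round(0.7 * n))
train <- df[train_idx, ]
test <- df[-train_idx, ]

p <- ncol(train) - 1  # number of predictors
lm_fit <- lm(y ~ ., data = train)
# mtry = p means all predictors are candidates at every split, i.e. bagging.
bag_fit <- randomForest(y ~ ., data = train, mtry = p, ntree = 200)

# Mean squared error on the held-out rows.
mse <- function(fit) mean((test$y - predict(fit, newdata = test))^2)
test_mse <- c(lm = mse(lm_fit), bagging = mse(bag_fit))
print(test_mse)
```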
Machine learning on whether the payment defaults next month against other factors.
I then repeated the same process with whether a client defaults on the next month's payment as the response, modeled against the other factors.
Seven algorithms and the libraries used:
Algorithms | Libraries |
---|---|
Generalized linear model | glm |
Linear discriminant analysis - LDA | MASS |
Quadratic discriminant analysis - QDA | MASS |
K-nearest neighbors | class |
Generalized additive model | gam |
Classification tree | tree |
Regularized generalized linear models | glmnet |
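As an illustration of the first row of the table, the sketch below fits a logistic regression with `glm` and evaluates it on a held-out set. The data are synthetic, the variable names are made up, and the 0.5 classification threshold and the use of accuracy as the metric are assumptions for the example.

```r
# Synthetic binary outcome driven by two predictors.
set.seed(7)
n <- 1000
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$default <- rbinom(n, 1, plogis(-1 + 1.5 * df$x1 - 1 * df$x2))

# Hold out 30% of the rows for testing.
train_idx <- sample(n, size = round(0.7 * n))
train <- df[train_idx, ]
test <- df[-train_idx, ]

# Logistic regression: a generalized linear model with a binomial family.
glm_fit <- glm(default ~ ., data = train, family = binomial)

# Predicted default probabilities, thresholded at 0.5.
prob <- predict(glm_fit, newdata = test, type = "response")
pred <- as.integer(prob > 0.5)

confusion <- table(predicted = pred, actual = test$default)
accuracy <- mean(pred == test$default)
print(confusion)
print(accuracy)
```

The other classifiers in the table can be compared in the same way by swapping the fitting and prediction calls while keeping the train/test split fixed.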
- Defaults are more common among cards with a limit balance of about 10,000 NTD. This may mean that credit cards are issued too easily to people with low credit scores. The variance of the default rate is higher for limit balances over 500,000 NTD than for other ranges of limit balance.
- The default rate is lower for cardholders with a higher education level. Moreover, the default rate for clients aged over 60 was higher than for middle-aged and young clients.
- The best-fitting algorithm for predicting limit balance is the bagging approach.
- The best-fitting algorithm for predicting whether a client defaults next month is the classification tree.
Yeh, I. C., & Lien, C. H. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2), 2473-2480.