This notebook covers credit card fraud classification.
NOTE: The dataset is too large to be uploaded here, but you can get it from https://www.kaggle.com/mlg-ulb/creditcardfraud
It's split into 6 sections:
- Data preparation and interpretation
- Data preprocessing
- Exploratory data analysis
- Machine learning classification
- Neural network classification
- Conclusion
This README summarises the best-performing methods used in the notebook; the notebook itself covers more.
First the data is found to be extremely unbalanced like so:
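As a minimal sketch of how such an imbalance can be checked, the snippet below counts rows per class. The toy labels and the helper name `class_balance` are assumptions for illustration, not the notebook's code or the real dataset's counts.

```python
from collections import Counter

def class_balance(labels):
    """Return {label: (count, fraction)} for an iterable of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}

# Toy labels standing in for the class column (0 = genuine, 1 = fraud).
# These numbers are made up and are NOT the real dataset's counts.
toy_labels = [0] * 995 + [1] * 5
balance = class_balance(toy_labels)
for label, (n, frac) in sorted(balance.items()):
    print(f"class {label}: {n} rows ({frac:.2%})")
```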
The data is then balanced using SMOTE so that both classes have an equal number of samples.
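To illustrate the idea behind SMOTE, here is a simplified NumPy sketch of its core step: each synthetic row is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. The function `smote_sketch` and the toy data are hypothetical; the notebook most likely relies on a library implementation, which this sketch does not reproduce.

```python
import numpy as np

def smote_sketch(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)
    # Pairwise distances within the minority class only.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)            # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)      # random minority rows
    pick = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))               # position along the segment
    return X_minority[base] + gap * (X_minority[pick] - X_minority[base])

# Toy minority class: 10 "fraud" rows with 3 features (made-up data).
X_fraud = np.random.default_rng(1).normal(size=(10, 3))
X_new = smote_sketch(X_fraud, n_new=90)
print(X_new.shape)
```

Because every synthetic row is a blend of two existing fraud rows, the new points stay inside the region already occupied by the minority class.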
Balancing has a large effect on the correlation between the features and the class, as seen by comparing the top correlation plot to the last.
Next, outliers are removed from the fraudulent data points to improve model accuracy.
Before:
After:
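The README does not say which outlier rule the notebook uses. As one common option, the sketch below assumes an interquartile-range (IQR) fence computed on the fraud rows only; the function name `drop_fraud_outliers` and the `factor=1.5` default are illustrative assumptions.

```python
import numpy as np

def drop_fraud_outliers(X, y, column, factor=1.5):
    """Drop fraud rows (y == 1) whose value in `column` lies outside the IQR fence."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    fraud_values = X[y == 1, column]
    q1, q3 = np.percentile(fraud_values, [25, 75])
    lower = q1 - factor * (q3 - q1)
    upper = q3 + factor * (q3 - q1)
    # Only fraud rows are candidates for removal; genuine rows are always kept.
    is_outlier = (y == 1) & ((X[:, column] < lower) | (X[:, column] > upper))
    return X[~is_outlier], y[~is_outlier]

# Toy data: five fraud rows (one extreme) and two genuine rows.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0], [50.0], [60.0]])
y = np.array([1, 1, 1, 1, 1, 0, 0])
X_clean, y_clean = drop_fraud_outliers(X, y, column=0)
print(len(y), "->", len(y_clean))  # 7 -> 6
```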
Features of the data are then explored, starting with higher-order correlations.
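One simple way to look for a higher-order relationship is to compare the correlation of the class with a feature against its correlation with the squared (centred) feature. This is only a sketch under that assumption; the helper `linear_and_quadratic_corr` and the toy data are hypothetical and may differ from what the notebook does.

```python
import numpy as np

def linear_and_quadratic_corr(x, y):
    """Pearson correlation of y with x, and with x squared after centring x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    centred = x - x.mean()
    linear = np.corrcoef(centred, y)[0, 1]
    quadratic = np.corrcoef(centred ** 2, y)[0, 1]
    return linear, quadratic

# Toy feature where the positive class sits in both tails, so the
# linear correlation is near zero but the quadratic one is strong.
x = np.linspace(-3, 3, 201)
y = (np.abs(x) > 2).astype(int)
lin, quad = linear_and_quadratic_corr(x, y)
print(f"linear: {lin:.3f}, quadratic: {quad:.3f}")
```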
This reveals some quadratic relationships to be tested with models later on. Next, the data is reduced in dimension and plotted to see how well fraud and non-fraud cases separate.
The plots show that, while the classes are mostly separated, there is some overlap, which the SMOTE data magnifies.
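The README does not name the reduction technique. As an illustration only, the sketch below assumes PCA from scikit-learn and projects toy data onto two components; the resulting `embedding` could then be passed to a scatter plot coloured by class.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy data: two clusters standing in for genuine (0) and fraud (1) rows.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 5)),
               rng.normal(3.0, 1.0, size=(200, 5))])
y = np.array([0] * 200 + [1] * 200)

# Standardise first so no single feature dominates the components.
X_scaled = StandardScaler().fit_transform(X)
embedding = PCA(n_components=2).fit_transform(X_scaled)

# Distance between the class centres along the first component.
gap = abs(embedding[y == 1, 0].mean() - embedding[y == 0, 0].mean())
print(embedding.shape, round(float(gap), 2))
```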
NOTE: Since the data is now equally distributed between classes, the standard accuracy metric is perfectly acceptable (this is not the case for imbalanced datasets).
The data is then modelled. The best model, KNN, achieves 99.87% accuracy; its results are graphed and show that all fraudulent cases are correctly classified.
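A minimal sketch of fitting and scoring a KNN classifier with scikit-learn is shown below. The toy clusters stand in for the balanced data, and the split ratio and `n_neighbors=5` are assumptions rather than the notebook's settings.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Toy balanced data standing in for the SMOTE-balanced features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(300, 4)),   # "genuine" cluster
               rng.normal(4.0, 1.0, size=(300, 4))])  # "fraud" cluster
y = np.array([0] * 300 + [1] * 300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)  # scikit-learn's default k
knn.fit(X_train, y_train)
knn_accuracy = accuracy_score(y_test, knn.predict(X_test))
print(f"KNN test accuracy: {knn_accuracy:.4f}")
```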
As the best predictor, the KNN model is then optimised, increasing the accuracy to 99.96%.
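How the KNN model was optimised is not stated here. One common approach, sketched below as an assumption, is a cross-validated grid search over the number of neighbours and the weighting scheme; the parameter grid and toy data are illustrative.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Toy balanced data (made up) with moderately overlapping classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 4)),
               rng.normal(2.0, 1.0, size=(200, 4))])
y = np.array([0] * 200 + [1] * 200)

# Hypothetical search space; the notebook's grid may differ.
param_grid = {"n_neighbors": [1, 3, 5, 7], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```

`search.best_estimator_` then holds the refitted model with the best-scoring parameters.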
A neural network model is then created. Care was taken to make the model complex enough to distinguish the large and varied dataset produced, as I found underfitting easy to achieve. Unfortunately, even with a large model that takes 45 minutes to train on my limited computing power, I was only able to reach a 99.7% accuracy score, producing the following confusion matrix:
This is somewhat disappointing given the extra work that went into the neural net. The confusion matrix also shows that the classifier misses some of the fraudulent cases.
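To show how missed fraud cases appear in a confusion matrix, the sketch below uses made-up labels and predictions (not the notebook's results) and reads off the false negatives and the recall on the fraud class with scikit-learn.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up labels and predictions for illustration (1 = fraud).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# For binary labels, ravel() yields the cells in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = accuracy_score(y_true, y_pred)
fraud_recall = recall_score(y_true, y_pred)  # tp / (tp + fn)

print(f"missed fraud cases (false negatives): {fn}")
print(f"accuracy: {accuracy:.2f}, fraud recall: {fraud_recall:.2f}")
```

Accuracy and recall can diverge: a model can score well overall while still letting some fraud through, which is why the false-negative cell is worth checking.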
In conclusion, a well-optimised KNN approach proved to be by far the best predictor of credit card fraud on a large SMOTE-balanced dataset.
Check out the notebook to clear up any details or to make use of the implementations.