Credit-Card-Fraud-Classification

This notebook covers credit card fraud classification.

NOTE: The dataset is too large to be uploaded here, but you can get it from https://www.kaggle.com/mlg-ulb/creditcardfraud

It's split into 6 sections:

  1. Data preparation and interpretation
  2. Data preprocessing
  3. Exploratory data analysis
  4. Machine learning classification
  5. Neural network classification
  6. Conclusion

This README summarises the best-performing methods used in the notebook; the notebook itself covers more.

Data preparation and interpretation

First, the data is found to be extremely imbalanced, as the class counts show:

Alt text
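
A check like this can be sketched in a few lines of plain Python. The labels below are toy values standing in for the dataset's class column (assumed here to be 0 for genuine and 1 for fraud), not the real counts, and `class_balance` is a hypothetical helper name.

```python
from collections import Counter


def class_balance(labels):
    """Return {label: (count, share_of_total)} for a sequence of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}


# Toy labels only: 0 = genuine, 1 = fraud (assumed encoding).
toy_labels = [0] * 998 + [1] * 2
balance = class_balance(toy_labels)

for label, (n, share) in sorted(balance.items()):
    print(f"class {label}: {n} rows ({share:.2%})")
```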

Data preprocessing

The data is balanced using SMOTE (Synthetic Minority Over-sampling Technique), which generates synthetic fraud samples until both classes are the same size.

Alt text
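
To show the idea behind SMOTE, here is a simplified sketch in plain Python. It is not the implementation used in the notebook (which presumably relies on a library), and `smote_sketch`, its parameters and the toy points are all made up for illustration. Each synthetic point is placed at a random position on the line between a minority sample and one of its nearest minority neighbours.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy data: 20 majority points, 4 minority points (illustrative values only).
majority = [(float(i), float(i % 7)) for i in range(20)]
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (2.5, 2.5)]

new_points = smote_sketch(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

Because every synthetic point lies between two existing fraud samples, the oversampled class fills in the region the real fraud cases already occupy rather than spreading into new territory.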

Balancing has a strong effect on the correlation between each feature and the class, as comparing the top correlation plot with the last one shows.

Alt text

Next, outliers are removed from the fraudulent data points to improve model accuracy.

Before:

Alt text

After:

Alt text
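
This README doesn't state which outlier rule is used, so the sketch below assumes a common one: the 1.5 × IQR rule, applied to the fraud rows only. The function names, the `"Class"` key, the `"V14"` feature and the values are illustrative assumptions.

```python
import statistics


def iqr_bounds(values, factor=1.5):
    """Return (low, high) cut-offs using the interquartile-range rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr


def drop_fraud_outliers(rows, feature, factor=1.5):
    """Drop fraud rows (Class == 1) whose `feature` value falls outside the
    IQR bounds computed on fraud rows only. Genuine rows are kept as-is."""
    fraud_values = [r[feature] for r in rows if r["Class"] == 1]
    low, high = iqr_bounds(fraud_values, factor)
    return [r for r in rows if r["Class"] == 0 or low <= r[feature] <= high]


# Toy rows: one extreme fraud value (-30.0) and one extreme genuine value (25.0).
rows = [
    {"V14": -6.0, "Class": 1},
    {"V14": -5.5, "Class": 1},
    {"V14": -5.0, "Class": 1},
    {"V14": -4.5, "Class": 1},
    {"V14": -4.0, "Class": 1},
    {"V14": -30.0, "Class": 1},
    {"V14": 0.1, "Class": 0},
    {"V14": 25.0, "Class": 0},
]

cleaned = drop_fraud_outliers(rows, "V14")
print(len(rows), "->", len(cleaned))
```

Only the extreme fraud row is removed; the extreme genuine row stays, since the filter is restricted to the fraudulent class.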

Exploratory data analysis

Features of the data are explored, starting with higher-order correlations.

Alt text

This reveals some quadratic relationships, which are tested with models later on. Next, the data is reduced in dimensionality and plotted to see how well fraud and non-fraud cases separate.

Alt text

The plots show that, while the classes are mostly separated, there is some overlap, and the SMOTE data magnifies it.
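
One way to look for a quadratic relationship between a feature and the class is to correlate the squared feature with the label and compare that against the plain linear correlation. The sketch below does this with a hand-written Pearson correlation on made-up values. It is an assumption about how such a check could look, not the notebook's code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Toy feature: fraud (1) sits at both extremes, genuine (0) sits near zero.
feature = [-3.0, -2.5, -0.5, -0.2, 0.0, 0.3, 0.4, 2.6, 3.1]
label = [1, 1, 0, 0, 0, 0, 0, 1, 1]

linear = pearson(feature, label)
quadratic = pearson([x * x for x in feature], label)
print(f"linear: {linear:.2f}, quadratic: {quadratic:.2f}")
```

Here the linear correlation is close to zero while the squared feature correlates strongly with the class, which is the kind of pattern a purely linear correlation plot would hide.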

Machine learning classification

NOTE: Since the data is now equally distributed between classes, the standard accuracy metric is perfectly acceptable (this is not the case for imbalanced datasets).
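
A small worked example shows why this matters. The label counts below are invented for illustration, and the "classifier" simply predicts non-fraud for every row.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def predict_all_genuine(labels):
    """A useless model that labels every transaction as genuine (0)."""
    return [0] * len(labels)


# Toy label sets: 0 = genuine, 1 = fraud.
imbalanced = [0] * 998 + [1] * 2
balanced = [0] * 500 + [1] * 500

acc_imbalanced = accuracy(imbalanced, predict_all_genuine(imbalanced))
acc_balanced = accuracy(balanced, predict_all_genuine(balanced))
print(f"imbalanced: {acc_imbalanced:.1%}, balanced: {acc_balanced:.1%}")
```

On the imbalanced labels the useless model scores very high accuracy without catching a single fraud case, whereas on balanced labels it drops to chance level, so accuracy becomes informative again.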

The data is modelled, and the best performer, KNN, achieves 99.87% accuracy. The result is graphed below and looks good: all fraudulent cases are correctly classified.

Alt text

As the best predictor, the KNN model is then optimised, increasing the accuracy to 99.96%.

Alt text
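
This README doesn't say which KNN settings were tuned. As a rough sketch, assuming the number of neighbours `k` is the parameter being optimised, the following plain-Python example classifies by majority vote among the nearest training points and picks the `k` with the best accuracy on a held-out set. The data and all function names are illustrative.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, point, k):
    """Predict the majority label among the k training points nearest to `point`."""
    nearest = sorted(
        zip(train_X, train_y), key=lambda pair: math.dist(pair[0], point)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy_for_k(train_X, train_y, val_X, val_y, k):
    """Accuracy of k-nearest-neighbour predictions on a validation set."""
    preds = [knn_predict(train_X, train_y, x, k) for x in val_X]
    return sum(p == t for p, t in zip(preds, val_y)) / len(val_y)


def best_k(train_X, train_y, val_X, val_y, candidates):
    """Return (k with the highest validation accuracy, all scores)."""
    scores = {
        k: accuracy_for_k(train_X, train_y, val_X, val_y, k) for k in candidates
    }
    return max(scores, key=scores.get), scores


# Toy data: genuine (0) near the origin, fraud (1) near (5.5, 5.5),
# plus one noisy fraud-labelled point inside the genuine cluster.
train_X = [(0, 0), (0, 1), (1, 0), (1, 1),
           (5, 5), (5, 6), (6, 5), (6, 6),
           (0.5, 0.5)]
train_y = [0, 0, 0, 0, 1, 1, 1, 1, 1]
val_X = [(0.4, 0.6), (5.5, 5.5), (0.2, 0.2), (5.8, 5.2)]
val_y = [0, 1, 0, 1]

chosen_k, scores = best_k(train_X, train_y, val_X, val_y, candidates=[1, 3, 5])
print(chosen_k, scores)
```

With `k=1` the noisy point misleads one prediction; larger `k` values outvote it, which is why tuning `k` can lift accuracy.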

Neural network classification

A neural network model is then created. Care was taken to make the model complex enough to distinguish the large and varied dataset produced, as I found it easy to underfit. Unfortunately, even with a large model that takes 45 minutes to train on my limited computing power, I was only able to reach 99.7% accuracy, producing the following confusion matrix:

Alt text

Somewhat disappointing, given the extra work that went into the neural net. The confusion matrix also shows that the classifier misses some of the fraudulent cases.
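
For reference, the numbers in a binary confusion matrix and the share of fraud cases caught (recall) can be computed as in this sketch. The labels and predictions are toy values, not the network's output, and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 = fraud."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1:
            fn += 1  # fraud that the classifier missed
        elif pred == 1:
            fp += 1  # genuine transaction flagged as fraud
        else:
            tn += 1
    return tn, fp, fn, tp


def fraud_recall(y_true, y_pred):
    """Share of actual fraud cases that were predicted as fraud."""
    _, _, fn, tp = confusion_counts(y_true, y_pred)
    return tp / (tp + fn)


# Toy example: 6 genuine and 4 fraud cases, with one error of each kind.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
print(f"tn={tn} fp={fp} fn={fn} tp={tp} recall={fraud_recall(y_true, y_pred):.2f}")
```

The false-negative count is the figure to watch for fraud detection, since each one is a fraudulent transaction that slipped through.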

Conclusion

In conclusion, a well-optimised KNN approach proved to be by far the best predictor of credit card fraud on a large SMOTE-balanced dataset.

Check out the notebook to clear up any details or to make use of the implementations.