SFLending_Club_Loan_Charge_Off_Predictor

IMPORTANT NOTE:

If the notebook does not load on GitHub, it can alternatively be viewed via nbviewer at this link: https://nbviewer.jupyter.org/github/epiacentini/SFLending_Club_Loan_Charge_Off_Predictor/blob/main/Loan_Payoff_Neural_Net.ipynb

This project is an extensive dive into lending data collected by the SF Lending Club. The goal of the project was to build and train a model that can predict whether an individual is likely to fully pay their loan or have it charged off. An effective, accurate model can expedite the decision-making process of handing out loans. That process is normally done manually, which leaves room for inaccuracies in calculation and potential human error, and it is also much slower than a fully automated machine learning approach.

Outside of the model, this project also includes an exploration of the large dataset to find trends and correlations, along with cleaning and processing the dataset to make sure it is compatible with the machine learning model. Feature engineering is also a part of the work: some columns/features in the dataset might not mean anything in their current state, but with the help of Python and pandas they can be transformed into useful information, as in the sketch below.
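
As a minimal illustration of that kind of transformation (the `term` column and its exact string format are an assumption for illustration, not a claim about this repository's code):

```python
import pandas as pd

# Hypothetical example: a raw LendingClub-style 'term' column stores
# strings such as ' 36 months', which a model cannot use directly.
df = pd.DataFrame({'term': [' 36 months', ' 60 months', ' 36 months']})

# Transform the string into a plain integer number of months.
df['term'] = df['term'].apply(lambda s: int(s.split()[0]))

print(df['term'].tolist())  # [36, 60, 36]
```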

A large portion of the work for this project involved cleaning the data. Many of the features had missing values, and not all of them had easy fixes. The first problem I ran into was with the employment length and employment title features. There were thousands of distinct employment titles, and it would not be worth the time to create a dummy-variable column for each one, so the feature had to be removed. When I investigated the predictive power of the employment length feature, the loan charge-off rates were almost identical across its values, so there was no benefit to keeping it and it was also dropped from the dataframe. Several other features had to be dropped because they would add little to no value; examples are the grade feature and the title category. Unlike the case of employment title, most of the categorical variables could simply be converted to dummy variables as a numeric representation; examples include sub grade, zip code, verification status, application type, initial list status, and purpose.

One of the most troublesome features was mort_acc (mortgage accounts). Nearly 10% of the mort_acc values were missing, which would be a lot of rows to discard, so a better approach was to find a way to fill in the values. I could have simply filled in the average of all mort_acc entries, but there is a better way to go about it. Looking at the features most closely correlated with mort_acc reveals that it relates closely to the total_acc feature. Using this information, I found the average mort_acc for each value of total_acc (a column with no missing values); then, for every row with a missing mort_acc value, we can turn to that individual's total_acc value and use it to fill in a mort_acc value, as sketched below. Once all the data was prepped and selected, it was scaled using scikit-learn's pre-built scaler.
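
A minimal sketch of that imputation and scaling step, on a tiny synthetic stand-in for the real dataframe (the choice of MinMaxScaler is an assumption; the write-up only says a pre-built sklearn scaler was used):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Tiny synthetic stand-in for the real dataframe.
df = pd.DataFrame({
    'total_acc': [10, 10, 25, 25, 40],
    'mort_acc':  [1.0, np.nan, 3.0, np.nan, 4.0],
})

# Mean mort_acc observed for each total_acc value
# (total_acc itself has no missing entries).
total_acc_avg = df.groupby('total_acc')['mort_acc'].mean()

def fill_mort_acc(total_acc, mort_acc):
    # When mort_acc is missing, fall back to the average mort_acc
    # of rows sharing the same total_acc value.
    if pd.isna(mort_acc):
        return total_acc_avg[total_acc]
    return mort_acc

df['mort_acc'] = df.apply(
    lambda row: fill_mort_acc(row['total_acc'], row['mort_acc']), axis=1
)

# Once the features are prepped, scale them with a pre-built sklearn scaler.
scaler = MinMaxScaler()
X = scaler.fit_transform(df[['total_acc', 'mort_acc']])
```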

Lastly, the model itself was constructed using TensorFlow and Keras. The model has four hidden layers, with a size starting at 78 and halved at each subsequent layer; 78 matches the initial shape of the input. All of the hidden layers use the relu activation function, and the output layer uses a sigmoid (which works well for binary classification). The dropout layers randomly turn off 20% of the nodes, which helps avoid overfitting the training data. Regardless of the specified number of epochs, the EarlyStopping callback monitors the loss after each epoch and halts training if it notices the loss increasing; the patience parameter specifies how many loss increases must be observed before stopping. A sketch of this setup follows below. The model was trained on the data and reached approximately 90% accuracy. For good measure, I selected a random entry from the dataset and used the class-prediction functionality to see whether the model would recommend giving that individual a loan. The result was 1, which lined up with that individual's target value: they were in fact someone who was given a loan and paid it off.
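
A minimal sketch of that architecture, assuming layer widths halved with integer rounding (78 → 39 → 19 → 9), a patience of 25 epochs, validation loss as the monitored quantity, an illustrative epoch count, and random stand-in data; only the starting width of 78, the 20% dropout, and the sigmoid output are pinned down by the write-up:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

# Four hidden layers, starting at the input width (78) and roughly halving
# each time, each followed by 20% dropout to curb overfitting.
model = Sequential([
    Dense(78, activation='relu', input_shape=(78,)),
    Dropout(0.2),
    Dense(39, activation='relu'),
    Dropout(0.2),
    Dense(19, activation='relu'),
    Dropout(0.2),
    Dense(9, activation='relu'),
    Dropout(0.2),
    Dense(1, activation='sigmoid'),  # binary output: fully paid vs. charged off
])
model.compile(optimizer='adam', loss='binary_crossentropy')

# Halt training once the monitored loss stops improving; 'patience' is how
# many epochs of no improvement are tolerated before stopping.
early_stop = EarlyStopping(monitor='val_loss', patience=25)

# Hypothetical random data standing in for the real train/validation split.
X_train, y_train = np.random.rand(100, 78), np.random.randint(0, 2, 100)
X_val, y_val = np.random.rand(20, 78), np.random.randint(0, 2, 20)

model.fit(X_train, y_train,
          epochs=600,
          validation_data=(X_val, y_val),
          callbacks=[early_stop])

# Predicted class for a single row: threshold the sigmoid output at 0.5.
pred = (model.predict(X_val[:1]) > 0.5).astype(int)
```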

Programming Languages: Python

Tools/Libraries: TensorFlow, NumPy, Matplotlib, Seaborn, scikit-learn (sklearn), Jupyter Notebook
