Using sklearn's Random Forest library, I have created a model that predicts whether a user's credit score is 'Poor', 'Standard' or 'Good', using a dataset that I found on Kaggle.
This dataset was very messy, so a lot of cleaning needed to be done!
I realised that I only needed to read in the training dataset at first; I could then apply the same transformations to the test set when it was time to test the model.
To gain an overview of the dataset, I used four common commands:

- head: to visualise the first few rows of the dataset
- shape: to see how many rows and columns there are in the dataframe
- describe: to see the key statistics of each feature
- info: to see the data types of features and other details
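The four commands above can be sketched as follows. The tiny inline dataframe here is a hypothetical stand-in; the real Kaggle file would be loaded with pd.read_csv.

```python
import pandas as pd

# Hypothetical stand-in for the Kaggle training set
# (the real file would be loaded with pd.read_csv("train.csv")).
df = pd.DataFrame({
    "Age": ["23", "28_", "35"],  # note the stray underscore, as in the raw data
    "Annual_Income": [19114.12, 34847.84, 143162.64],
    "Credit_Score": ["Good", "Standard", "Poor"],
})

df.head()      # first few rows
df.shape       # (rows, columns)
df.describe()  # key statistics for numeric features
df.info()      # dtypes and non-null counts per column
```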
I also used .isna().sum() to count the number of NaN entries in each column. Columns that had loads of NaN entries were simply removed from my analysis, as they weren't of any use.
Below are the columns that I removed, along with my reasoning:
- ID, Customer_ID, Month, Name, SSN, Type_of_Loan: removed not due to data issues, but because these features are not good predictors of someone's credit score.
- Monthly_Inhand_Salary, Num_of_Delayed_Payment, Num_Credit_Inquiries, Credit_History_Age, Amount_invested_monthly, Monthly_Balance: removed as they contained NaN entries. It wasn't clear whether to interpret NaN as 0 or as missing data, so it wasn't fair to include them in training. I could've considered interpolation, but I wanted the training data to be as realistic as possible to get an accurate model.
- Payment_of_Min_Amount: removed as I didn't know how to format bool values for the Random Forest.
From a first glance, key issues that I noticed with entries in the CSV included underscores in numerical data, which were easily fixed using str.replace(). Rows with negative values were also removed, since negative values are not possible for these features, as were duplicate rows.
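A minimal sketch of that cleaning step, assuming the stray characters are underscores and that every listed column should be non-negative (the function name and column list are my own):

```python
import pandas as pd

def clean_numeric(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    out = df.copy()
    for col in cols:
        # strip stray underscores like "28_" before converting to numbers;
        # anything unparseable becomes NaN via errors="coerce"
        cleaned = out[col].astype(str).str.replace("_", "", regex=False)
        out[col] = pd.to_numeric(cleaned, errors="coerce")
    # drop impossible negative values, then exact duplicate rows
    out = out[(out[cols] >= 0).all(axis=1)].drop_duplicates()
    return out
```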
Another issue was pandas reading in columns with incorrect data types, so I had to manually define each dtype.
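Defining the dtypes at read time can be done by passing a mapping to pd.read_csv. The mapping below covers only a hypothetical subset of columns, and the CSV is inlined via StringIO so the snippet is self-contained:

```python
import io
import pandas as pd

# Hypothetical subset of the dtype mapping; the real file has many more columns.
dtypes = {"Age": "float64", "Annual_Income": "float64", "Credit_Score": "object"}

csv = io.StringIO(
    "Age,Annual_Income,Credit_Score\n"
    "23,19114.12,Good\n"
    "28,34847.84,Standard\n"
)

# dtype= stops pandas from guessing the types itself
df = pd.read_csv(csv, dtype=dtypes)
```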
An issue that I ran into using this library with this specific dataset is that non-numerical columns are not directly supported. Following some StackOverflow searching, I found that the usual solution to this problem is converting such columns to numeric labels using sklearn's LabelEncoder.
This was still causing issues (with OneHotEncoder too!), so I have removed categorical data for now.
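For reference, here is a minimal sketch of how LabelEncoder fits into the pipeline, using toy values rather than the real dataset. LabelEncoder is intended for the target column; for categorical feature columns, OneHotEncoder or OrdinalEncoder is usually the better fit:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Toy numeric features standing in for the cleaned dataset (values are made up).
X = pd.DataFrame({
    "Age": [23, 28, 35, 41],
    "Annual_Income": [19114.0, 34847.0, 143162.0, 52000.0],
})

# LabelEncoder maps the string targets to integers, sorted alphabetically.
le = LabelEncoder()
y = le.fit_transform(["Poor", "Standard", "Good", "Good"])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# Predictions come back as integers; inverse_transform recovers the labels.
labels = le.inverse_transform(clf.predict(X))
```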