
[ Fixes: #2 ] Training and Testing a Classification Model - Ensemble Method (Forests of Randomized Trees) on defaults.csv #53

Merged: 4 commits into mozilla:master, Mar 20, 2020

Conversation

iamarchisha (Contributor)

[ Fixes #2 ]
The ensemble method Forests of Randomized Trees from 'sklearn' has been implemented on defaults.csv.
It includes the following:

  • Exploratory data analysis

  • Data pre-processing

  • Hyper-parameter tuning of the model (done manually through experimentation)

  • Fitting and prediction on the train-test split

  • Computing evaluation metrics

Note: In pre-processing, KMeans clustering has also been implemented (for dimensionality reduction) to achieve better results; a rough sketch of the overall pipeline is given below.
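As an illustration only, here is a minimal sketch of the pipeline described above. It assumes the target column is named `default` and uses illustrative values for the split proportion, cluster count, and forest hyper-parameters; the actual choices in the notebook may differ.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the data; the target column name "default" is an assumption.
df = pd.read_csv("defaults.csv")
X = df.drop(columns=["default"])
y = df["default"]

# Pre-processing: scale the features and append a KMeans cluster label as an
# engineered feature (the notebook's exact clustering-based dimensionality
# reduction step may differ).
X_scaled = StandardScaler().fit_transform(X)
X["cluster"] = KMeans(n_clusters=4, random_state=0).fit_predict(X_scaled)

# Train-test split with an illustrative proportion.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Forests of Randomized Trees: a manually tuned RandomForestClassifier.
clf = RandomForestClassifier(n_estimators=200, max_depth=8, random_state=0)
clf.fit(X_train, y_train)

# Evaluation metrics, including the classification report shown below.
print(classification_report(y_test, clf.predict(X_test)))
```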

The classification report looks as follows:
[image: classification report]

pandas_profiling is an extra library that has been added to the environment; its main advantages are stated below, and a minimal usage example follows the list.

For each column the following statistics - if relevant for the column type - are presented in an interactive HTML report:

  1. Type inference: detect the types of columns in a dataframe.
  2. Essentials: type, unique values, missing values
  3. Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
  4. Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
  5. Most frequent values
  6. Histogram
  7. Correlations: highlighting of highly correlated variables; Spearman, Pearson and Kendall matrices
  8. Missing values: matrix, count, heatmap and dendrogram of missing values
  9. Text analysis: learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data
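For reference, generating such a report takes only a few lines; the report title and output file name below are illustrative:

```python
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv("defaults.csv")

# Build the interactive HTML report containing the statistics listed above.
profile = ProfileReport(df, title="defaults.csv profiling report")
profile.to_file("defaults_profile.html")
```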

@dzeber (Contributor) left a comment


Thanks for this excellent PR! Your notebook is nicely presented and well thought-through, and your module code is very clear. I was especially interested to see your approach to dimensionality reduction using clustering - that is a very promising approach. Great job!

That said, while your code is very well documented, I would really like to see more of the narrative describing your thought process and interpretations in the notebook. For example, you ran the detailed profile report on the dataset, which produced extensive output - how did this influence your approach to building the model? Why did you decide to do the dimensionality reduction step through K-means clustering? What is your interpretation of the final classification report?

Also, please add comments in the notebook motivating all of the parameter choices you used: train-test split proportion, cluster sizes, RF params, etc.

dev/archisha-chandel/defaults_modules.py (review comment: outdated, resolved)
@iamarchisha (Contributor, Author)

I have added details and the reasoning behind the models, functions and hyper-parameters used.

@iamarchisha iamarchisha requested a review from dzeber March 14, 2020 07:05
@mlopatka mlopatka merged commit 3734dd3 into mozilla:master Mar 20, 2020

Successfully merging this pull request may close these issues: [Outreachy applications] Startup task: Train and test a classification model.