
[ Fixes: #2 ] Training and Testing a Classification Model - Ensemble Method (Forests of Randomized Trees) on defaults.csv #53

Merged: 4 commits into mozilla:master, Mar 20, 2020

Conversation

iamarchisha (Contributor)

[ Fixes #2 ]
The ensemble method Forests of Randomized Trees from 'sklearn' has been implemented on defaults.csv.
It includes the following:

  • Exploratory data analysis

  • Data pre-processing

  • Hyper-parameter tuning of the model (done manually through experimentation)

  • Fitting and prediction on the train-test split

  • Computing evaluation metrics

Note: In pre-processing, KMeans clustering has also been implemented (for dimensionality reduction) to achieve better results; a rough sketch of the overall pipeline is given below.
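As an illustration only, here is a minimal sketch of the pipeline described above. It assumes the target column is named `default` and uses illustrative values for the split proportion, cluster count, and forest hyper-parameters; the actual choices in the notebook may differ.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the data; the target column name "default" is an assumption.
df = pd.read_csv("defaults.csv")
X = df.drop(columns=["default"])
y = df["default"]

# Pre-processing: scale the features and append a KMeans cluster label as an
# engineered feature (the notebook's exact clustering-based dimensionality
# reduction step may differ).
X_scaled = StandardScaler().fit_transform(X)
X["cluster"] = KMeans(n_clusters=4, random_state=0).fit_predict(X_scaled)

# Train-test split with an illustrative proportion.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Forests of Randomized Trees: a manually tuned RandomForestClassifier.
clf = RandomForestClassifier(n_estimators=200, max_depth=8, random_state=0)
clf.fit(X_train, y_train)

# Evaluation metrics, including the classification report shown below.
print(classification_report(y_test, clf.predict(X_test)))
```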

The classification report looks as follows:
[image: classification report]

pandas_profiling is an extra library that has been added to the environment; its main advantages are stated below, and a minimal usage example follows the list.

For each column the following statistics - if relevant for the column type - are presented in an interactive HTML report:

  1. Type inference: detect the types of columns in a dataframe.
  2. Essentials: type, unique values, missing values
  3. Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
  4. Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
  5. Most frequent values
  6. Histogram
  7. Correlations: highlighting of highly correlated variables; Spearman, Pearson and Kendall matrices
  8. Missing values: matrix, count, heatmap and dendrogram of missing values
  9. Text analysis: learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data
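For reference, generating such a report takes only a few lines; the report title and output file name below are illustrative:

```python
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv("defaults.csv")

# Build the interactive HTML report containing the statistics listed above.
profile = ProfileReport(df, title="defaults.csv profiling report")
profile.to_file("defaults_profile.html")
```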

@dzeber (Contributor) left a comment


Thanks for this excellent PR! Your notebook is nicely presented and well thought-through, and your module code is very clear. I was especially interested to see your approach to dimensionality reduction using clustering - that is a very promising approach. Great job!

That said, while your code is very well documented, I would really like to see more of the narrative describing your thought process and interpretations in the notebook. For example, you ran the detailed profile report on the dataset, which produced extensive output - how did this influence your approach to building the model? Why did you decide to do the dimensionality reduction step through K-means clustering? What is your interpretation of the final classification report?

Also, please add comments in the notebook motivating all of the parameter choices you used: train-test split proportion, cluster sizes, RF params, etc.

dev/archisha-chandel/defaults_modules.py (review comment: outdated, resolved)
@iamarchisha (Contributor, Author)

I have added details and the reasoning behind the models, functions and hyper-parameters used.

@iamarchisha iamarchisha requested a review from dzeber March 14, 2020 07:05
@mlopatka mlopatka merged commit 3734dd3 into mozilla:master Mar 20, 2020

Successfully merging this pull request may close these issues: [Outreachy applications] Startup task: Train and test a classification model.