[Outreachy applications] Traversal of the space of train/test splits #3

Closed
dzeber opened this issue Mar 4, 2020 · 9 comments · Fixed by #46

Comments

@dzeber
Contributor

dzeber commented Mar 4, 2020

Given a classification model, we want to investigate how much the performance score computed on the test set depends on the choice of train/test split proportion. For example, how would our performance estimate change if we used a 60/40 split rather than 80/20?

Write a function that takes a scikit-learn estimator and a dataset, and computes an evaluation metric over a grid of train/test split proportions from 0 to 100%. To assess variability, for each split proportion it should resplit and recompute the metric multiple times. It should output a table of splits with multiple metric values per split.
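
A minimal sketch of what such a function could look like is given here for illustration. It is not the implementation merged for this issue: the function name `traverse_splits`, its parameters (`test_sizes`, `n_repeats`, `metric`, `random_state`), the synthetic dataset, and the choice of accuracy as the metric are all assumptions. The grid also stops short of 0% and 100%, since either extreme leaves an empty training or test set.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def traverse_splits(estimator, X, y, test_sizes=None, n_repeats=10,
                    metric=accuracy_score, random_state=0):
    """Score `estimator` over a grid of test-set proportions (hypothetical helper).

    Returns a long-format DataFrame with one row per (test_size, repeat).
    """
    if test_sizes is None:
        # Assumed default grid: 10%, 20%, ..., 90% of the data held out for testing.
        test_sizes = np.round(np.linspace(0.1, 0.9, 9), 2)
    rng = np.random.RandomState(random_state)
    rows = []
    for test_size in test_sizes:
        for repeat in range(n_repeats):
            # Draw a fresh seed so every repeat gets a different random split.
            seed = rng.randint(0, 2**31 - 1)
            X_train, X_test, y_train, y_test = train_test_split(
                X, y, test_size=float(test_size), stratify=y, random_state=seed
            )
            # Clone so each fit starts from an unfitted copy of the estimator.
            model = clone(estimator).fit(X_train, y_train)
            score = metric(y_test, model.predict(X_test))
            rows.append(
                {"test_size": float(test_size), "repeat": repeat, "score": score}
            )
    return pd.DataFrame(rows)


# Usage on a synthetic binary classification problem (illustrative data only).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
table = traverse_splits(LogisticRegression(max_iter=1000), X, y, n_repeats=5)

# Mean and spread of the metric at each split proportion.
summary = table.groupby("test_size")["score"].agg(["mean", "std"])
print(summary)
```

The long format (one row per split and repeat) keeps every individual metric value, so variability at each proportion can be summarized or plotted afterwards without rerunning the fits.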

@Addi-11
Contributor

Addi-11 commented Mar 7, 2020

Hello, I would like to work on this issue.

@Addi-11
Contributor

Addi-11 commented Mar 7, 2020

[Image: splits]
Is this the requirement?

@Addi-11
Contributor

Addi-11 commented Mar 7, 2020

I am working on the tabulated form and including graphs too. Is anything else required?

Addi-11 added a commit to Addi-11/PRESC that referenced this issue Mar 7, 2020
@Addi-11
Contributor

Addi-11 commented Mar 7, 2020

I have submitted a PR regarding this issue; kindly review.

@shashigharti
Contributor

I will work on this issue

@dzeber
Contributor Author

dzeber commented Mar 11, 2020

@Addi-11 I saw your PR; we can discuss further there. Yes, the requirement is a function that returns the tabular form.

@asthad16
Contributor

I will work on this issue.

@alberginia
Collaborator

alberginia commented Mar 17, 2020

Hi! Yesterday, after my pull request, I realised that my solution for issue #2 actually also addresses this one. I have no experience with git, so I have no idea how to relate the two issues or how I should proceed so that the pull request is also connected to this issue.

mlopatka pushed a commit that referenced this issue Mar 20, 2020
* visual for eeg

* code restructured

* #3 data-split space mapped

* tabulated relation btw k and evaluation metrics

* gain-lift charts of models

* interprtation added
@mlopatka mlopatka reopened this Mar 20, 2020
mlopatka pushed a commit that referenced this issue Mar 20, 2020
* visual for eeg

* code restructured

* #3 data-split space mapped

* fixes issue3

* studied data splits for all classifiers

* added graph in the loop

* docstrings added

* validation sets added

* formatting

* evaluated all classifiers

* compared models

* result added

* calibration plot added

* docstrings
Bolaji61 added a commit to Bolaji61/PRESC that referenced this issue Mar 22, 2020
Bolaji61 added a commit to Bolaji61/PRESC that referenced this issue Mar 25, 2020
asthad16 referenced this issue in asthad16/PRESC Mar 25, 2020
These changes fix issue #3 (traversal of the space of train/test splits) using a KNN model. In #2 I used a decision tree and further recommended an outlier detection algorithm for classification, so in this PR I have used KNN and compared the results with the previous classification. This PR uses the modules already defined in #2.
asthad16 referenced this issue in asthad16/PRESC Mar 25, 2020
@asthad16
Contributor

I have worked on issue #3. Could you please review my PR #122?

asthad16 referenced this issue in asthad16/PRESC Mar 25, 2020
These changes fix issue #4 (traversal of the space of k-fold cross-validation). The hyperparameter-tuned model obtained in the PR for #3 is used as the KNN model, and k-fold as well as its stratified variant is used to evaluate the accuracy of the KNN classifier while varying the number of folds. The mean_score is used as the evaluation metric.
mlopatka pushed a commit that referenced this issue Mar 26, 2020
* visual for eeg

* code restructured

* #3 data-split space mapped

* fixes issue3

* studied data splits for all classifiers

* added graph in the loop

* docstrings added

* validation sets added

* formatting

* evaluated all classifiers

* compared models

* result added

* indexed

* removed plot-recall-curve

* learning-curve

* added models

* env refresh

* final estimate added

* black formats

* conclusion added
mlopatka added a commit that referenced this issue Mar 26, 2020
* visual for eeg

* code restructured

* #3 data-split space mapped

* tabulated relation btw k and evaluation metrics

* gain-lift charts of models

* auc-roc implemented

* fixes issue3

* studied data splits for all classifiers

* added graph in the loop

* docstrings added

* validation sets added

* formatting

* evaluated all classifiers

* compared models

* result added

* interprtation added

* docstring, interpretation added

* indexed

* removed plot-recall-curve

* shorten PR

* conflict resolve

Co-authored-by: mlopatka <[email protected]>
mlopatka added a commit that referenced this issue Mar 27, 2020
* Update .gitignore

* Preliminary Analysis

* Helper modules (Bar and Hist graph)

* Rough KNN algorithm implemented

* Delete libraries.py

* KNN classifier refactored and polished

Returns only variable of intests for use the metrics calculations.

* refactored for performance

just the required functions imported

* draft mlp classifier implemented

to be reviewed

* ...

* Threshold conversion logic implemented

Since knn.predict calculates a probability, we implement a logic for binary classification

* Prelimary cleaning and knn model classification implemented!

* Adjusted plor error with title placement

* ...

* Files reformated with 'Black'

* Logistic Regression classifier

* Refactores modules to improve modularity

* Implemented Log Reg

* Deleted mpl module to focus on knn and log reg

* Refactors gotignore to my personal folder

* refactored for readability

* Implementation to add counts and relative percentages on bars graph

* Refactored name #2, Completed Prelimary Analysis and Interpreted Results

* Update Issue #2 - Train and test a classification model (PRESC).ipynb

* Files reformated with 'Black'

* Display Error corrected

* Interpreted choice of hyper-parameters

* Refactored and Added Modules used for Issue 3

* Prelimanry Analysis - Traversal of the space of train_test splits

* Issue#3 complete

* Removed Issues #2 and #3 ipynb

* Issue #4 - completed

Issue #4 - Traversal of the space of cross-validation folds

* Delete defaults_data.csv

Removing duplication of the existing data set which can be loaded from the repos root directory.

Co-authored-by: mlopatka <[email protected]>
mlopatka pushed a commit that referenced this issue Mar 27, 2020
* fixes #8

* fixes #4, attempt 1

* updated missclassification graph and brokedown functions

* first attempt to fix # 3

* implemeneted all change requests

* formatted code for all helper files

* minor fix

* fixed code formatting issues and  removed extra file

* fixed code formatting, added docstring to func

* fixed relative path

* fixed all changes requested

* fixed relative path in notebook

* fixing conflict with some file changes

* fixing attempt last for conflicts
@mlopatka mlopatka reopened this Mar 27, 2020
@mlopatka mlopatka reopened this Mar 30, 2020
mlopatka pushed a commit that referenced this issue Mar 30, 2020
* visual for eeg

* code restructured

* #3 data-split space mapped

* fixes issue3

* studied data splits for all classifiers

* added graph in the loop

* docstrings added

* validation sets added

* formatting

* evaluated all classifiers

* compared models

* result added

* indexed

* removed plot-recall-curve

* env refresh

* final estimate added
dzeber pushed a commit that referenced this issue Apr 2, 2020
arizzogithub added a commit to arizzogithub/PRESC that referenced this issue Apr 3, 2020
arizzogithub added a commit to arizzogithub/PRESC that referenced this issue Apr 3, 2020
arizzogithub added a commit to arizzogithub/PRESC that referenced this issue Apr 3, 2020
arizzogithub added a commit to arizzogithub/PRESC that referenced this issue Apr 3, 2020
arizzogithub added a commit to arizzogithub/PRESC that referenced this issue Apr 3, 2020
arizzogithub added a commit to arizzogithub/PRESC that referenced this issue Apr 3, 2020
mlopatka pushed a commit that referenced this issue Apr 8, 2020
* #7 Visualization for misclassification

* Comparing test sample classifications between models

I compared the random forest and k nearest neighbors classifier models and used a barchart to visualize the classification of the test set

* added probability to misclasification visualization

* new misclassification visualization method used

* moved into misclassification_visualization folder

* moved to misclassification visualization folder

* Traversal of the space of train-test splits

* fixed file path and did better visualization

* Update #7 visualization for misclassifications.ipynb

* Update misclassification_function.py

* made changes to #7

* Delete Traversal of the space of train-test splits #3.ipynb

* Delete traversal_function.py

* Traversal of the space of train-test splits #3
mlopatka pushed a commit that referenced this issue Jul 13, 2020
* visual for eeg

* code restructured

* #3 data-split space mapped

* fixes issue3

* studied data splits for all classifiers

* added graph in the loop

* docstrings added

* validation sets added

* formatting

* evaluated all classifiers

* compared models

* result added

* indexed

* removed plot-recall-curve

* env refresh

* final estimate added

* method1

* method1-complete

* formats
@dzeber dzeber changed the title Traversal of the space of train/test splits [Outreachy applications] Traversal of the space of train/test splits Jul 13, 2020
@dzeber dzeber closed this as completed Jul 14, 2020
arizzogithub added a commit to arizzogithub/PRESC that referenced this issue Aug 27, 2020
arizzogithub added a commit to arizzogithub/PRESC that referenced this issue Aug 27, 2020
arizzogithub added a commit to arizzogithub/PRESC that referenced this issue Jul 3, 2022
Labels
None yet
Projects
None yet
6 participants