[ Fixes: #78 ] Covariate Data #80

Closed
wants to merge 74 commits into from

Conversation

iamarchisha
Contributor

@iamarchisha iamarchisha commented Mar 18, 2020

[ Fixes: #78, #3 ]

  • StratifiedKFold and RandomForestClassifier have been used to detect covariate shift in the data.
  • Performance is improved by assigning importance weights obtained through density ratio estimation.
  • Evaluation metric score before covariate analysis: 0.52
  • Evaluation metric score after covariate analysis: 0.96
  • False negatives can be reduced using "[ Fixes: #63 ] Learn from Misclassification" (#74)
    image
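
For readers without access to the notebook, here is a minimal sketch of how the two ideas in the list above are commonly combined. It is not the code from this PR: the function names `domain_probabilities` and `importance_weights`, the clipping constant `eps`, and the synthetic data are all assumptions made for illustration. A classifier is trained to tell training rows from test rows; an out-of-fold ROC AUC near 0.5 suggests little covariate shift, while a value near 1.0 suggests a strong one. The same probabilities give a density ratio estimate that can serve as per-row importance weights.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


def domain_probabilities(X_train, X_test, n_splits=5, seed=0):
    """Return (origin, proba): origin is 0 for train rows and 1 for test rows,
    proba is the out-of-fold estimate of P(row comes from the test set)."""
    X = np.vstack([X_train, X_test])
    origin = np.concatenate(
        [np.zeros(len(X_train), dtype=int), np.ones(len(X_test), dtype=int)]
    )
    proba = np.zeros(len(X))
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, score_idx in folds.split(X, origin):
        # Hypothetical settings; the PR's actual hyperparameters are not shown.
        clf = RandomForestClassifier(n_estimators=100, random_state=seed)
        clf.fit(X[fit_idx], origin[fit_idx])
        proba[score_idx] = clf.predict_proba(X[score_idx])[:, 1]
    return origin, proba


def importance_weights(origin, proba, eps=1e-3):
    """Density ratio p_test(x) / p_train(x) for each training row.

    By Bayes' rule the ratio equals P(test | x) / P(train | x) scaled by
    n_train / n_test. Probabilities are clipped to avoid division by zero.
    """
    p = np.clip(proba[origin == 0], eps, 1 - eps)
    n_train = int((origin == 0).sum())
    n_test = int((origin == 1).sum())
    return (p / (1 - p)) * (n_train / n_test)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_train = rng.normal(0.0, 1.0, size=(300, 2))  # synthetic, for illustration
    X_test = rng.normal(2.0, 1.0, size=(300, 2))   # shifted mean
    origin, proba = domain_probabilities(X_train, X_test)
    print("domain classifier AUC:", roc_auc_score(origin, proba))
    print("mean training weight:", importance_weights(origin, proba).mean())
```

The resulting weights could then be passed as `sample_weight` when fitting the downstream model on the training rows, for estimators whose `fit` method accepts that argument.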

KaairaGupta and others added 30 commits March 7, 2020 12:44
Updating KaairaGupta/master
1.  Added SVM classifier with outlier removal and hyperparameter tuning
2. Notebook is reused from issue#2
3. Added code to run multiple test/train splits and test the accuracy of the model
1. Increased the number of loops for test-train cycle
1. Increased the number of loops for test-train cycle
1. Updated test_train_split function
2. Updated function call in python notebook
1. Fixed Transformation Code
2. Added warnings
Addi-11 and others added 9 commits March 25, 2020 18:57
* visual for eeg

* code restructured

* mozilla#3 data-split space mapped

* fixes issue3

* studied data splits for all classifiers

* added graph in the loop

* docstrings added

* validation sets added

* formatting

* evaluated all classifiers

* compared models

* result added

* indexed

* removed plot-recall-curve

* learning-curve

* added models

* env refresh

* final estimate added

* black formats

* conclusion added
* visual for eeg

* code restructured

* mozilla#3 data-split space mapped

* tabulated relation btw k and evaluation metrics

* gain-lift charts of models

* auc-roc implemented

* fixes issue3

* studied data splits for all classifiers

* added graph in the loop

* docstrings added

* validation sets added

* formatting

* evaluated all classifiers

* compared models

* result added

* interpretation added

* docstring, interpretation added

* indexed

* removed plot-recall-curve

* shorten PR

* conflict resolve

Co-authored-by: mlopatka <[email protected]>
* Outreachy startup task
1. Added module with all functions to import from Jupyter notebook.
2. Added Jupyter Notebook with outputs.

* made the dataset path a variable argument for the function

* Details about confusion matrix

* Added functions to see scatterplot and explained confusion matrix and kernel method.

* Removed redundant imports and cells.

* Used standard scaler to standardize data

* minor changes

* minor changes

* Added relative paths

* Added the dataset in the repo and python black formatting

* Changed path to relative path in repo
* Data Loaded from vehicles.csv

* Data visualization and training model with different algorithms

* Evaluation of model is done.

* Changed model from Logistic Regression to Support Vector Machine

At first I tried three different models (Logistic Regression, Support Vector Machine and Decision Tree), and the overall accuracy with LR was better than the others, but after changing the validation parameters in SVM classification, model accuracy increased from 82% to 88%.

* Delete train and test model-checkpoint.ipynb

* Changed file named.

* all python modules were added

* docstrings were added

* labels added in confusion matrix

* Histogram colors were changed into single color

* solved histogram issue

* Update modules.py

* changes made in histogram

* Update modules.py

* sorted histogram

* Update modules.py

* labels were added for confusion matrix

* Python Custom Modules were added

* Update Vehicle_Classifier.ipynb

* Update modules.py

* Update modules.py

* Update modules.py

* added labels in confusion matrix

* Update modules.py

* Update modules.py

* Update modules.py

* Update modules.py

* Update modules.py

* Requested Changes were made

* Update modules.py

* change categorical data into numerical data

* change arguments in LR model

* Update modules.py

* Update modules.py

* Update modules.py

* Update modules.py

* Update modules.py

* Update modules.py

* Update modules.py

* Update modules.py

* Update modules.py

* Update modules.py

* Update modules.py

* Update modules.py

* Update modules.py

* Update modules.py

* Update modules.py

* Update modules.py

* Update modules.py

* Update modules.py

* Update modules.py

* Requested changes were made

* Update modules.py

* Update modules.py

* Update modules.py

* Update Vehicle_Classifier.ipynb

* updates file

* Delete Untitled.ipynb

* Update modules.py

* code formatted using python Black

* Requested changes were made

* shifted classifier's code from modules.py to ModelEvaluation.py

* removed learning curves from file

* added function for model evaluation in Model Evaluation file

* updated svm and lr

* added comments

* added doc strings

* Update ModelEvaluation.py

* Update ModelEvaluation.py

* Update Vehicle_Classifier.ipynb

* Update modules.py

* added descriptions

* added interpretations of visualization

* creating another branch from master

* solving branching issues

* main visualization file is added

* Added module for visualization of misclassification

* added docstring and reformatted to python black

* added interpretation of misclassification

* removed unnecessary comments

* changed name of file from VehicleClassifier to TrainTestSplit_Traversal

* custom module file of Train_Test_plit_Traversal is added

* version 2  of train_test_split_traversal is added with some changes plus code is reformatted with python black

* changed plot labels

* removed error

* Update TrainTest_Split_Traversal.py

* minor changes were done

* changed file names and custom module file of CrossValidationFold_Traversal is added

* added docstrings and resolved errors

* all files are moved to folder

* added docstring and file reformatted to python black

* added comments
…ozilla#32)

* 1. Simple scatter plot
2. Violin and Box plots

* Added plots for visualising misclassifications for Logistic regression and SVM classification.

* Revert "1. Simple scatter plot"

This reverts commit cb26342.

* Removed redundant commits and updated notebook according to start up task

* Changed to csv file from the repo and updated notebook acc to PR mozilla#22

* Refactored code in module and added python black formatting

* 1. Added ROC curves with AUC
2. Formatting and refactoring.

* Minor changes
* Create Readme.md

* Create files for exploring issue mozilla#2

* Format using black

* Remove notebook from master

* Increase modularization

* create file for issue 6

* remove file added by mistake

* Create notebook for issue 6

* Re-upload to the right folder

* Delete file from the incorrect folder
* initial commit

* updated notebook template to get started

* added comment in notebook

* Added notebook for issue#5 - calibration plots

1. Added module to plot calibration curve
2. Notebook to read data and display the calibration plot

* fixed formatting
Contributor

@mlopatka mlopatka left a comment

I have reviewed the rendered version of the notebook on GitHub, and the progress looks good.
While I appreciate you taking the initiative to use another dataset, this PR needs to be refactored before it can be merged into master.

The *.zip file is still included in the PR despite its addition to the .gitignore. If you want to work with this dataset, please submit a PR that adds the extracted CSV to the repo's main dataset directory, and then load it from that source as with the other datasets.

iamarchisha and others added 14 commits March 27, 2020 14:33
* adds .ipynb and .py

* updates code by formatting
* Data Loaded from vehicles.csv

* Data visualization and training model with different algorithms

* Evaluation of model is done.

* Changed model from Logistic Regression to Support Vector Machine

At first I tried three different models (Logistic Regression, Support Vector Machine and Decision Tree), and the overall accuracy with LR was better than the others, but after changing the validation parameters in SVM classification, model accuracy increased from 82% to 88%.

* Delete train and test model-checkpoint.ipynb

* Changed file named.

* all python modules were added

* docstrings were added

* labels added in confusion matrix

* Histogram colors were changed into single color

* solved histogram issue

* Update modules.py

* changes made in histogram

* Update modules.py

* sorted histogram

* Update modules.py

* labels were added for confusion matrix

* Python Custom Modules were added

* Update Vehicle_Classifier.ipynb

* Update modules.py

* Update modules.py

* Update modules.py

* added labels in confusion matrix

* Update modules.py

* Update modules.py

* Update modules.py

* Update modules.py

* Update modules.py

* Requested Changes were made

* Update modules.py

* change categorical data into numerical data

* change arguments in LR model

* Update modules.py

* Update modules.py

* Update modules.py

* Update modules.py

* Update modules.py

* Update modules.py

* Update modules.py

* Update modules.py

* Update modules.py

* Update modules.py

* Update modules.py

* Update modules.py

* Update modules.py

* Update modules.py

* Update modules.py

* Update modules.py

* Update modules.py

* Update modules.py

* Update modules.py

* Requested changes were made

* Update modules.py

* Update modules.py

* Update modules.py

* Update Vehicle_Classifier.ipynb

* updates file

* Delete Untitled.ipynb

* Update modules.py

* code formatted using python Black

* Requested changes were made

* shifted classifier's code from modules.py to ModelEvaluation.py

* removed learning curves from file

* added function for model evaluation in Model Evaluation file

* updated svm and lr

* added comments

* added doc strings

* Update ModelEvaluation.py

* Update ModelEvaluation.py

* Update Vehicle_Classifier.ipynb

* Update modules.py

* added descriptions

* added interpretations of visualization

* creating another branch from master

* solving branching issues

* main visualization file is added

* Added module for visualization of misclassification

* added docstring and reformatted to python black

* added interpretation of misclassification

* removed unnecessary comments

* changed name of file from VehicleClassifier to TrainTestSplit_Traversal

* custom module file of Train_Test_plit_Traversal is added

* version 2  of train_test_split_traversal is added with some changes plus code is reformatted with python black

* changed plot labels

* removed error

* Update TrainTest_Split_Traversal.py

* minor changes were done

* changed file names and custom module file of CrossValidationFold_Traversal is added

* added docstrings and resolved errors

* all files are moved to folder

* added docstring and file reformatted to python black

* added comments

* added calibrationplot .py

* added docstrings

* Update Calibration plot.ipynb
* Update .gitignore

* Preliminary Analysis

* Helper modules (Bar and Hist graph)

* Rough KNN algorithm implemented

* Delete libraries.py

* KNN classifier refactored and polished

Returns only the variables of interest for use in the metrics calculations.

* refactored for performance

just the required functions imported

* draft mlp classifier implemented

to be reviewed

* ...

* Threshold conversion logic implemented

Since knn.predict calculates a probability, we implement a logic for binary classification

* Preliminary cleaning and knn model classification implemented!

* Adjusted plot error with title placement

* ...

* Files reformatted with 'Black'

* Logistic Regression classifier

* Refactored modules to improve modularity

* Implemented Log Reg

* Deleted mpl module to focus on knn and log reg

* Refactors gitignore to my personal folder

* refactored for readability

* Implementation to add counts and relative percentages on bars graph

* Refactored name mozilla#2, Completed Preliminary Analysis and Interpreted Results

* Update Issue mozilla#2 - Train and test a classification model (PRESC).ipynb

* Files reformatted with 'Black'

* Display Error corrected

* Interpreted choice of hyper-parameters

* Refactored and Added Modules used for Issue 3

* Preliminary Analysis - Traversal of the space of train_test splits

* Issue#3 complete

* Removed Issues mozilla#2 and mozilla#3 ipynb

* Issue mozilla#4 - completed

Issue mozilla#4 - Traversal of the space of cross-validation folds

* Delete defaults_data.csv

Removing duplication of the existing data set which can be loaded from the repos root directory.

Co-authored-by: mlopatka <[email protected]>
…aset (mozilla#92)

* Classification model wine.csv

* Classification model wine.csv

* Merging modifications
mozilla#2 Dropped quality, shifted the logic to python file, shifted imports to the top, added confusion_matrix and classification_report
Adding logistic regression for winequality.csv
)

* fixes mozilla#8

* fixes mozilla#4, attempt 1

* updated misclassification graph and broke down functions

* first attempt to fix # 3

* implemented all change requests

* formatted code for all helper files

* minor fix

* fixed code formatting issues and  removed extra file

* fixed code formatting, added docstring to func

* fixed relative path

* fixed all changes requested

* fixed relative path in notebook

* fixing conflict with some file changes

* fixing attempt last for conflicts
* fixes mozilla#8

* fixes mozilla#4, attempt 1

* first attempt to fix # 3

* implemented all change requests

* formatted code for all helper files

* minor fix

* fixed code formatting, added docstring to func

* first attempt on 63

* fixing conflicts

Co-authored-by: mlopatka <[email protected]>
…ear Model (Stochastic Gradient Descent) on winequality.csv (mozilla#58)

* adds incomplete files

* adds .ipynb, .py and updates environment.yml

* Delete winequality.ipynb

removing duplicate files

* Delete winequality_modules.py

removing duplicate files

* Delete winequality.ipynb

removing incomplete files

* Delete winequality_modules.py

removing incomplete files

* adds .ipynb, .py and updates environment.yml

* adds description and detailed reasoning for the methods, models and parameters used

* drops quality column

* updates .py file

* adds files in a new folder

* updates .yml
…mozilla#111)

* WIP: mozilla#2 on the dataset 'eeg.csv'

WIP: mozilla#2 on the dataset 'eeg.csv'

* Add files via upload

* Delete WIP: mozilla#2 on the dataset 'eeg.csv'

* Delete mozilla#2  Train and test a classification model, eeg.csv-checkpoint.ipynb

* WIP: mozilla#2 on the dataset 'eeg.csv'

* Delete mozilla#2  Train and test a classification model, eeg.csv-checkpoint.ipynb

* WIP:  mozilla#2 Train and test a classification model, eeg.csv dataset

* Delete mozilla#2  Train and test a classification model, eeg.csv.ipynb

* Create README

* WIP: mozilla#2 Train and test a classification model, eeg.csv dataset

* Delete README
issue#3:traversal-of-the-space-of-train-test-splits
For mozilla#2: on the dataset 'winequality.csv'
@iamarchisha
Contributor Author

I was trying to push LFS files to the branch and made a few mistakes in the process. I tried but could not find a way to revert the changes. Is it okay if I close this PR and open a new one? Or is there something else that could help solve the problem?

@iamarchisha iamarchisha deleted the covariate branch March 29, 2020 12:18
@mlopatka
Contributor

@archisha-chandel can you link to the new PR in a comment?

@iamarchisha
Contributor Author

The new PR #136 resolves issue #78.

Development

Successfully merging this pull request may close these issues.

[Outreachy applications] Covariate Shift