updating fork #1

Merged
merged 82 commits into from
Mar 25, 2020
82 commits
9ae1c5f
First attempt on vehicle data with a random forest calssifier
Sidrah-Madiha Mar 7, 2020
4514f16
Classification of wine quality with random forest
BBimie Mar 7, 2020
f320c86
minor changes
Sidrah-Madiha Mar 8, 2020
9e3aeba
Comparative model evaluation for vehicle dataset
Sidrah-Madiha Mar 8, 2020
a7acad0
first attempt for implementing task 7
Sidrah-Madiha Mar 8, 2020
d231eac
First attempt in building a model for the dataset
Bolaji61 Mar 9, 2020
9b49719
Added .idea to gitignore
shashigharti Mar 9, 2020
83736c1
Added jupyter notebook
shashigharti Mar 9, 2020
24ef0c7
fixes # 7
Sidrah-Madiha Mar 10, 2020
09167bb
final changes to evaluator
Sidrah-Madiha Mar 10, 2020
a0027a7
added interpretation
Sidrah-Madiha Mar 11, 2020
aaf570f
Used SVM for Training Vehicles Dataset
shashigharti Mar 12, 2020
f5d97f2
Added comment for GridSearchCV
shashigharti Mar 12, 2020
0374650
Added docblock for functions
shashigharti Mar 12, 2020
2db06b8
updated the accuracy rate in jupyter notebook
shashigharti Mar 12, 2020
ef65308
Merge branch 'master' into shashigharti/issue-2
shashigharti Mar 12, 2020
d37c4d3
formatted code with black and rearranged folder structure removing ex…
Sidrah-Madiha Mar 13, 2020
03aba8d
displayed graph of all models
Sidrah-Madiha Mar 13, 2020
5078a4b
Changed the logic to apply transformation to test and train data
shashigharti Mar 13, 2020
55581b8
Changed the logic to apply transformation to test and train data
shashigharti Mar 13, 2020
d424ee9
fixed pairplot image visibility
shashigharti Mar 13, 2020
ee1492d
Changed scaler code
shashigharti Mar 13, 2020
ce0e958
Removed .DS_Store from repository
Bolaji61 Mar 14, 2020
c97f1b5
Creating a modules.py file & updating to pass black formatting check
Bolaji61 Mar 14, 2020
bda4b44
Updating my copy of the repository
Bolaji61 Mar 14, 2020
a1317b1
Delete .DS_Store
Bolaji61 Mar 14, 2020
6aa3dda
Moved the models used in my code to a separate modules.py file
Bolaji61 Mar 15, 2020
a4f4d6d
Added two conservative function for outlier removal
shashigharti Mar 15, 2020
2ff83e8
Fixed comment in notebook
shashigharti Mar 16, 2020
152c21d
fixed transformation code in helpers.py
shashigharti Mar 16, 2020
83b6dd5
fixed transformation code in helpers.py
shashigharti Mar 16, 2020
a395a71
fixed transformation code in helpers.py
shashigharti Mar 16, 2020
bb7f3e9
fixed all changes requested
Sidrah-Madiha Mar 16, 2020
3fa549b
fixed all changes requested
Sidrah-Madiha Mar 16, 2020
d9955d0
issue #02 - Train and test on classification of vehicle dataset with …
shiza16 Mar 16, 2020
bd53913
First attempt on vehicle dataset with a random forest classifier (#13)
Sidrah-Madiha Mar 16, 2020
88e0246
WIP: EDA and SVM (#34)
Clare-Joyce Mar 16, 2020
1fc8996
Proper documentation and explanation of my codes
Bolaji61 Mar 17, 2020
29c7c07
Prediction of the wine quality with Random forest adjusted
BBimie Mar 18, 2020
dc30f9c
Fixes #8: Added method and example to show importance of various data…
KaairaGupta Mar 18, 2020
2818730
Issue #7 - Visualization of missclassifications using a redefinition …
alberginia Mar 19, 2020
a1d3d4e
KNN model trained and tested on generated.csv dataset (#28)
tab1tha Mar 19, 2020
5e29796
Merge pull request #15 from BBimie/master
dzeber Mar 19, 2020
adabd94
path fixed for pd.read_csv
Sidrah-Madiha Mar 20, 2020
48211a9
Evaluating the Performance of LogisticRegression model
Bolaji61 Mar 20, 2020
2f85565
Updating my copy of the repo
Bolaji61 Mar 20, 2020
730e359
WIP: Trained and tested classification models for Defaults dataset (#56)
dzekem Mar 20, 2020
3ba024c
Lift-Gain Charts for classification models (#39)
Addi-11 Mar 20, 2020
bd792a4
Updated the README.md file (#61)
janvi04 Mar 20, 2020
1218953
fixes #3 (#45)
Addi-11 Mar 20, 2020
3a27b6a
feature implementation for fix # 4 (#38)
Sidrah-Madiha Mar 20, 2020
3734dd3
[ Fixes: #2 ] Training and Testing a Classification Model- Ensemble M…
iamarchisha Mar 20, 2020
d4c7757
[ Fixes: #7 ] Visualizing Misclassification for Binary Target (#59)
iamarchisha Mar 20, 2020
6543432
For #2 : Logistic Regresssion on winequaliy.csv (#37)
SanchiMittal Mar 20, 2020
089104f
WIP EDA and a simple modeling (#66)
hammedb197 Mar 20, 2020
26680d9
Fixed formatting
shashigharti Mar 20, 2020
dde72c1
Train test ratio (#43)
tab1tha Mar 20, 2020
2bd51a5
Importance score of a data point (#75)
tab1tha Mar 20, 2020
66f4c05
For #5 : Calibration plot (#35)
KaairaGupta Mar 20, 2020
82ac58d
The effect of number of folds on the cross_validated average performa…
tab1tha Mar 20, 2020
c19e6e8
Merge pull request #23 from Sidrah-Madiha/visualization_for_misclassi…
dzeber Mar 20, 2020
29d61bd
adds .ipynb and .py for KFold CV (#89)
iamarchisha Mar 20, 2020
fb0c69c
#5 Calibration Plots (#69)
Addi-11 Mar 20, 2020
e75320b
For #63: Learning from misclassification (#64)
KaairaGupta Mar 20, 2020
343efaf
Merge pull request #30 from Bolaji61/master
dzeber Mar 20, 2020
9845486
Merge pull request #51 from shashigharti/shashigharti/issue-2
dzeber Mar 20, 2020
ec67936
issue #4 - Traversal of the space of cross-validation folds (#68)
shashigharti Mar 20, 2020
5316977
Issue#7 visualization for misclassifications (#73)
shiza16 Mar 20, 2020
8b8bdcc
[ Fixes: #63 ]Learn from Misclassification (#74)
iamarchisha Mar 20, 2020
ee63bcc
Issue #2 - Tools for exploratory analysis of datasets and to decide o…
alberginia Mar 20, 2020
95ab903
Calibration plots for classifiers (#50)
Soniyanayak51 Mar 20, 2020
e238d61
Niti kaur2 (#83)
NitiKaur Mar 20, 2020
24d8133
fixed relative path in notebook
Sidrah-Madiha Mar 21, 2020
982d282
Merge branch 'Comparative_Models_vehicle_dataset' of https://github.c…
dzeber Mar 21, 2020
cf7637a
Merge branch 'Sidrah-Madiha-Comparative_Models_vehicle_dataset'
dzeber Mar 21, 2020
9f898ea
Update .gitignore
mlopatka Mar 21, 2020
d743eca
Delete .gitignore
mlopatka Mar 21, 2020
f298eca
Add note on project closure to README
dzeber Mar 23, 2020
f36b34f
[ Fixes: #103 ] Adding Black as a pre-commit hook (#104)
iamarchisha Mar 24, 2020
b882029
for #6 : Visualise evaluation metric (#27)
KaairaGupta Mar 24, 2020
30dbbae
Data sensitivity feature, fixes # 8 (#31)
Sidrah-Madiha Mar 25, 2020
a186ae6
Calibration plot [fixes # 5] (#110)
Sidrah-Madiha Mar 25, 2020
8 changes: 8 additions & 0 deletions .gitignore
@@ -1,5 +1,7 @@
# Byte-compiled / optimized / DLL files
__pycache__/
.idea/

*.py[cod]
*$py.class

@@ -109,6 +111,8 @@ venv/
ENV/
env.bak/
venv.bak/
.vs
.vscode

# Spyder project settings
.spyderproject
@@ -128,5 +132,9 @@ dmypy.json
# Pyre type checker
.pyre/

# Phpstorm
.idea

# Mac crap
.DS_Store

40 changes: 38 additions & 2 deletions README.md
@@ -96,8 +96,21 @@ other contributions at this point, unless to resolve errors or typos.
Code formatting should strictly adhere to [Python Black](https://pypi.org/project/black/) formatting guidelines. Please ensure that all PRs pass a local black formatting check.




## Information for Outreachy participants

__Please note that this project is currently closed to new Outreachy
contributions.__

- At this time, we are only considering Outreachy candidates who have submitted
a PR on or before _Friday March 20_.
- If you have submitted a PR by this date, you may continue working on existing
PRs or create new ones as usual. All your contributions will be considered.
- If you have not yet submitted a PR by this date, we will unfortunately not be
able to consider you as an Outreachy candidate for this round.


This project is intentionally broadly scoped, and the initial phase will be
exploratory.

@@ -140,24 +153,47 @@ Contributions can be made by submitting a [pull request](https://help.github.com
request review. This tag ('work in progress') indicates that the PR is not
ready to be merged. When it is ready for final submission, you can modify the
title to remove the "WIP:" tag.
- Should you use a separate jupyter notebook for comparing different models? If
you had a PR merged in to satisfy issue #2 already and are now comparing
models for another issue, then a new notebook would be helpful. That being
said, a notebook should satisfy the following criteria:

a) it should run beginning to end without error

b) it should be easy to follow and have a clear narrative presenting context,
data, results, and interpretation. This may mean some redundancy in code, but
will often mean that your notebook is much more helpful to other people
looking at it in isolation (including reviewers).


## Getting started

1. Install [Anaconda](https://www.anaconda.com/download) or [Miniconda](https://conda.io/miniconda.html).

2. Setup and activate environment:
2. Fork this repository and clone it to your local machine (using the git CLI).

3. Setup and activate environment:

```
$ conda env create -f environment.yml
$ conda activate presc
```

3. Run Jupyter. The notebook will open in your browser at `localhost:8888` by default.

__For Windows:__ Open anaconda prompt and `cd` into the folder where you cloned the repository

```
cd PRESC
```
then type the above commands to activate the environment.


4. Run Jupyter. The notebook will open in your browser at `localhost:8888` by default.

```
$ jupyter notebook
```
After running this command, you will see the notebook containing the datasets and can start working with it.

We recommend everyone start by working on
[#2](https://github.com/mozilla/PRESC/issues/2).
330 changes: 330 additions & 0 deletions dev/Addi-11/calibration.ipynb

Large diffs are not rendered by default.

45 changes: 45 additions & 0 deletions dev/Addi-11/calibration.py
@@ -0,0 +1,45 @@
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
import matplotlib.pyplot as plt
from sklearn.metrics import brier_score_loss


def calibration(clf, x_train, y_train, x_val, y_val):
    """
    Plot calibration curves for a binary classification model.

    Parameters:
        clf : classification model (CalibratedClassifierCV refits clones of it)
        x_train : array-like, shape (n_train_samples, n_features)
        y_train : array-like of length n_train_samples
        x_val : array-like, shape (n_val_samples, n_features)
        y_val : array-like of length n_val_samples

    Returns:
        None
    """

    methods = ["sigmoid", "isotonic"]

    fop = {}  # fraction of positives per bin
    apv = {}  # average predicted probability per bin
    clf_score = {}  # Brier score per calibration method
    for i, method in enumerate(methods):
        calibrated_model = CalibratedClassifierCV(clf, method=method, cv=5)
        calibrated_model.fit(x_train, y_train)

        # Probability of the positive class on the validation set
        y_score = calibrated_model.predict_proba(x_val)[:, 1]
        # predict_proba already returns values in [0, 1], so no normalization is needed
        fop[i], apv[i] = calibration_curve(y_val, y_score, n_bins=10)

        clf_score[i] = brier_score_loss(y_val, y_score, pos_label=1)

    plt.figure(figsize=(10, 6))
    plt.plot([0, 1], [0, 1])  # diagonal = perfectly calibrated model
    plt.plot(apv[0], fop[0], label="Sigmoid (Brier loss={:.3f})".format(clf_score[0]))
    plt.plot(apv[1], fop[1], label="Isotonic (Brier loss={:.3f})".format(clf_score[1]))
    plt.grid()
    plt.xlabel("Average Probability")
    plt.ylabel("Fraction of Positives")
    plt.title("Calibration Plots")
    plt.legend()
    plt.show()
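
Reviewer note: the calibration logic above can be exercised without the plotting step. The following is a minimal sketch, not part of this PR, that assumes a synthetic dataset from `make_classification` and a `LogisticRegression` base model; the helper name `calibration_summary` is hypothetical.

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split


def calibration_summary(clf, x_train, y_train, x_val, y_val, n_bins=10):
    """Hypothetical helper: per-method calibration curve and Brier score."""
    summary = {}
    for method in ("sigmoid", "isotonic"):
        # cv=5 refits clones of clf on folds of the training data
        calibrated = CalibratedClassifierCV(clf, method=method, cv=5)
        calibrated.fit(x_train, y_train)
        y_score = calibrated.predict_proba(x_val)[:, 1]
        frac_pos, mean_pred = calibration_curve(y_val, y_score, n_bins=n_bins)
        summary[method] = (frac_pos, mean_pred, brier_score_loss(y_val, y_score))
    return summary


# Synthetic binary data (an assumption; the PR uses the project's own datasets)
x, y = make_classification(n_samples=500, n_features=8, random_state=0)
x_train, x_val, y_train, y_val = train_test_split(
    x, y, test_size=0.3, random_state=0
)
summary = calibration_summary(
    LogisticRegression(max_iter=1000), x_train, y_train, x_val, y_val
)
for method, (frac_pos, mean_pred, brier) in summary.items():
    print(f"{method}: Brier loss = {brier:.3f}, non-empty bins = {len(frac_pos)}")
```

Returning the curves instead of plotting them makes the function easier to test, and the caller can still plot `mean_pred` against `frac_pos` as the PR does.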
151 changes: 151 additions & 0 deletions dev/Addi-11/classifiers.py
@@ -0,0 +1,151 @@
# This file contains various classifiers to be used on the dataset
from evaluation import evaluate
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

# Note: plot_precision_recall_curve was removed in scikit-learn 1.2;
# newer versions provide PrecisionRecallDisplay.from_estimator instead.
from sklearn.metrics import plot_precision_recall_curve, confusion_matrix
import matplotlib.pyplot as plt


class Classifier:
    """
    This class contains different classification models which can be trained on the dataset.
    """

    def svm_classifier(self, x_train, y_train):
        """
        Support Vector Machine classifier.

        Parameters :
            x_train : array-like, shape (n_samples, n_features)
            y_train : array-like of length n_samples

        Returns :
            classifier : trained classification model
        """
        classifier = SVC(gamma="auto")
        classifier.fit(x_train, y_train)
        return classifier

    def KNeighbors(self, x_train, y_train):
        """
        K-Nearest Neighbours is a supervised classifier that labels a new point
        according to the labels of the training points closest to it.

        Parameters :
            x_train : array-like, shape (n_samples, n_features)
            y_train : array-like of length n_samples

        Returns :
            classifier : trained classification model
        """
        classifier = KNeighborsClassifier()
        classifier.fit(x_train, y_train)
        return classifier

    def Logistic_Reg(self, x_train, y_train):
        """
        Logistic Regression estimates the probability of the outcome by applying
        the sigmoid (logistic) function to a linear combination of the inputs.

        Parameters :
            x_train : array-like, shape (n_samples, n_features)
            y_train : array-like of length n_samples

        Returns :
            classifier : trained classification model
        """
        classifier = LogisticRegression()
        classifier.fit(x_train, y_train)
        return classifier

    def Decision_Tree(self, x_train, y_train):
        """
        Decision Tree Classifier, which makes a decision by splitting the inputs
        through a sequence of smaller decisions.

        Parameters :
            x_train : array-like, shape (n_samples, n_features)
            y_train : array-like of length n_samples

        Returns :
            classifier : trained classification model
        """
        classifier = DecisionTreeClassifier()
        classifier.fit(x_train, y_train)
        return classifier

    def Random_Forest(self, x_train, y_train):
        """
        Random Forest Classifier, an ensemble of decision trees that are each
        trained with some randomness; the prediction combines the trees' votes.

        Parameters :
            x_train : array-like, shape (n_samples, n_features)
            y_train : array-like of length n_samples

        Returns :
            classifier : trained classification model
        """
        classifier = RandomForestClassifier()
        classifier.fit(x_train, y_train)
        return classifier

    def Gaussian(self, x_train, y_train):
        """
        Gaussian Naive Bayes, a classification technique based on Bayes' Theorem
        with an assumption of independence among predictors. It is easy to build
        and particularly useful for very large data sets.

        Parameters :
            x_train : array-like, shape (n_samples, n_features)
            y_train : array-like of length n_samples

        Returns :
            classifier : trained classification model
        """
        classifier = GaussianNB()
        classifier.fit(x_train, y_train)
        return classifier

    def evaluation(self, classifier, x_val, y_val):
        """
        Evaluate the trained model on the validation set, using:
            Accuracy
            Precision
            Recall
            Precision Recall Curve
            F1_score
            Confusion Matrix

        Parameters :
            classifier : trained classification model
            x_val : array-like, shape (n_samples, n_features)
            y_val : array-like of length n_samples

        Returns :
            None
        """
        accuracy, precision, recall, f_score, y_score = evaluate(classifier, x_val, y_val)
        print("Accuracy : ", accuracy)
        print("Precision: ", precision)
        print("Recall: ", recall)
        print("F1 score : ", f_score)

        # Plotting Precision Recall Curve
        print("Precision vs Recall Curve")
        disp = plot_precision_recall_curve(classifier, x_val, y_val)

        # Plotting Confusion Matrix
        # (y_score is passed as the predicted labels returned by evaluate())
        print("Confusion Matrix")
        labels = ["Class 1", "Class 2"]
        cm = confusion_matrix(y_val, y_score)
        fig = plt.figure()
        ax = fig.add_subplot(111)
        cax = ax.matshow(cm)
        plt.title("Confusion matrix of the classifier")
        fig.colorbar(cax)
        ax.set_xticklabels([""] + labels)
        ax.set_yticklabels([""] + labels)
        plt.xlabel("Predicted")
        plt.ylabel("True")
        plt.show()
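
Reviewer note: the `evaluation` module imported above is not shown in this diff, so the exact behaviour of `evaluate` is unknown here. As a rough sketch of what such a helper might compute, the stand-in below (the name `evaluate_sketch`, the synthetic data, and the choice of `LogisticRegression` are all assumptions) returns the same five values that `Classifier.evaluation` unpacks.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)
from sklearn.model_selection import train_test_split


def evaluate_sketch(classifier, x_val, y_val):
    """Hypothetical stand-in for evaluate(): metrics plus predicted labels."""
    y_pred = classifier.predict(x_val)
    return (
        accuracy_score(y_val, y_pred),
        precision_score(y_val, y_pred),
        recall_score(y_val, y_pred),
        f1_score(y_val, y_pred),
        y_pred,
    )


# Synthetic binary data, used only to make the sketch runnable
x, y = make_classification(n_samples=300, n_features=6, random_state=1)
x_train, x_val, y_train, y_val = train_test_split(
    x, y, test_size=0.25, random_state=1
)
clf = LogisticRegression(max_iter=1000).fit(x_train, y_train)

accuracy, precision, recall, f_score, y_pred = evaluate_sketch(clf, x_val, y_val)
cm = confusion_matrix(y_val, y_pred)  # rows: true class, columns: predicted class
print("Accuracy :", accuracy)
print("Confusion matrix:")
print(cm)
```

The confusion matrix needs predicted labels rather than probabilities, which is why the sketch returns `y_pred` as its last value.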




49 changes: 49 additions & 0 deletions dev/Addi-11/data_split_examine.py
@@ -0,0 +1,49 @@
# This file compares various evaluation metrics for different data splits

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display
from sklearn.model_selection import train_test_split
from dataloader import get_x_y
from evaluation import evaluate
from classifiers import Classifier


test_sizes = np.arange(0.005, 1, 0.05)
columns = ["Training data", "Testing Data", "Accuracy %", "Precision", "Recall", "F1_score"]
df = pd.DataFrame(columns=columns)


def data_split_examine(clf):
    """
    Calculate evaluation metrics (F1 score, accuracy, precision, recall) for various test data sizes.

    Parameters:
        clf : name of the Classifier method used to train the model (string)

    Returns:
        None
    """
    model = Classifier()
    for index in range(len(test_sizes)):
        x, y = get_x_y()
        x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=test_sizes[index])
        # Look up the training method by name and train on this split
        classifier = getattr(model, clf)(x_train, y_train)
        accuracy, precision, recall, f_score, _ = evaluate(classifier, x_test, y_test)
        train = round((1 - test_sizes[index]) * 100)
        test = round(test_sizes[index] * 100)
        df.loc[index + 1] = [train, test, accuracy * 100, precision, recall, f_score]

    display(df)


def visualise_split(clf):
    """
    Visualise the relation between data splits and evaluation metrics by plotting accuracy against test data size.
    """
    fig, axes = plt.subplots()
    axes.set_xlabel("Test Data Size")
    axes.set_ylabel("Accuracy %")
    axes.set_ylim([50, 100])
    axes.set_title("Relation between accuracy and test data size for {} classifier".format(clf))
    disp = axes.plot(df["Testing Data"], df["Accuracy %"])
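
Reviewer note: the test-size sweep above depends on the project's `dataloader`, `evaluation`, and `classifiers` modules. A self-contained sketch of the same idea is given below, assuming synthetic data and a `KNeighborsClassifier`; the function name `sweep_test_sizes` and the fixed random seed are illustrative choices, not part of the PR.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def sweep_test_sizes(x, y, test_sizes, seed=0):
    """Hypothetical helper: accuracy of a KNN model for each test-set fraction."""
    rows = []
    for test_size in test_sizes:
        # A fixed seed and stratified split keep runs comparable across sizes
        x_train, x_test, y_train, y_test = train_test_split(
            x, y, test_size=float(test_size), random_state=seed, stratify=y
        )
        model = KNeighborsClassifier().fit(x_train, y_train)
        accuracy = accuracy_score(y_test, model.predict(x_test))
        rows.append(
            {
                "Testing Data": int(round(float(test_size) * 100)),
                "Accuracy %": accuracy * 100,
            }
        )
    return pd.DataFrame(rows)


# Synthetic binary data (an assumption; the PR loads data via get_x_y())
x, y = make_classification(n_samples=400, n_features=8, random_state=0)
results = sweep_test_sizes(x, y, np.arange(0.1, 1.0, 0.2))
print(results)
```

Building the frame inside the function, rather than writing to a module-level `df`, keeps results from different classifiers separate. The sketch also starts the sweep at a 10% test size, since very small test fractions leave only a handful of samples to score.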