
Data sensitivity feature, fixes #8 (#31)

Merged: 9 commits, Mar 25, 2020

Conversation

Sidrah-Madiha (Contributor)

This branch fixes #8:

  • It contains a helper file "helper_for_senstivity_calculation.py" with a function "calculate_senstivity" that returns the L1 sensitivity of the dataset and a list of performance metrics for all parallel datasets.
  • I have also added a function to display the most sensitive datapoint as well as its index in the training dataset.
  • I have tested it on the random forest classifier I built for the vehicles.csv dataset; please see the demo in the Jupyter notebook "Importance score for dataset training samples" > vehicles_dataset_classifer_v1.ipynb.

@dzeber I was not sure whether the performance metric should be accuracy or the model's score method, so I implemented it with score and commented out the accuracy-based code. Please advise on next steps to improve this fix.
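
For context, here is a rough sketch of the leave-one-out sensitivity idea described above, assuming a scikit-learn-style model and numpy arrays; the function and variable names are illustrative and not the PR's exact code:

import numpy as np

def leave_one_out_sensitivity(model, X_train, y_train, X_test, y_test):
    # Score the model trained on the full training set.
    model.fit(X_train, y_train)
    full_score = model.score(X_test, y_test)

    # For each "parallel" dataset (one training sample removed),
    # retrain and record how much the test score changes.
    distances = []
    for i in range(len(X_train)):
        X_par = np.delete(X_train, i, axis=0)
        y_par = np.delete(y_train, i, axis=0)
        model.fit(X_par, y_par)
        par_score = model.score(X_test, y_test)
        distances.append(abs(full_score - par_score))

    # L1 sensitivity is the largest change caused by removing any one sample.
    return max(distances), distances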

@dzeber (Contributor) left a comment

Nice work! Your PR provides an elegant solution for this task. I'm requesting changes primarily because I think a visualization would be more appropriate for the final list, as commented below, but please consider the other recommendations.

Your calculate_senstivity() function allows you to pass in the model; what would be even better is to also pass in a scoring function (which can default to accuracy).

Also, please remove the other unrelated files from the PR.
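
One possible shape for that change, assuming the helper fits the model itself (the function and parameter names here are illustrative, not the PR's actual signature):

from sklearn.metrics import accuracy_score

def evaluate_model_score(model, X_train, y_train, X_test, y_test, scoring=accuracy_score):
    # Fit the model, then evaluate it with whatever scoring function the
    # caller supplies; plain accuracy is the default.
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return scoring(y_test, y_pred)

A caller could then pass any metric with a (y_true, y_pred) signature, for example a wrapped sklearn.metrics.f1_score, without touching the helper.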

model_func.fit(X_train, y_train)
# y_pred = model_func.predict(X_test)
# return metrics.accuracy_score(y_test, y_pred)
# I am not sure whether accuracy or score is the right metric for the sensitivity calculation, so I commented out the accuracy code
Contributor

I'm pretty sure these do the same thing. I think model_func.score() is just a shortcut.
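
For reference, scikit-learn classifiers define score() as mean accuracy on the given test data, so the two approaches should agree; a quick check, assuming clf is an already fitted classifier:

from sklearn.metrics import accuracy_score

# For classifiers, .score() returns mean accuracy, so these match exactly.
assert clf.score(X_test, y_test) == accuracy_score(y_test, clf.predict(X_test))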

print("printing model performance score for all parallel datasets", list_scores)
Contributor

Could you show a visualization instead of the full list?

list_train_data_label = create_all_parallel_dbs(X_train, y_train)  # returns a list of (training data, label) tuples
# print(len(list_train_data_label))
for parallel_train, parallel_label in list_train_data_label:
    parallel_dataset_score = model_performance_evaluater_score(model_func, parallel_train, parallel_label, X_test, y_test)
Contributor

Rather than precomputing all the parallel DBs, why not call create_parallel_db() inside this loop and compute the score right away for each one? This will probably be easier for larger datasets.
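
A sketch of the suggested refactor, generating one parallel dataset per iteration instead of materializing them all up front (create_parallel_db's exact signature is assumed here; the index argument is illustrative):

list_scores = []
for i in range(len(X_train)):
    # Build one parallel dataset at a time so only a single copy of the
    # training data is held per iteration.
    parallel_train, parallel_label = create_parallel_db(X_train, y_train, i)
    score = model_performance_evaluater_score(model_func, parallel_train, parallel_label, X_test, y_test)
    list_scores.append(score)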

# print(len(list_train_data_label))
for parallel_train, parallel_label in list_train_data_label:
    parallel_dataset_score = model_performance_evaluater_score(model_func, parallel_train, parallel_label, X_test, y_test)
    dataset_distance = np.abs(full_dataset_score - parallel_dataset_score)  # L1 sensitivity
Contributor

I'm thinking it would be interesting to compute the distance without the absolute value. It would probably be valuable to know if certain training samples actually improve the model fit when being left out.
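
In code the change is just dropping np.abs, so the sign says whether leaving a sample out helped or hurt (same variable names as the snippet above):

# Positive: removing this sample hurt the test score (it was helpful).
# Negative: removing it improved the score (it may be noisy or mislabeled).
dataset_distance = full_dataset_score - parallel_dataset_score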

@mlopatka (Contributor)

@Sidrah-Madiha please let us know if you intend to continue with this PR, or if this work has been integrated into another PR and this one can be closed out.

@Sidrah-Madiha (Contributor, Author)

@mlopatka, yes, I am working on fixing it today.

@Sidrah-Madiha (Contributor, Author) commented Mar 17, 2020

@mlopatka please accept my PR for this issue; I have addressed all the requested changes:

  • reformatted the helper file code with black
  • added a visualization of the full list of scores for each parallel DB (see attached image: sens2)
  • addressed the suggestion to call create_parallel_db() inside the loop and compute each score right away, rather than precomputing all the parallel DBs (see attached image: sens1 for the refactored code)
  • removed the absolute value from the sensitivity distance calculation (also shown in the image above)

@Sidrah-Madiha (Contributor, Author)

Hi @dzeber, please merge this PR; I have addressed all the requested changes.

@Sidrah-Madiha (Contributor, Author) commented Mar 22, 2020

Hi @dzeber @mlopatka, I have further improved this fix:

  • added a scoring parameter to "model_performance_evaluater_score" with "accuracy_score" as the default
  • fixed the relative path issue in the notebook
  • added a marker in the visualization to indicate the most sensitive datapoint and made the labels more descriptive (see attached image: sensitive)

Please review
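
For readers without the attached images, a minimal matplotlib sketch of that kind of plot, assuming list_scores and full_dataset_score come from the sensitivity loop (variable names and styling are illustrative):

import numpy as np
import matplotlib.pyplot as plt

# Score change caused by leaving each training sample out.
distances = np.array([full_dataset_score - s for s in list_scores])
most_sensitive = int(np.argmax(np.abs(distances)))

plt.plot(distances, marker=".", linestyle="none", label="score change per left-out sample")
plt.scatter([most_sensitive], [distances[most_sensitive]], color="red",
            label=f"most sensitive sample (index {most_sensitive})")
plt.xlabel("training sample index")
plt.ylabel("change in test score when sample is removed")
plt.legend()
plt.show()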

@mlopatka (Contributor) left a comment

Thank you for addressing all of the outstanding feedback in this PR.
I know it was hard to resolve all the merge conflicts, but thank you for sticking with it.

@mlopatka merged commit 30dbbae into mozilla:master on Mar 25, 2020

Successfully merging this pull request may close these issues.

[Outreachy applications] Importance score for dataset training samples