
Data sensitivity feature, fixes #8 (#31)

Merged: 9 commits, Mar 25, 2020

Conversation

Sidrah-Madiha (Contributor)

This branch fixes #8:

  • It contains a helper file "helper_for_senstivity_calculation.py" with a function "calculate_senstivity" that returns the L1 sensitivity of the dataset and a list of performance metrics for all parallel datasets.
  • I have also added a function to display the most sensitive datapoint as well as its index in the training dataset.
  • I have tested it on the random forest classifier I built for the vehicles.csv dataset; please see the demo in the Jupyter notebook "Importance score for dataset training samples" > vehicles_dataset_classifer_v1.ipynb.

@dzeber I was not sure whether the performance metric should be accuracy or the model's score method, so I implemented it with score and commented out the accuracy-based code. Please advise on next steps to improve this fix.
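
For context, here is a rough sketch of the leave-one-out sensitivity idea described above, assuming a scikit-learn-style model and numpy arrays; the function and variable names are illustrative and not the PR's exact code:

import numpy as np

def leave_one_out_sensitivity(model, X_train, y_train, X_test, y_test):
    # Score the model trained on the full training set.
    model.fit(X_train, y_train)
    full_score = model.score(X_test, y_test)

    # For each "parallel" dataset (one training sample removed),
    # retrain and record how much the test score changes.
    distances = []
    for i in range(len(X_train)):
        X_par = np.delete(X_train, i, axis=0)
        y_par = np.delete(y_train, i, axis=0)
        model.fit(X_par, y_par)
        par_score = model.score(X_test, y_test)
        distances.append(abs(full_score - par_score))

    # L1 sensitivity is the largest change caused by removing any one sample.
    return max(distances), distances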

@dzeber (Contributor) left a comment

Nice work! Your PR provides an elegant solution for this task. I'm requesting changes primarily because I think a visualization would be more appropriate for the final list, as commented below, but please consider the other recommendations.

Your calculate_senstivity() function allows you to pass in the model; what would be even better is to also pass in a scoring function (which can default to accuracy).

Also, please remove the other unrelated files from the PR.
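
One possible shape for that change, assuming the helper fits the model itself (the function and parameter names here are illustrative, not the PR's actual signature):

from sklearn.metrics import accuracy_score

def evaluate_model_score(model, X_train, y_train, X_test, y_test, scoring=accuracy_score):
    # Fit the model, then evaluate it with whatever scoring function the
    # caller supplies; plain accuracy is the default.
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return scoring(y_test, y_pred)

A caller could then pass any metric with a (y_true, y_pred) signature, for example a wrapped sklearn.metrics.f1_score, without touching the helper.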

model_func.fit(X_train, y_train)
# y_pred = model_func.predict(X_test)
# return metrics.accuracy_score(y_test, y_pred)
# I am not sure whether accuracy or score is the right metric for the sensitivity calculation, so I commented out the accuracy code
Contributor

I'm pretty sure these do the same thing. I think model_func.score() is just a shortcut.
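
For reference, scikit-learn classifiers define score() as mean accuracy on the given test data, so the two approaches should agree; a quick check, assuming clf is an already fitted classifier:

from sklearn.metrics import accuracy_score

# For classifiers, .score() returns mean accuracy, so these match exactly.
assert clf.score(X_test, y_test) == accuracy_score(y_test, clf.predict(X_test))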

print("printing model performance score for all parallel datasets", list_scores)
Contributor

Could you show a visualization instead of the full list?

list_train_data_label = create_all_parallel_dbs(X_train, y_train)  # returns a list of (training data, label) tuples
# print(len(list_train_data_label))
for parallel_train, parallel_label in list_train_data_label:
    parallel_dataset_score = model_performance_evaluater_score(model_func, parallel_train, parallel_label, X_test, y_test)
Contributor

Rather than precomputing all the parallel DBs, why not call create_parallel_db() inside this loop and compute the score right away for each one? This will probably be easier for larger datasets.
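
A sketch of the suggested refactor, generating one parallel dataset per iteration instead of materializing them all up front (create_parallel_db's exact signature is assumed here; the index argument is illustrative):

list_scores = []
for i in range(len(X_train)):
    # Build one parallel dataset at a time so only a single copy of the
    # training data is held per iteration.
    parallel_train, parallel_label = create_parallel_db(X_train, y_train, i)
    score = model_performance_evaluater_score(model_func, parallel_train, parallel_label, X_test, y_test)
    list_scores.append(score)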

# print(len(list_train_data_label))
for parallel_train, parallel_label in list_train_data_label:
    parallel_dataset_score = model_performance_evaluater_score(model_func, parallel_train, parallel_label, X_test, y_test)
    dataset_distance = np.abs(full_dataset_score - parallel_dataset_score)  # L1 sensitivity
Contributor

I'm thinking it would be interesting to compute the distance without the absolute value. It would probably be valuable to know if certain training samples actually improve the model fit when being left out.
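
In code the change is just dropping np.abs, so the sign says whether leaving a sample out helped or hurt (same variable names as the snippet above):

# Positive: removing this sample hurt the test score (it was helpful).
# Negative: removing it improved the score (it may be noisy or mislabeled).
dataset_distance = full_dataset_score - parallel_dataset_score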

@mlopatka (Contributor)

@Sidrah-Madiha please let us know if you intend to continue with this PR, or if this work has been integrated into another PR and this one can be closed out.

@Sidrah-Madiha (Contributor, Author)

@mlopatka, yes, I am working on fixing it today.

@Sidrah-Madiha (Contributor, Author) commented Mar 17, 2020

@mlopatka please accept my PR for this issue; I have addressed all the requested changes:

  • reformatted the helper file code with black
  • added a visualization of the full list of scores for each parallel DB (see attached image: sens2)
  • addressed the suggestion to call create_parallel_db() inside the loop and compute each score right away, rather than precomputing all the parallel DBs (see attached image: sens1 for the refactored code)
  • removed the absolute value from the sensitivity distance calculation (also shown in the image above)

@Sidrah-Madiha (Contributor, Author)

Hi @dzeber, please merge this PR; I have addressed all the requested changes.

@Sidrah-Madiha (Contributor, Author) commented Mar 22, 2020

Hi @dzeber @mlopatka, I have further improved this fix:

  • added a scoring parameter to "model_performance_evaluater_score" with "accuracy_score" as the default
  • fixed the relative path issue in the notebook
  • added a marker in the visualization to indicate the most sensitive datapoint and made the labels more descriptive (see attached image: sensitive)

Please review
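
For readers without the attached images, a minimal matplotlib sketch of that kind of plot, assuming list_scores and full_dataset_score come from the sensitivity loop (variable names and styling are illustrative):

import numpy as np
import matplotlib.pyplot as plt

# Score change caused by leaving each training sample out.
distances = np.array([full_dataset_score - s for s in list_scores])
most_sensitive = int(np.argmax(np.abs(distances)))

plt.plot(distances, marker=".", linestyle="none", label="score change per left-out sample")
plt.scatter([most_sensitive], [distances[most_sensitive]], color="red",
            label=f"most sensitive sample (index {most_sensitive})")
plt.xlabel("training sample index")
plt.ylabel("change in test score when sample is removed")
plt.legend()
plt.show()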

@mlopatka (Contributor) left a comment

Thank you for addressing all of the outstanding feedback in this PR.
I know it was hard to resolve all the merge conflicts, but thank you for sticking with it.

@mlopatka merged commit 30dbbae into mozilla:master on Mar 25, 2020

Successfully merging this pull request may close these issues.

[Outreachy applications] Importance score for dataset training samples