Data sensitivity feature, fixes #8 #31
Conversation
Nice work! Your PR provides an elegant solution to this task. I'm requesting changes primarily because I think a visualization would be more appropriate for the final list, as commented below, but please also consider the other recommendations.
Your calculate_senstivity() function allows you to pass in the model; what would be even better is to also pass in a scoring function (which can default to accuracy), along the lines of the sketch below.
Also, please remove the other unrelated files from the PR.
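For example, the evaluation helper could take a scoring callable with accuracy as the default. This is just a sketch reusing the function name from the notebook; the actual signature and parameter names may differ:

```python
from sklearn import metrics

def model_performance_evaluater_score(model_func, X_train, y_train, X_test, y_test,
                                      scoring=metrics.accuracy_score):
    """Fit the model on the given training data and score it on the test set.

    `scoring` is any callable with signature scoring(y_true, y_pred);
    it defaults to accuracy but any other metric could be passed in.
    """
    model_func.fit(X_train, y_train)
    y_pred = model_func.predict(X_test)
    return scoring(y_test, y_pred)
```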
model_func.fit(X_train, y_train)
# y_pred = model_func.predict(X_test)
# return metrics.accuracy_score(y_test, y_pred)
# I am not sure whether accuracy_score or score is correct for the sensitivity calculation, so I commented out the accuracy code
I'm pretty sure these do the same thing. I think model_func.score()
is just a shortcut.
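A quick way to check that equivalence (a sketch reusing the notebook's variable names; it assumes a scikit-learn classifier, whose default .score() is mean accuracy):

```python
from sklearn import metrics

model_func.fit(X_train, y_train)
y_pred = model_func.predict(X_test)

# For scikit-learn classifiers, .score(X, y) is mean accuracy on the given data,
# so these two values should agree.
print(metrics.accuracy_score(y_test, y_pred))
print(model_func.score(X_test, y_test))
```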
}
],
"source": [
"print(\"printing model performance score for all parallel datasets\", list_scores)\n",
Could you show a visualization instead of the full list?
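For instance, a simple histogram of the scores would summarize the list at a glance. A sketch, assuming matplotlib is available and list_scores holds one score per parallel dataset:

```python
import matplotlib.pyplot as plt

# Summarize the per-parallel-dataset scores instead of printing the raw list.
plt.hist(list_scores, bins=20)
plt.xlabel("Score on parallel dataset")
plt.ylabel("Number of parallel datasets")
plt.title("Model scores across parallel datasets")
plt.show()
```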
list_train_data_label = create_all_parallel_dbs(X_train, y_train)  # returns list containing tuples of training data and label
# print(len(list_train_data_label))
for parallel_train, parallel_label in list_train_data_label:
    parallel_dataset_score = model_performance_evaluater_score(model_func, parallel_train, parallel_label, X_test, y_test)
Rather than precomputing all the parallel DBs, why not call create_parallel_db()
inside this loop and compute the score right away for each one (see the sketch below)? This will probably scale better for larger datasets.
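Something along these lines; this is only a sketch and assumes create_parallel_db(X, y, i) returns the training data and labels with the i-th sample removed, which may not match the helper's actual signature:

```python
list_scores = []
for i in range(len(X_train)):
    # Build one parallel dataset at a time instead of materializing all of them up front.
    parallel_train, parallel_label = create_parallel_db(X_train, y_train, i)
    score = model_performance_evaluater_score(
        model_func, parallel_train, parallel_label, X_test, y_test)
    list_scores.append(score)
```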
# print(len(list_train_data_label))
for parallel_train, parallel_label in list_train_data_label:
    parallel_dataset_score = model_performance_evaluater_score(model_func, parallel_train, parallel_label, X_test, y_test)
    dataset_distance = np.abs(full_dataset_score - parallel_dataset_score)  # L1 sensitivity
I'm thinking it would be interesting to compute the distance without the absolute value. It would probably be valuable to know whether certain training samples actually improve the model fit when they are left out.
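For example (a sketch using the names from the snippet above):

```python
# Signed difference: a negative value means the parallel dataset scored higher,
# i.e. the left-out sample may have been hurting the fit.
signed_distance = full_dataset_score - parallel_dataset_score
# The L1 sensitivity would then be the maximum absolute difference over all parallel datasets.
```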
@Sidrah-Madiha please let us know if you intend to continue with this PR, or if this work has been integrated into another PR and this one can be closed out.
@mlopatka, yes, I am working on fixing it today.
@mlopatka please accept my PR for this issue; I have addressed all the requested changes.
Force-pushed from 36c89d9 to d23f197
Hi @dzeber, please merge this PR; I have addressed all the requested changes.
Hi @dzeber @mlopatka I have further improved this fix, I have added a scoring parameter with default "accuracy_score in function "model_performance_evaluater_score" Please review |
Thank you for addressing all of the outstanding feedback in this PR.
I know it was hard to resolve all the merge conflicts, but thank you for sticking with it.
The branch that I have merged fixes #8.
@dzeber I was not sure whether the performance metric should be accuracy_score or score, so I implemented it with score and commented out the code that used accuracy_score. Please advise on the next steps to improve this fix.