[REVIEW] Sklearn meta-estimators into namespace #3493
Conversation
This looks great! It's a straightforward change that will help make the transition easier for people currently using an sklearn-only workflow.
A few thoughts on this:
- We should probably delay merging this until [REVIEW] Ensure global_output_type is thread-safe #3497 is in, since our estimators currently do not play nice with sklearn's `GridSearchCV` in any sort of multithreaded environment.
- We should probably do something to explicitly and unambiguously give the fine folks of scikit-learn credit for these features in our docs. I could imagine doing that a couple of different ways. One would be to edit the docstring at import time, adding a line that clarifies where these classes come from. The other would be to wrap them in a class that provides a docstring which links directly to the scikit-learn documentation.
- Since we're bringing these features into our namespace, do we want to do something to test them as well? I don't think there's any need to duplicate sklearn's tests, but perhaps we could explicitly use these features with one of our estimators, which would open up the possibility of detecting bugs like the one addressed by [REVIEW] Ensure global_output_type is thread-safe #3497.
Updating the docstrings seems appealing... if we import 4 classes, we can iterate over them and append something vaguely like this: """This code is developed and maintained by scikit-learn (insert proper citation here) and imported by cuML to maintain the familiar sklearn.* namespace structure. cuML includes tests to ensure full compatibility of these wrappers with CUDA-based data and cuML estimators, but all of the underlying code is due to the scikit-learn developers."""
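The import-time docstring edit suggested above might look roughly like this sketch. The class bodies and note text here are illustrative stand-ins, not cuML's actual implementation; in the real module the loop would run over the four sklearn imports (e.g. `Pipeline`, `GridSearchCV`).

```python
# Sketch of appending an attribution note at import time. The classes
# below are stand-ins for the real sklearn imports; the note text is
# illustrative, not cuML's actual wording.

_SKLEARN_NOTE = (
    "\n\nThis code is developed and maintained by scikit-learn and "
    "imported by cuML to maintain the familiar sklearn.* namespace "
    "structure."
)

class Pipeline:
    """Stand-in for sklearn.pipeline.Pipeline."""

class GridSearchCV:
    """Stand-in for sklearn.model_selection.GridSearchCV."""

# Iterate over the re-exported classes and append the note once each.
for _cls in (Pipeline, GridSearchCV):
    _cls.__doc__ = (_cls.__doc__ or "") + _SKLEARN_NOTE
```

Assigning to `__doc__` works here because these are ordinary Python-defined classes, so the docstring can be mutated after import.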
Looks great... I just have one question/topic regarding tests.
GridSearchCV.__doc__ = """
ah, bummer that we have to copy-paste since these are in separate files (pipeline and model_selection) - if it ever goes to a third file I think we should break it out in a shared way
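Breaking it out in a shared way could be as small as a helper like this sketch; the function name, module placement, and note text are all illustrative, not part of the PR.

```python
# Hypothetical shared helper that pipeline.py and model_selection.py
# could both import instead of duplicating the docstring patch.
# Name and note text are illustrative.

_NOTE = (
    "\n\nThis class is developed and maintained by scikit-learn and "
    "re-exported by cuML for namespace convenience."
)

def patch_sklearn_doc(cls, note=_NOTE):
    """Append an attribution note to cls.__doc__, tolerating None."""
    cls.__doc__ = (cls.__doc__ or "") + note
    return cls
```

Each file would then call something like `GridSearchCV = patch_sklearn_doc(GridSearchCV)` right after its sklearn import, keeping the attribution text in one place.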
from cuml.svm import SVC

def test_pipeline():
Should we test that every model is compatible? Seems like we could do this with about 10 rows of data and make it fast. We could use the same approach as test_pickle to grab at least all regression and classification models pretty easily. This is an open question, so LMK if you think it's a bad idea...
It might also be nice to confirm that the output type doesn't get messed up. Basically, predict should preserve the on-GPU-ness of the data even through the pipeline (predict + check output class type). Not sure the scoring is even necessary, since this is more a question of whether the pipeline is flowing...
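The output-type assertion suggested here could follow a pattern like this sketch. The fake classes stand in for on-GPU data (cupy/cudf) and a cuml estimator, since the real test would build sklearn's Pipeline around something like `cuml.svm.SVC` with GPU input.

```python
# Sketch of the suggested output-type check, with stand-ins for on-GPU
# data and a cuml estimator. The real test would use cupy/cudf arrays
# and a cuml model inside sklearn's Pipeline.

class FakeGpuArray(list):
    """Stand-in for a cupy/cudf on-GPU container."""

class FakeCumlEstimator:
    """Stand-in estimator that mirrors its input container type."""

    def fit(self, X, y):
        return self

    def predict(self, X):
        # Mirror the input type, as cuml output-type handling should.
        return type(X)(0 for _ in X)

def test_predict_preserves_output_type():
    X = FakeGpuArray([1.0, 2.0, 3.0])
    y = FakeGpuArray([0, 1, 0])
    preds = FakeCumlEstimator().fit(X, y).predict(X)
    # Core assertion: output class matches the on-GPU input class.
    assert isinstance(preds, FakeGpuArray)

test_predict_preserves_output_type()
```

The point is the final `isinstance` check: it catches the "pipeline silently converts to host arrays" failure mode without needing any scoring.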
This looks good to me. I know you're still tweaking some things, but wanted to go ahead and approve what I've seen so far since I'll be out next week.
Looks great!
rerun tests
Force-pushed from 8699bad to d53f13a
rerun tests
Codecov Report
@@ Coverage Diff @@
## branch-0.19 #3493 +/- ##
===============================================
+ Coverage 79.21% 80.84% +1.62%
===============================================
Files 226 228 +2
Lines 17900 17742 -158
===============================================
+ Hits 14180 14343 +163
+ Misses 3720 3399 -321
@gpucibot merge
Closes #3484
Imports sklearn's Pipeline and GridSearchCV meta-estimators into the cuML namespace for ease of use.