Share `factorize` implementation with Index and cudf module #6885
Please update the changelog in order to start CI tests. View the gpuCI docs here.
I think this is missing the top-level `cudf.factorize` function?
@kkraus14 I moved things around a bunch here. Let me know if it makes more sense this way. Also, 0.18 for this I assume?
Codecov Report
@@              Coverage Diff               @@
##           branch-0.18    #6885     +/-   ##
==============================================
+ Coverage      82.01%   82.11%   +0.09%
==============================================
  Files             96       97       +1
  Lines          16340    16492     +152
==============================================
+ Hits           13402    13543     +141
- Misses          2938     2949      +11
Continue to review full report at Codecov.
python/cudf/cudf/core/series.py
(labels, cats) : (Series, Series)
    - *labels* contains the encoded values
    - *cats* contains the categories in order that the N-th
      item corresponds to the (N-1) code.
The `cats` return type is input dependent, no? If an Index is input, I think it's expected to get an Index in return, for example.
ohh! I did not notice that at all. Good catch
I looked into this and I wanted to make sure the following makes sense. In pandas, factorizing a `pd.Series` or `pd.Index` object gives us back a `pd.Index` here, whereas factorizing a NumPy array just returns a NumPy array. Would we want to return a cupy array for cupy input then? Would we want to return a cupy array for `labels` as well? (It is currently a `cudf.Series`, presumably to avoid a host copy.)
For `labels`, does pandas return a `pd.Index` object as well?
I think it makes sense to follow pandas here, where cudf inputs --> cudf index, cupy inputs --> cupy array.
For `labels`, pandas returns a NumPy array in all three cases: Series, Index, and NumPy array. So we'd return a `cupy` array here, I assume?
Agreed re: your second point. Since `factorize` is a user-facing API, would this warrant the `breaking` label?
> For `labels`, pandas returns a NumPy array in all three cases: Series, Index, and NumPy array. So we'd return a `cupy` array here, I assume?

Yes, this makes sense to me, assuming it's always returning integers.

> Agreed re: your second point. Since `factorize` is a user-facing API, would this warrant the `breaking` label?

Yup, this would be a potentially breaking change.
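For reference, the pandas behavior described in this thread can be checked with a short sketch like the one below. It uses only `pd.factorize` on the three input kinds mentioned (Series, Index, NumPy array) and records the types of the returned codes and uniques. The sample values and the `results` dictionary are illustrative assumptions, not part of the PR.

```python
import numpy as np
import pandas as pd

# Illustrative sample data (an assumption, not taken from the PR).
values = ["b", "a", "b", "c"]

inputs = {
    "Series": pd.Series(values),
    "Index": pd.Index(values),
    "ndarray": np.array(values, dtype=object),
}

# Factorize each input kind and keep the results for inspection.
results = {}
for name, obj in inputs.items():
    codes, uniques = pd.factorize(obj)
    results[name] = (codes, uniques)
    # Codes are a NumPy array in every case; the type of the uniques
    # follows the input (Index for Series/Index, ndarray for ndarray).
    print(name, type(codes).__name__, type(uniques).__name__)
```

Following the agreement above, the cudf analogue would presumably return a cupy array for the labels and a cudf index or cupy array for the categories depending on the input, but that is the proposal under discussion rather than something this sketch demonstrates.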
LGTM!
edit: pending Keith's suggestions of course :)
rerun tests
Share the implementation of `cudf.Series.factorize` with the `Index` class and the `cudf` module namespace.

Closes #6871
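As a rough illustration of the idea in this description, the sketch below shows one way a single module-level `factorize` function could back both a Series-like and an Index-like class. Everything here is hypothetical: `ToySeries`, `ToyIndex`, and `_FactorizeMixin` are plain-Python stand-ins invented for the example, and this is not the code added by the PR.

```python
def factorize(values):
    """Toy module-level factorize: return (codes, uniques).

    Uniques are kept in first-seen order. Hypothetical stand-in for a
    shared implementation; not the actual cudf code.
    """
    is_toy_container = isinstance(values, (ToySeries, ToyIndex))
    data = values.data if is_toy_container else list(values)

    seen = {}   # value -> integer code, in first-seen order
    codes = []
    for v in data:
        # Assign the next free code the first time a value appears.
        codes.append(seen.setdefault(v, len(seen)))
    uniques = list(seen)

    # Mirror the input-dependent return type discussed in the review:
    # container inputs get an index-like object back for the uniques.
    if is_toy_container:
        return codes, ToyIndex(uniques)
    return codes, uniques


class _FactorizeMixin:
    """Hypothetical mixin so both classes delegate to the shared function."""

    def factorize(self):
        return factorize(self)


class ToyIndex(_FactorizeMixin):
    def __init__(self, data):
        self.data = list(data)


class ToySeries(_FactorizeMixin):
    def __init__(self, data):
        self.data = list(data)


codes, cats = ToySeries(["b", "a", "b"]).factorize()
print(codes, cats.data)
```

Keeping the logic in one function means the method on each class and the module-level entry point cannot drift apart, which appears to be the motivation stated in the description.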