Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Share factorize implementation with Index and cudf module #6885

Merged
merged 20 commits into from
Dec 19, 2020

Conversation

brandon-b-miller
Copy link
Contributor

Share the implementation of cudf.Series.factorize with the Index class and the cudf module namespace.

Closes #6871

@brandon-b-miller brandon-b-miller added feature request New feature or request Python Affects Python cuDF API. non-breaking Non-breaking change labels Dec 2, 2020
@brandon-b-miller brandon-b-miller requested a review from a team as a code owner December 2, 2020 21:47
@brandon-b-miller brandon-b-miller self-assigned this Dec 2, 2020
@GPUtester
Copy link
Collaborator

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

Copy link
Collaborator

@kkraus14 kkraus14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is missing the top level cudf.factorize function?

python/cudf/cudf/core/series.py Outdated Show resolved Hide resolved
@brandon-b-miller
Copy link
Contributor Author

brandon-b-miller commented Dec 4, 2020

@kkraus14 I moved things around a bunch here. Let me know if it makes more sense this way. Also, 0.18 for this I assume?

@codecov
Copy link

codecov bot commented Dec 4, 2020

Codecov Report

Merging #6885 (d927cfc) into branch-0.18 (515a173) will increase coverage by 0.09%.
The diff coverage is 86.36%.

Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.18    #6885      +/-   ##
===============================================
+ Coverage        82.01%   82.11%   +0.09%     
===============================================
  Files               96       97       +1     
  Lines            16340    16492     +152     
===============================================
+ Hits             13402    13543     +141     
- Misses            2938     2949      +11     
Impacted Files Coverage Δ
python/cudf/cudf/core/series.py 91.15% <ø> (+0.05%) ⬆️
python/cudf/cudf/core/algorithms.py 82.35% <82.35%> (ø)
python/cudf/cudf/__init__.py 100.00% <100.00%> (ø)
python/cudf/cudf/core/index.py 93.27% <100.00%> (+0.01%) ⬆️
python/cudf/cudf/core/multiindex.py 81.20% <100.00%> (ø)
python/cudf/cudf/core/frame.py 89.97% <0.00%> (-0.38%) ⬇️
python/cudf/cudf/core/column/timedelta.py 89.16% <0.00%> (-0.38%) ⬇️
python/cudf/cudf/core/column/numerical.py 94.41% <0.00%> (-0.16%) ⬇️
python/dask_cudf/dask_cudf/io/tests/test_csv.py 100.00% <0.00%> (ø)
... and 10 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 515a173...d927cfc. Read the comment docs.

@kkraus14 kkraus14 added the 3 - Ready for Review Ready for review by team label Dec 4, 2020
python/cudf/cudf/core/series.py Outdated Show resolved Hide resolved
python/cudf/cudf/tests/test_factorize.py Show resolved Hide resolved
python/cudf/cudf/core/series.py Outdated Show resolved Hide resolved
Comment on lines 5235 to 5238
(labels, cats) : (Series, Series)
- *labels* contains the encoded values
- *cats* contains the categories in order that the N-th
item corresponds to the (N-1) code.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cats return type is input dependent, no? If an Index is input I think it's expected to get an Index in return for example.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ohh! I did not notice that at all. Good catch

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked into this and I wanted to make sure the following makes sense. In pandas, factorizing a pd.Series or pd.Index object gives us back a pd.Index here, whereas factorizing an numpy array just returns a numpy array. Would we want to return a cupy array for cupy input then? Would we want to return a cupy array for labels as well? (currently a cudf.Series presumably to avoid a host copy)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For labels does Pandas return a pd.Index object as well?

I think it makes sense to follow Pandas here, where cudf inputs --> cudf index, cupy inputs --> cupy array.

Copy link
Contributor Author

@brandon-b-miller brandon-b-miller Dec 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For labels, pandas returns a numpy array in all three cases - Series, Index, and NumPy array. So we'd return a cupy array here I assume?

Agreed re: your second point. Since factorize is a user-facing API, would this warrant the breaking label?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For labels, pandas returns a numpy array in all three cases - Series, Index, and NumPy array. So we'd return a cupy array here I assume?

Yes, this makes sense to me assuming it's always returning integers.

Agreed re: your second point. Since factorize is a user-facing API, would this warrant the breaking label?

Yup, this would be a potentially breaking change.

python/cudf/cudf/core/series.py Outdated Show resolved Hide resolved
Copy link
Contributor

@shwina shwina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

edit: pending Keith's suggestions of course :)

@brandon-b-miller brandon-b-miller added breaking Breaking change and removed non-breaking Non-breaking change labels Dec 11, 2020
python/cudf/cudf/core/algorithms.py Outdated Show resolved Hide resolved
@kkraus14
Copy link
Collaborator

rerun tests

@kkraus14 kkraus14 added 5 - Ready to Merge Testing and reviews complete, ready to merge 6 - Okay to Auto-Merge and removed 3 - Ready for Review Ready for review by team labels Dec 19, 2020
@rapids-bot rapids-bot bot merged commit 923cf49 into rapidsai:branch-0.18 Dec 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge breaking Breaking change feature request New feature or request Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] factorize function
4 participants