Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Series.drop api #7304

Merged
merged 22 commits into from
Mar 10, 2021
Merged

Add Series.drop api #7304

merged 22 commits into from
Mar 10, 2021

Conversation

isVoid
Copy link
Contributor

@isVoid isVoid commented Feb 4, 2021

Closes #7045

This PR introduces Series.drop API. Series.drop allows users to drop certain elements in the series specified labels or index parameter.

Example:

>>> s = cudf.Series([1, 2, 3], index=['x', 'y', 'z'])
>>> s.drop(labels=['y'])
x    1
z    3
dtype: int64
  • Add series test case
  • Move common code path from DataFrame.drop to helper function
  • Add typing annotation
  • Add docstring

@isVoid isVoid added 2 - In Progress Currently a work in progress Python Affects Python cuDF API. feature request New feature or request non-breaking Non-breaking change labels Feb 4, 2021
@codecov
Copy link

codecov bot commented Feb 4, 2021

Codecov Report

Merging #7304 (c5af5ca) into branch-0.19 (7871e7a) will increase coverage by 0.51%.
The diff coverage is 91.93%.

Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.19    #7304      +/-   ##
===============================================
+ Coverage        81.86%   82.38%   +0.51%     
===============================================
  Files              101      101              
  Lines            16884    17328     +444     
===============================================
+ Hits             13822    14275     +453     
+ Misses            3062     3053       -9     
Impacted Files Coverage Δ
python/cudf/cudf/core/index.py 93.34% <ø> (+0.48%) ⬆️
python/cudf/cudf/core/column/column.py 87.80% <75.00%> (+0.04%) ⬆️
python/cudf/cudf/core/column/numerical.py 94.85% <85.71%> (-0.17%) ⬇️
python/cudf/cudf/core/frame.py 89.12% <89.47%> (+0.10%) ⬆️
python/cudf/cudf/core/column/decimal.py 93.33% <90.47%> (-1.54%) ⬇️
python/cudf/cudf/core/dataframe.py 90.58% <95.00%> (+0.11%) ⬆️
python/cudf/cudf/core/series.py 91.57% <96.77%> (+0.78%) ⬆️
python/cudf/cudf/core/column/string.py 86.76% <100.00%> (+0.26%) ⬆️
python/cudf/cudf/io/feather.py 100.00% <0.00%> (ø)
python/cudf/cudf/comm/serialize.py 0.00% <0.00%> (ø)
... and 45 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cc104ef...a4f9007. Read the comment docs.

@isVoid isVoid changed the title Add Series.drop Add Series.drop api Feb 4, 2021
@isVoid isVoid marked this pull request as ready for review February 4, 2021 20:26
@isVoid isVoid requested a review from a team as a code owner February 4, 2021 20:26
@isVoid isVoid added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Feb 4, 2021

labels: a list of labels specifying the rows to drop
"""
return self.join(cudf.DataFrame(index=labels), how="leftanti")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice - very elegant!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the join parameters are fixed here, it might be worth jumping straight to the cython without first creating a dataframe. Something like

idx = Table(index=labels)
res = cudf._lib.join.join(self, idx, 'leftanti', None, left_on=[], right_on=[], left_index=True, right_index=True)
return self.__class__._from_table(res)

This way you avoid all the python overhead of constructing a dataframe and then a bunch of unecessary checking and typecasting that the merge codepath uses, which isn't necessarily needed in this case. The downside is you might want to throw if the labels passed dont match the dtype of the target series index.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a fast path that could exist in the merge code that we're not currently enabling?

Copy link
Contributor

@brandon-b-miller brandon-b-miller Feb 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how useful it would be for users, but after the upcoming join improvements it might be useful as an internal utility for join at least, since join is just a narrow case of merge with left_index=True, right_index=True and no on parameters.

A lot of the merge code is there just to validate all the combinations of arguments a user could pass and fix their dtypes if necessary. We should be able to get around that internally if we know something about what our cython API expects and are sure what we are passing to it is right. It would also help if cudf._lib.join.join had a better "developer facing" API, so the actual call to it from didn't seem so weird.

The counterargument is that to avoid libcudf itself erroring in an invalid case, we have to do a dtype check, to make sure the labels and the index are the same type. It could be argued that we might as well let the full user facing merge code error since it's already built to error in that case. But then, if someone passes something invalid, it's almost just as weird that the error comes from inside join.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of the merge code is there just to validate all the combinations of arguments a user could pass and fix their dtypes if necessary. We should be able to get around that internally...

Won't we still need dtype validation, etc., here? The user-provided inputs to drop() could be of different dtype from the index columns.

It would also help if cudf._lib.join.join had a better "developer facing" API, so the actual call to it from didn't seem so weird.

Agree that we need a better developer-facing join API. FWIW, that won't be coming from Cython though, as the Cython API is going to be very simple (just returning a tuple of gathermaps).

But then, if someone passes something invalid, it's almost just as weird that the error comes from inside join

Maybe we could wrap the call to join here in a try...except and raise a more informative error message?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with that, I guess my only question is in general, should we avoid using our own user facing API entrypoints internally in favor of a lower level API, just to avoid all the extra overhead we put in the user facing APIs that are meant to check every edge case of everything they could possibly be doing wrong?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're in 100% agreement that we should use internal APIs that bypass Pandas semantics for efficiency. My only point is that doesn't always have to be calling all the way down to Cython. For example _gather is an internal API at the Python level.

Here specifically, it looks like we need a join API without all the bells and whistles of the top-level merge() method, and we currently don't really have that.

Copy link
Contributor Author

@isVoid isVoid Feb 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw in drop there is this line:

if errors == "raise" and not labels.isin(levels_index).all():
  raise KeyError("One or more values not found in axis")

which I think guards against the dtype mismatch issue? @brandon-b-miller

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Offline sync with @brandon-b-miller : There is room to optimize for join parameter logic checks, but should defer until the join refactor is finished.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, can we add a TODO here to optimize later by calling a lower-level join API? Ideally, the TODO should document exactly what logic checks we would like to avoid here.

@shwina
Copy link
Contributor

shwina commented Feb 4, 2021

Overall looks great - few minor suggestions and one question about implementation for MultiIndex

@brandon-b-miller
Copy link
Contributor

Had one question/idea otherwise looks great :)

@isVoid isVoid requested a review from shwina February 9, 2021 17:49
@isVoid
Copy link
Contributor Author

isVoid commented Feb 19, 2021

rerun tests

Copy link
Contributor

@shwina shwina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@shwina shwina added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Feb 19, 2021
@galipremsagar
Copy link
Contributor

@gpucibot merge

@galipremsagar
Copy link
Contributor

rerun tests

2 similar comments
@isVoid
Copy link
Contributor Author

isVoid commented Mar 8, 2021

rerun tests

@galipremsagar
Copy link
Contributor

rerun tests

@rapids-bot rapids-bot bot merged commit e628101 into rapidsai:branch-0.19 Mar 10, 2021
hyperbolic2346 pushed a commit to hyperbolic2346/cudf that referenced this pull request Mar 25, 2021
Closes rapidsai#7045 

This PR introduces `Series.drop` API. `Series.drop` allows users to drop certain elements in the series specified `labels` or `index` parameter.

Example:
```python3
>>> s = cudf.Series([1, 2, 3], index=['x', 'y', 'z'])
>>> s.drop(labels=['y'])
x    1
z    3
dtype: int64
```

- [x] Add series test case
- [x] Move common code path from `DataFrame.drop` to helper function
- [x] Add typing annotation
- [x] Add docstring

Authors:
  - Michael Wang (@isVoid)

Approvers:
  - Ashwin Srinath (@shwina)
  - GALI PREM SAGAR (@galipremsagar)

URL: rapidsai#7304
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge feature request New feature or request non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Add support for Series.drop
4 participants