-
Notifications
You must be signed in to change notification settings - Fork 907
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Series.drop
api
#7304
Add Series.drop
api
#7304
Conversation
Codecov Report
@@ Coverage Diff @@
## branch-0.19 #7304 +/- ##
===============================================
+ Coverage 81.86% 82.38% +0.51%
===============================================
Files 101 101
Lines 16884 17328 +444
===============================================
+ Hits 13822 14275 +453
+ Misses 3062 3053 -9
Continue to review full report at Codecov.
|
python/cudf/cudf/core/dataframe.py
Outdated
|
||
labels: a list of labels specifying the rows to drop | ||
""" | ||
return self.join(cudf.DataFrame(index=labels), how="leftanti") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice - very elegant!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the join parameters are fixed here, it might be worth jumping straight to the cython without first creating a dataframe. Something like
idx = Table(index=labels)
res = cudf._lib.join.join(self, idx, 'leftanti', None, left_on=[], right_on=[], left_index=True, right_index=True)
return self.__class__._from_table(res)
This way you avoid all the python overhead of constructing a dataframe and then a bunch of unecessary checking and typecasting that the merge codepath uses, which isn't necessarily needed in this case. The downside is you might want to throw if the labels passed dont match the dtype of the target series index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a fast path that could exist in the merge code that we're not currently enabling?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure how useful it would be for users, but after the upcoming join improvements it might be useful as an internal utility for join
at least, since join is just a narrow case of merge with left_index=True
, right_index=True
and no on
parameters.
A lot of the merge code is there just to validate all the combinations of arguments a user could pass and fix their dtypes if necessary. We should be able to get around that internally if we know something about what our cython API expects and are sure what we are passing to it is right. It would also help if cudf._lib.join.join
had a better "developer facing" API, so the actual call to it from didn't seem so weird.
The counterargument is that to avoid libcudf itself erroring in an invalid case, we have to do a dtype check, to make sure the labels and the index are the same type. It could be argued that we might as well let the full user facing merge code error since it's already built to error in that case. But then, if someone passes something invalid, it's almost just as weird that the error comes from inside join.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lot of the merge code is there just to validate all the combinations of arguments a user could pass and fix their dtypes if necessary. We should be able to get around that internally...
Won't we still need dtype validation, etc., here? The user-provided inputs to drop()
could be of different dtype from the index columns.
It would also help if cudf._lib.join.join had a better "developer facing" API, so the actual call to it from didn't seem so weird.
Agree that we need a better developer-facing join API. FWIW, that won't be coming from Cython though, as the Cython API is going to be very simple (just returning a tuple of gathermaps).
But then, if someone passes something invalid, it's almost just as weird that the error comes from inside join
Maybe we could wrap the call to join
here in a try...except
and raise a more informative error message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with that, I guess my only question is in general, should we avoid using our own user facing API entrypoints internally in favor of a lower level API, just to avoid all the extra overhead we put in the user facing APIs that are meant to check every edge case of everything they could possibly be doing wrong?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we're in 100% agreement that we should use internal APIs that bypass Pandas semantics for efficiency. My only point is that doesn't always have to be calling all the way down to Cython. For example _gather
is an internal API at the Python level.
Here specifically, it looks like we need a join
API without all the bells and whistles of the top-level merge()
method, and we currently don't really have that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
btw in drop there is this line:
if errors == "raise" and not labels.isin(levels_index).all():
raise KeyError("One or more values not found in axis")
which I think guards against the dtype mismatch issue? @brandon-b-miller
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Offline sync with @brandon-b-miller : There is room to optimize for join parameter logic checks, but should defer until the join
refactor is finished.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now, can we add a TODO here to optimize later by calling a lower-level join API? Ideally, the TODO should document exactly what logic checks we would like to avoid here.
Overall looks great - few minor suggestions and one question about implementation for |
Had one question/idea otherwise looks great :) |
rerun tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
@gpucibot merge |
rerun tests |
2 similar comments
rerun tests |
rerun tests |
Closes rapidsai#7045 This PR introduces `Series.drop` API. `Series.drop` allows users to drop certain elements in the series specified `labels` or `index` parameter. Example: ```python3 >>> s = cudf.Series([1, 2, 3], index=['x', 'y', 'z']) >>> s.drop(labels=['y']) x 1 z 3 dtype: int64 ``` - [x] Add series test case - [x] Move common code path from `DataFrame.drop` to helper function - [x] Add typing annotation - [x] Add docstring Authors: - Michael Wang (@isVoid) Approvers: - Ashwin Srinath (@shwina) - GALI PREM SAGAR (@galipremsagar) URL: rapidsai#7304
Closes #7045
This PR introduces
Series.drop
API.Series.drop
allows users to drop certain elements in the series specifiedlabels
orindex
parameter.Example:
DataFrame.drop
to helper function