Add handling for nulls in dask_cudf.sorting.quantile_divisions
#9171
divisions = (
    sorted(
        divisions.dropna()
        .drop_duplicates()
        .astype(dtype)
        .values.tolist()
    )
    + [None] * divisions.null_count
)
Can we use .to_arrow and sort? That way we aren't tampering with the null ordering.
Were you thinking something like
divisions = (
    sorted(
        divisions.drop_duplicates()
        .astype(dtype)
        .to_arrow().tolist()
    )
)
In that case, the null ordering is still being tampered with by drop_duplicates, which places nulls first. From there we can't sort the resulting list, as it contains None, unless you meant using an Arrow sorting method before calling tolist()?
I see, I didn't notice the drop_duplicates call. Yeah, drop_duplicates results in non-deterministic ordering. But .to_arrow().tolist() seems cleaner to me.
Okay, in that case do you have any preference for how we handle sorting the list with nulls? Doing a quick search, this seems like a suitable solution:
# https://stackoverflow.com/questions/18411560/sort-list-while-pushing-none-values-to-the-end
sorted(..., key=lambda x: (x is None, x))
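For reference, here is a minimal pure-Python sketch of that key-based approach. The sample values and the helper name `sort_nulls_last` are made up for illustration and are not from the PR. It also shows why a plain `sorted()` call is not an option once the list contains None.

```python
# Illustrative values only; not taken from the PR.
values = [3, None, 1, 2, None]

# A plain sort fails because None cannot be ordered against integers.
try:
    sorted(values)
except TypeError as exc:
    print(f"plain sorted() fails: {exc}")


def sort_nulls_last(items):
    """Sort items ascending, pushing every None to the end.

    The key is a tuple: (False, x) for real values and (True, None) for
    nulls, so nulls compare greater on the first element and two nulls
    compare equal without ever evaluating None < None.
    """
    return sorted(items, key=lambda x: (x is None, x))


print(sort_nulls_last(values))
```

This keeps duplicate None entries, so the number of nulls in the list is unchanged by the sort.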
Try pyarrow's sort_indices; it seems to do what you want: https://arrow.apache.org/docs/python/generated/pyarrow.compute.sort_indices.html
If that's not possible, the StackOverflow approach looks fine to me.
Codecov Report
@@ Coverage Diff @@
## branch-21.10 #9171 +/- ##
===============================================
Coverage ? 10.78%
===============================================
Files ? 115
Lines ? 19113
Branches ? 0
===============================================
Hits ? 2062
Misses ? 17051
Partials ? 0
rerun tests
Seems like a reasonable change to me. Thanks @charlesbluca!
Thanks @charlesbluca!
@gpucibot merge
Closes #9157
Originally, dask_cudf.DataFrame.sort_values would fail if the DataFrame had enough null values that divisions contained nulls here:
cudf/python/dask_cudf/dask_cudf/sorting.py
Lines 189 to 191 in 1935a8a
Because you cannot get the values of a Series containing nulls, this PR drops the nulls from divisions before calling values, then appends the correct number of None entries to the resulting list to avoid this failure.
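As a rough illustration of that approach, the sketch below mimics the logic on a plain Python list rather than a cudf Series, so it runs without a GPU. The function name `divisions_with_trailing_nulls` and the sample data are hypothetical; the steps mirror the diff hunk quoted in the review (drop nulls, drop duplicates, sort, then append one None per original null).

```python
def divisions_with_trailing_nulls(values):
    """Plain-list stand-in for the null handling described above.

    Hypothetical helper: the PR itself operates on a cudf Series.
    """
    # Count the nulls before removing them (stands in for null_count).
    null_count = sum(v is None for v in values)

    # Drop nulls and duplicates (stands in for dropna().drop_duplicates()).
    non_null = {v for v in values if v is not None}

    # Sort the remaining values, then re-append the nulls at the end.
    return sorted(non_null) + [None] * null_count


# Made-up division candidates containing duplicates and nulls.
print(divisions_with_trailing_nulls([5, None, 1, 5, None, 3]))
```

Note that only the non-null values are deduplicated here; every null is re-appended, matching the `[None] * divisions.null_count` term in the diff.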