[REVIEW] Dataframe.sort_index optimizations #9238
Codecov Report
@@ Coverage Diff @@
## branch-21.10 #9238 +/- ##
===============================================
Coverage ? 10.84%
===============================================
Files ? 115
Lines ? 18768
Branches ? 0
===============================================
Hits ? 2035
Misses ? 16733
Partials ? 0
…ort_index_optimizations
Besides the comments below, could the DataFrame and Series interfaces be consolidated as part of #9038?
elif (ascending and self.index.is_monotonic_increasing) or (
    not ascending and self.index.is_monotonic_decreasing
):
    outdf = self.copy()
Just wondering, is_monotonic_* is available for both Index and MultiIndex. Maybe this optimization can be applied regardless of object type?
We would have to extract the levels, which yields a DataFrame, and then round-trip that back to a MultiIndex object to do an is_monotonic_* check, which seems inefficient and memory-consuming.
Also, outside the context of this PR: I can see that the reason we need to convert the index into a DataFrame is that it depends on argsort and take. Hopefully we can sink them into Frame so that there is no need to convert to DataFrames.
The difficulty of sinking argsort is that, I believe, Series depends on a single-column sort while DataFrame depends on a multi-column sort.
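To make the single-column versus multi-column distinction concrete, here is a minimal pure-Python sketch of argsort followed by take. The helper names argsort_columns and take are hypothetical and written only for illustration; they are not cudf's API and do not reflect how cudf implements these operations on the GPU.

```python
from typing import List, Sequence


def argsort_columns(columns: Sequence[Sequence], ascending: bool = True) -> List[int]:
    """Return row positions that sort the rows lexicographically by `columns`.

    Hypothetical helper: a Series-style sort passes one column, while a
    DataFrame-style (or multi-level index) sort passes several columns,
    compared left to right.
    """
    n_rows = len(columns[0])
    # One tuple key per row, so ties in the first column fall through
    # to the next column.
    keys = [tuple(col[i] for col in columns) for i in range(n_rows)]
    return sorted(range(n_rows), key=keys.__getitem__, reverse=not ascending)


def take(values: Sequence, positions: Sequence[int]) -> list:
    """Gather `values` at the given row positions (hypothetical helper)."""
    return [values[i] for i in positions]


# Single-column sort, as a Series-like index would need.
single = argsort_columns([[3, 1, 2]])

# Multi-column sort, as a DataFrame-like or multi-level index would need.
level0 = [1, 0, 1, 0]
level1 = ["b", "z", "a", "y"]
multi = argsort_columns([level0, level1])

# Reorder the data with the computed positions.
data = [10, 20, 30, 40]
reordered = take(data, multi)
```

In this sketch the single-column case is just the multi-column case with one key, which is one possible way a shared implementation could cover both callers.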
Co-authored-by: Michael Wang <[email protected]>
…om/galipremsagar/cudf into dataframe_sort_index_optimizations
This is a special case because we might want to avoid
@gpucibot merge
Fixes: #9234
Optimizes sort_index when the Index object is already sorted: it avoids sorting the index and performing a take operation on it. This alleviates a lot of memory pressure and gives a 3x to 6x speedup.
On a T4 GPU:
This PR:
branch-21.10:
Won't fit into memory and will error :( on T4, as it tries to perform argsort on an already sorted index.
This PR:
branch-21.10:
Also changed Series.sort_index and refactored the common implementation into Frame._sort_index.
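The fast path described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code from this PR: is_monotonic and sort_by_index are hypothetical names, lists stand in for GPU columns, and the monotonic check is non-strict (equal neighbours count as sorted).

```python
from typing import Sequence, Tuple


def is_monotonic(index: Sequence, ascending: bool = True) -> bool:
    """True if `index` is non-decreasing (or non-increasing when descending)."""
    pairs = zip(index, index[1:])
    if ascending:
        return all(a <= b for a, b in pairs)
    return all(a >= b for a, b in pairs)


def sort_by_index(
    index: Sequence, values: Sequence, ascending: bool = True
) -> Tuple[list, list, bool]:
    """Sort `values` by `index`; returns (index, values, took_fast_path)."""
    if is_monotonic(index, ascending):
        # Fast path: the index is already in the requested order, so a copy
        # is enough and the argsort and take steps are skipped entirely.
        return list(index), list(values), True
    # Slow path: compute the sorting positions (argsort), then gather (take).
    order = sorted(range(len(index)), key=index.__getitem__, reverse=not ascending)
    return [index[i] for i in order], [values[i] for i in order], False


already_sorted = sort_by_index([1, 2, 3], ["a", "b", "c"])
needs_sort = sort_by_index([3, 1, 2], ["c", "a", "b"])
descending = sort_by_index([3, 2, 1], ["x", "y", "z"], ascending=False)
```

The monotonic check is a single linear pass with no extra allocation, whereas the slow path materializes a full array of positions before gathering, which is where the memory pressure mentioned above would come from.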