[BUG] Performance bottleneck in `DataFrame.sort_index` when there is a `RangeIndex` #9234

galipremsagar · 2021-09-15T19:49:04Z

Describe the bug
When a DataFrame's index is a `RangeIndex` that is already sorted, there is no need to materialize the `RangeIndex` and sort it again. The current behavior also deviates from pandas, which returns a `RangeIndex`:

Steps/Code to reproduce bug

>>> import cudf
>>> df = cudf.DataFrame({'a':[1, 2, 3]})
>>> df
   a
0  1
1  2
2  3
>>> df.index
RangeIndex(start=0, stop=3, step=1)
>>> df.sort_index().index
Int64Index([0, 1, 2], dtype='int64')
>>> df.to_pandas().sort_index().index
RangeIndex(start=0, stop=3, step=1)

Expected behavior

>>> df.sort_index().index
RangeIndex(start=0, stop=3, step=1)
>>> df.to_pandas().sort_index().index
RangeIndex(start=0, stop=3, step=1)
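
To illustrate why no materialization should be needed, here is a minimal sketch that uses Python's built-in `range` as a stand-in for `RangeIndex`. `sort_range` is a hypothetical helper, not a cuDF or pandas function, and the reversal branch is an assumption for illustration only; nothing in this issue confirms how cuDF handles a `RangeIndex` in the opposite order.

```python
def sort_range(r: range, ascending: bool = True) -> range:
    """Return the elements of `r` in the requested order, still as a range.

    Hypothetical helper: a range is fully described by (start, stop, step),
    so "sorting" it never requires building the list of its elements.
    """
    if len(r) <= 1:
        # Empty or single-element ranges are trivially sorted.
        return r
    currently_ascending = r.step > 0
    if currently_ascending == ascending:
        # Already in the requested order: return it untouched.
        return r
    # Opposite order: slicing a range yields another range (assumed behavior
    # for illustration, not taken from the cuDF implementation).
    return r[::-1]


print(sort_range(range(0, 3)))
print(sort_range(range(0, 3), ascending=False))
```

In both calls the result is still a `range`, which mirrors the expected `RangeIndex` result shown above.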


Fixes: #9234

- [x] This PR introduces optimizations to `sort_index` when there is an already sorted `Index` object and avoids sorting them and performing a `take` operation on them. This **alleviates** a lot of **memory pressure** and has **a 3x to 6x speed up.**

On a T4 GPU:

`This PR`:

```python
In [1]: import cudf

In [2]: df = cudf.DataFrame({'a':[1, 2, 3]*100000000, 'b':['a', 'b', 'c']*100000000, 'c':[0.0, 0.12, 10.12]*100000000})

In [3]: %timeit df.sort_index()
174 ms ± 368 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

`branch-21.10`:

Won't fit into memory and will error :( on T4 as it tries to perform argsort on an already sorted index.

`THIS PR`:

```python
In [1]: import cudf

In [2]: df = cudf.DataFrame({'a':[1, 2, 3]*10000000, 'b':['a', 'b', 'c']*10000000, 'c':[0.0, 0.12, 10.12]*10000000})

In [3]: %timeit df.sort_index(ascending=False)
69.1 ms ± 221 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %timeit df.sort_index()
15.2 ms ± 213 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: df_reversed = df[::-1]

In [6]: %timeit df_reversed.sort_index()
72.6 ms ± 433 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %timeit df_reversed.sort_index(ascending=False)
24.1 ms ± 423 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

`branch-21.10`:

```python
In [1]: import cudf

In [2]: df = cudf.DataFrame({'a':[1, 2, 3]*10000000, 'b':['a', 'b', 'c']*10000000, 'c':[0.0, 0.12, 10.12]*10000000})

In [3]: %timeit df.sort_index(ascending=False)
71.6 ms ± 141 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %timeit df.sort_index()
71.7 ms ± 189 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [5]: df_reversed = df[::-1]

In [6]: %timeit df_reversed.sort_index()
69.1 ms ± 201 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %timeit df_reversed.sort_index(ascending=False)
69 ms ± 127 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

- [x] Also expands params to `Series.sort_index` and refactored the common implementation to `Frame._sort_index`.

Authors:
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- Michael Wang (https://github.com/isVoid)
- Benjamin Zaitlen (https://github.com/quasiben)

URL: #9238
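
The timings in the PR description suggest a fast path that applies only when the index is already in the requested order, with a full sort otherwise. A minimal pure-Python sketch of that idea follows. `plan_sort` is a hypothetical name, plain sequences stand in for a GPU index, and this is not the code from #9238.

```python
from typing import List, Sequence, Union


def plan_sort(index: Sequence[int], ascending: bool = True) -> Union[slice, List[int]]:
    """Return a row selector that puts `index` in the requested order.

    Hypothetical helper: returns `slice(None)` (meaning "keep rows as they
    are") when the index is already ordered, so the caller can skip both the
    argsort and the gather/take step. Otherwise returns explicit positions.
    """
    pairs = list(zip(index, index[1:]))
    if ascending:
        already_sorted = all(a <= b for a, b in pairs)
    else:
        already_sorted = all(a >= b for a, b in pairs)
    if already_sorted:
        # Fast path: one linear scan, no sort, no reordering of the data.
        return slice(None)
    # General case: fall back to a stable argsort over the index values.
    return sorted(range(len(index)), key=index.__getitem__, reverse=not ascending)


rows = ["x", "y", "z"]
selector = plan_sort([0, 1, 2])
print(rows[selector] if isinstance(selector, slice) else [rows[i] for i in selector])
```

The monotonicity check is a single linear pass, whereas the fallback needs a sort plus a reordering of every column, which is where the memory pressure mentioned above would come from.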

galipremsagar added the bug (Something isn't working) and Python (Affects Python cuDF API) labels Sep 15, 2021

galipremsagar self-assigned this Sep 15, 2021

galipremsagar mentioned this issue Sep 16, 2021

[REVIEW] Dataframe.sort_index optimizations #9238

Merged

2 tasks

rapids-bot bot closed this as completed in #9238 Sep 16, 2021
