Describe the bug
When a DataFrame's index is a RangeIndex that is already sorted, there is no need to materialize and sort the RangeIndex again. This also deviates from pandas behavior.
Fixes: #9234
- [x] This PR optimizes `sort_index` for the case where the `Index` object is already sorted: it skips both the sort and the subsequent `take` operation. This **alleviates** a lot of **memory pressure** and gives **a 3x to 6x speedup.**
On a T4 GPU:
`This PR`:
```python
In [1]: import cudf
In [2]: df = cudf.DataFrame({'a':[1, 2, 3]*100000000, 'b':['a', 'b', 'c']*100000000, 'c':[0.0, 0.12, 10.12]*100000000})
In [3]: %timeit df.sort_index()
174 ms ± 368 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
`branch-21.10`:
Does not fit into memory and errors out on a T4, because it tries to perform an argsort on an already sorted index.
`This PR` (smaller DataFrame, 30 million rows):
```python
In [1]: import cudf
In [2]: df = cudf.DataFrame({'a':[1, 2, 3]*10000000, 'b':['a', 'b', 'c']*10000000, 'c':[0.0, 0.12, 10.12]*10000000})
In [3]: %timeit df.sort_index(ascending=False)
69.1 ms ± 221 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [4]: %timeit df.sort_index()
15.2 ms ± 213 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [5]: df_reversed = df[::-1]
In [6]: %timeit df_reversed.sort_index()
72.6 ms ± 433 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [7]: %timeit df_reversed.sort_index(ascending=False)
24.1 ms ± 423 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
`branch-21.10`:
```python
In [1]: import cudf
In [2]: df = cudf.DataFrame({'a':[1, 2, 3]*10000000, 'b':['a', 'b', 'c']*10000000, 'c':[0.0, 0.12, 10.12]*10000000})
In [3]: %timeit df.sort_index(ascending=False)
71.6 ms ± 141 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [4]: %timeit df.sort_index()
71.7 ms ± 189 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [5]: df_reversed = df[::-1]
In [6]: %timeit df_reversed.sort_index()
69.1 ms ± 201 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [7]: %timeit df_reversed.sort_index(ascending=False)
69 ms ± 127 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
- [x] Also expands the parameters of `Series.sort_index` and refactors the common implementation into `Frame._sort_index`.
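
The fast path described above can be sketched in plain Python. This is only an illustration of the idea, not the code in this PR: `is_sorted` and `sort_index_sketch` are hypothetical names, the index is modeled as a builtin `range` or a list, and the rows as a list. The point is that when the index is already monotonic in the requested direction, both the argsort and the `take` are skipped and the index is never materialized.

```python
from typing import List, Sequence, Tuple


def is_sorted(index: Sequence[int], ascending: bool = True) -> bool:
    """Return True if `index` is already monotonic in the requested direction."""
    if isinstance(index, range):
        # A range is monotonic by construction, so only the sign of its
        # step matters and no values need to be materialized.
        if len(index) <= 1:
            return True
        return index.step > 0 if ascending else index.step < 0
    pairs = zip(index, index[1:])
    if ascending:
        return all(a <= b for a, b in pairs)
    return all(a >= b for a, b in pairs)


def sort_index_sketch(
    index: Sequence[int], rows: List, ascending: bool = True
) -> Tuple[Sequence[int], List]:
    """Hypothetical stand-in for a shared `_sort_index` helper."""
    if is_sorted(index, ascending):
        # Fast path: no argsort and no take; the index object is reused.
        return index, list(rows)
    # Slow path: compute the ordering (argsort), then gather (take).
    order = sorted(range(len(index)), key=index.__getitem__, reverse=not ascending)
    return [index[i] for i in order], [rows[i] for i in order]


rows = ["a", "b", "c", "d", "e"]
ascending_index = range(5)

# Already sorted ascending: the same range object comes back untouched.
same_index, same_rows = sort_index_sketch(ascending_index, rows)

# Requested order differs from the stored order: falls back to argsort + take.
desc_index, desc_rows = sort_index_sketch(ascending_index, rows, ascending=False)
```

In this sketch only an index that is already in the requested order takes the fast path, which is consistent with the timings above: `df.sort_index()` and `df_reversed.sort_index(ascending=False)` get faster, while the two cases that need a real reordering stay at roughly the `branch-21.10` numbers.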
Authors:
- GALI PREM SAGAR (https://github.com/galipremsagar)
Approvers:
- Michael Wang (https://github.com/isVoid)
- Benjamin Zaitlen (https://github.com/quasiben)
URL: #9238