[REVIEW] Add Series.loc and DataFrame.loc #1622

shwina · 2019-05-03T22:10:10Z

🔥 😤 MAKE INDEXING IN CUDF GREAT AGAIN 😤 🔥

This PR makes indexing in cuDF faster and makes it feel more natural and pandas-like.

cuDF (latest):

In [4]: print(cudf.__version__)
0.7.1+0.g7ee0348.dirty

In [5]: SIZE = 100_000_000
In [6]: df = pd.DataFrame(np.random.rand(SIZE, 2), index=(SIZE+np.arange(SIZE)))
In [7]: gdf = cudf.from_pandas(df)
In [8]: %timeit gdf.loc[SIZE:SIZE+5000]
1.27 s ± 46.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

0.8 (this PR):

In [4]: print(cudf.__version__)
0.8.0a+1044.g61898e6.dirty

In [5]: SIZE = 100_000_000
In [6]: df = pd.DataFrame(np.random.rand(SIZE, 2), index=(SIZE+np.arange(SIZE)))
In [7]: gdf = cudf.from_pandas(df)
In [8]: %timeit gdf.loc[SIZE:SIZE+5000]
77.9 ms ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
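For readers unfamiliar with label slicing, here is a rough CPU-side sketch of what `loc[SIZE:SIZE+5000]` has to compute. It uses NumPy rather than cuDF and assumes a sorted index; `label_slice_bounds` is a hypothetical name, and this is not a claim about how the PR resolves the slice on the GPU. The point it illustrates is that a label slice includes both endpoints, so it maps to a half-open positional range.

```python
import numpy as np


def label_slice_bounds(sorted_index, start, stop):
    """Map an inclusive label slice to a half-open positional range.

    Hypothetical helper: assumes `sorted_index` is sorted ascending.
    """
    lo = int(np.searchsorted(sorted_index, start, side="left"))
    # side="right" places `hi` one past the last occurrence of `stop`,
    # which makes the label `stop` itself part of the result.
    hi = int(np.searchsorted(sorted_index, stop, side="right"))
    return lo, hi


SIZE = 100_000  # much smaller than the benchmark, same index layout
index = SIZE + np.arange(SIZE)
values = np.random.rand(SIZE, 2)

lo, hi = label_slice_bounds(index, SIZE, SIZE + 5000)
rows = values[lo:hi]  # positional slice equivalent to the label slice
print(lo, hi, rows.shape)
```

Because both endpoint labels are included, the slice covers 5001 rows rather than 5000.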

Fixes:

#1494
#1513
#1731
#1459
#1444

shwina · 2019-05-06T16:11:09Z

Note that for the scalar index case it's possible to just implement loc as a combination of compare+gather:

In [2]: a = cudf.Series(['one', 'two', 'three'], index=[1, 2, 3])

In [3]: print(a[a.index == 1])
1    one
dtype: object

However, this will need to wait for the Cythonized gather (#1604) to work for all index types. Until then, I have implemented the placeholder find_first_value functions.
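To make the two approaches concrete, here is a minimal sketch in NumPy (standing in for device arrays). `loc_scalar`, `first_position`, and `last_position` are hypothetical names used only for illustration; they are not the signatures of the `find_first_value` helpers added in this PR.

```python
import numpy as np


def loc_scalar(index, values, label):
    """Scalar lookup as compare + gather (illustrative only)."""
    mask = index == label              # compare: boolean mask over the index
    positions = np.flatnonzero(mask)   # positions where the label matches
    if positions.size == 0:
        raise KeyError(label)
    # gather: pull the matching index labels and values by position
    return index.take(positions), values.take(positions)


def first_position(index, label):
    """Position of the first occurrence of `label` (hypothetical helper)."""
    positions = np.flatnonzero(index == label)
    if positions.size == 0:
        raise KeyError(label)
    return int(positions[0])


def last_position(index, label):
    """Position of the last occurrence of `label` (hypothetical helper)."""
    positions = np.flatnonzero(index == label)
    if positions.size == 0:
        raise KeyError(label)
    return int(positions[-1])


index = np.array([1, 2, 3])
values = np.array(["one", "two", "three"])
labels, found = loc_scalar(index, values, 1)
print(labels, found)

dup_index = np.array([1, 2, 2, 3])
print(first_position(dup_index, 2), last_position(dup_index, 2))
```

The compare + gather form returns every match, while the first/last form only yields boundary positions, which is enough to turn a label slice into a positional one.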

python/cudf/dataframe/index.py

kkraus14

Mostly needs to handle the null cases; otherwise, additional changes can be done in a follow-up PR.

python/cudf/dataframe/numerical.py

python/cudf/dataframe/series.py

python/cudf/dataframe/string.py

kkraus14 · 2019-05-16T16:31:59Z

python/cudf/indexing.py

+            if len(arg) == 0:
+                arg = Series(np.array([], dtype='int32'))
+            else:
+                arg = Series(arg)


We could likely operate better against a column than a Series here since I assume this input argument is converted to a Series just to move it to device. That way there isn't any unnecessary index handling when creating these Series objects.

See comment below
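One plausible reason for the empty-list branch in the diff above (an assumption, not something stated in the thread): NumPy gives an empty array a `float64` dtype by default, and float arrays cannot be used as positional indexers. A small NumPy-only sketch, with `as_index_array` as a hypothetical helper name:

```python
import numpy as np


def as_index_array(arg):
    """Convert a list-like indexer to an array usable for positional lookup.

    Hypothetical helper mirroring the branch in the diff: an empty input
    gets an explicit integer dtype instead of NumPy's float64 default.
    """
    if len(arg) == 0:
        return np.array([], dtype="int32")
    return np.asarray(arg)


default_dtype = np.array([]).dtype   # float64: not valid as an indexer
empty = as_index_array([])
data = np.arange(5)
selected = data.take(empty)          # works because the dtype is integral
print(default_dtype, empty.dtype, selected)
```

Working on a bare array like this also avoids building an index for the indexer itself, which is the overhead the review comment points at.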

python/cudf/indexing.py

kkraus14 · 2019-05-16T17:05:04Z

python/cudf/indexing.py

+            or is_datetime_or_timedelta_dtype(val)
+            or isinstance(val, pd.Timestamp)
+            or isinstance(val, pd.Categorical)
+    )


May want to move this to utils so it can be reused elsewhere
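As a sketch of what such a shared utility might look like: `is_single_value` is a hypothetical name, and because only the tail of the original predicate is visible in the diff, the leading checks below are assumptions. The `pd.Categorical` check is kept only because it appears in the diff.

```python
import numbers

import numpy as np
import pandas as pd


def is_single_value(val):
    """Return True if `val` should be treated as one label, not a collection.

    Hypothetical utility; only the Timestamp and Categorical checks are
    taken from the diff, the rest is assumed for illustration.
    """
    return (
        isinstance(val, (str, numbers.Number))
        or isinstance(val, np.generic)          # NumPy scalars, incl. datetime64
        or isinstance(val, (pd.Timestamp, pd.Timedelta))
        or isinstance(val, pd.Categorical)      # mirrors the check in the diff
    )


print(is_single_value(3), is_single_value([1, 2]))
```

Callers in the indexing code could then branch on this predicate before deciding between a scalar lookup and a list or slice lookup.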

kkraus14

Will need some follow-ups for cleanup and optimization, but this vastly improves the current state. Great work @shwina!

harrism · 2019-05-21T01:55:56Z

@shwina I can approve but you need to do a merge to resolve the conflicts because I don't want to mess it up.

harrism

No C++ changes...

shwina · 2019-05-21T15:01:38Z

@kkraus14 @harrism

Conflicts resolved and tests passing

shwina added 9 commits May 3, 2019 08:01

Add find_first to cudautils

13298d6

Add a basic Series.loc

7f4a4a2

Add simple slice input to Series.loc

2776ff4

Fix Series.loc behaviour on slices and add simple tests for loc

a83b283

Add find_first to cudautils

755082b

Add a basic Series.loc

b81e81a

Add simple slice input to Series.loc

9ff4d1d

Fix Series.loc behaviour on slices and add simple tests for loc

eb7bdcb

Merge branch 'improve-series-indexing' of https://github.com/shwina/cudf into improve-series-indexing

bcd56ed

shwina requested a review from a team as a code owner May 3, 2019 22:10

shwina added 5 commits May 4, 2019 11:18

Move GPU versions of find_first_value and find_last_value into NumericalColumn

65c72ba

Add find_first_value and find_last_value for string columns

4df9bf3

Begin work on loc for StringIndex

d3b3789

Merge branch 'branch-0.8' of https://github.com/rapidsai/cudf into improve-series-indexing

04ec008

Add CHANGELOG

0fdb4e9

shwina added 14 commits May 6, 2019 09:13

Replace use of to_gpu_array() with mem

7e6a558

Simplify string _find_first_and_last_value

80f443d

Add StringColumn.normalize_binop_value

9dfd39d

Add support for loc with DatetimeIndex

f165e0b

Allow creating buffer from an empty list

e5abd2e

Add initial support for loc[] with list input

938d3a4

Undo a change to DatetimeColumn.fillna() that broke things

9160d5a

Fix various issues introduced while implementing loc[]

6142207

Allow creating DatetimeColumn from timestamps of any resolution

6a1e5cd

Allow creating DatetimeIndex from lists and tuples

8a3031d

Improvements to Series.loc arg handling

d849fb1

Split up tests for Series.loc and add a few more test cases

2f675a7

Support slice in Series.loc

4dfbebe

Add take() method for StringIndex

416ede0

kkraus14 reviewed May 16, 2019

python/cudf/dataframe/index.py

kkraus14 requested changes May 16, 2019

shwina added 8 commits May 16, 2019 10:20

Merge branch 'branch-0.8' of https://github.com/rapidsai/cudf into improve-series-indexing

Conflicts: CHANGELOG.md

1d21199

Add assertions for null_count when creating indexes

cff72cd

Fix typo

a45c19e

Fix typo

124da80

Merge branch 'branch-0.8' of https://github.com/rapidsai/cudf into improve-series-indexing

ff8a110

Merge branch 'branch-0.8' of https://github.com/rapidsai/cudf into improve-series-indexing

825b7f9

Merge branch 'improve-series-indexing' of https://github.com/shwina/cudf into improve-series-indexing

03a29b2

Update docstring for Series.append

c4308e9

shwina requested a review from kkraus14 May 16, 2019 20:23

kkraus14 approved these changes May 17, 2019

kkraus14 added the "5 - Ready to Merge" label (Testing and reviews complete, ready to merge) and removed the "3 - Ready for Review" label (Ready for review by team) May 17, 2019

harrism approved these changes May 21, 2019

Merge branch 'branch-0.8' of https://github.com/rapidsai/cudf into improve-series-indexing

Conflicts: CHANGELOG.md, python/cudf/dataframe/column.py

a6d4303

shwina mentioned this pull request May 21, 2019

[REVIEW] Add Series.dropna() #1807

Merged

harrism merged commit ff31727 into rapidsai:branch-0.8 May 21, 2019

thomcom mentioned this pull request May 24, 2019

[BUG] Groupby aggregations on multiple columns fail with second column not found rapidsai/dask-cudf#125

Closed

eriknw mentioned this pull request Oct 26, 2022

Fix bug where df.loc resulting in single row could give wrong index #11998

Merged
