Use list of column inputs for `apply_boolean_mask` #9832

isVoid · 2021-12-02T22:57:49Z

This PR brings the changes from #9558 to `apply_boolean_mask` and removes the `as_frame` -> `as_column` round trip. Benchmark results for the column method:

------------------------------------- benchmark 'col0': 2 tests -------------------------------------
Name (time in us)                               Min                 Max                Mean          
-----------------------------------------------------------------------------------------------------
column_apply_boolean_mask[col0] (afte)      87.0090 (1.0)      132.8980 (1.0)       95.8815 (1.0)    
column_apply_boolean_mask[col0] (befo)     210.4580 (2.42)     307.8270 (2.32)     225.4821 (2.35)   
-----------------------------------------------------------------------------------------------------

------------------------------------- benchmark 'col1': 2 tests -------------------------------------
Name (time in us)                               Min                 Max                Mean          
-----------------------------------------------------------------------------------------------------
column_apply_boolean_mask[col1] (afte)      74.2240 (1.0)      110.0600 (1.0)       75.6356 (1.0)    
column_apply_boolean_mask[col1] (befo)     172.5240 (2.32)     278.5250 (2.53)     176.5672 (2.33)   
-----------------------------------------------------------------------------------------------------

------------------------------------- benchmark 'col2': 2 tests -------------------------------------
Name (time in us)                               Min                 Max                Mean          
-----------------------------------------------------------------------------------------------------
column_apply_boolean_mask[col2] (afte)     101.5740 (1.0)      141.8850 (1.0)      110.2334 (1.0)    
column_apply_boolean_mask[col2] (befo)     234.1140 (2.30)     312.7140 (2.20)     245.5453 (2.23)   
-----------------------------------------------------------------------------------------------------

------------------------------------- benchmark 'col3': 2 tests -------------------------------------
Name (time in us)                               Min                 Max                Mean          
-----------------------------------------------------------------------------------------------------
column_apply_boolean_mask[col3] (afte)      88.7710 (1.0)      142.7500 (1.0)       90.5082 (1.0)    
column_apply_boolean_mask[col3] (befo)     195.0980 (2.20)     303.1020 (2.12)     199.8368 (2.21)   
-----------------------------------------------------------------------------------------------------

DataFrame benchmark:

----------------------------------- benchmark '100': 2 tests -----------------------------------
Name (time in us)                          Min                 Max                Mean          
------------------------------------------------------------------------------------------------
df_apply_boolean_mask[100] (afte)     380.6770 (1.05)     654.7080 (1.18)     389.3374 (1.03)   
df_apply_boolean_mask[100] (befo)     362.3220 (1.0)      554.6130 (1.0)      378.7087 (1.0)    
------------------------------------------------------------------------------------------------

----------------------------------- benchmark '10000': 2 tests -----------------------------------
Name (time in us)                            Min                 Max                Mean          
--------------------------------------------------------------------------------------------------
df_apply_boolean_mask[10000] (afte)     399.5240 (1.05)     461.6310 (1.0)      405.1225 (1.04)   
df_apply_boolean_mask[10000] (befo)     379.4080 (1.0)      564.5770 (1.22)     389.6990 (1.0)    
--------------------------------------------------------------------------------------------------
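For readers unfamiliar with the round trip mentioned above, here is a minimal pure-Python sketch of the idea. It is not cuDF code: `ToyFrame`, `mask_column_round_trip`, `mask_column_direct`, and this stand-in `apply_boolean_mask` are hypothetical names invented for illustration, and plain lists stand in for device columns. It only shows why handing a list of columns directly to the masking routine avoids building and unpacking a temporary frame.

```python
from typing import List, Sequence

Column = List[int]  # hypothetical stand-in for a cuDF column


def apply_boolean_mask(columns: Sequence[Column], mask: Sequence[bool]) -> List[Column]:
    """Toy stand-in for the Cython API: filter every column with the same mask."""
    return [[value for value, keep in zip(col, mask) if keep] for col in columns]


class ToyFrame:
    """Toy frame holding named columns (illustration only, not cuDF's Frame)."""

    def __init__(self, data):
        self.data = dict(data)

    def to_columns(self) -> List[Column]:
        return list(self.data.values())

    @classmethod
    def from_columns(cls, columns, names):
        return cls(zip(names, columns))


def mask_column_round_trip(col: Column, mask: Sequence[bool]) -> Column:
    # "Before" shape: wrap the column in a frame, mask the frame, unwrap the result.
    frame = ToyFrame({None: col})
    masked = apply_boolean_mask(frame.to_columns(), mask)
    return ToyFrame.from_columns(masked, [None]).data[None]


def mask_column_direct(col: Column, mask: Sequence[bool]) -> Column:
    # "After" shape: pass a one-element list of columns, no temporary frame.
    return apply_boolean_mask([col], mask)[0]
```

Both helpers return the same values; the direct version simply skips the wrapper objects, which is the kind of overhead the column benchmarks above are measuring.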

codecov · 2021-12-03T00:34:28Z

Codecov Report

Merging #9832 (33be855) into branch-22.02 (967a333) will decrease coverage by 0.07%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##           branch-22.02    #9832      +/-   ##
================================================
- Coverage         10.49%   10.41%   -0.08%     
================================================
  Files               119      119              
  Lines             20305    20484     +179     
================================================
+ Hits               2130     2134       +4     
- Misses            18175    18350     +175

Impacted Files	Coverage Δ
python/custreamz/custreamz/kafka.py	`29.16% <0.00%> (-0.63%)`	⬇️
python/dask_cudf/dask_cudf/sorting.py	`92.30% <0.00%> (-0.61%)`	⬇️
python/cudf/cudf/__init__.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/frame.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/index.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/io/parquet.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/series.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/utils/utils.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/utils/dtypes.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/utils/ioutils.py	`0.00% <0.00%> (ø)`
... and 19 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

python/cudf/cudf/core/_base_index.py

vyasr

I don't have much to add here; the code looks good. However, the methods in `BaseIndex` and `IndexedFrame` are identical aside from passing the index columns to `_from_columns`, so while I think we can move ahead with this for now, we should keep thinking about how to abstract away the selection of columns to pass to the Cython APIs. Once all Cython APIs are transitioned to the list-of-columns approach, I think that's the next frontier in simplifying these code paths. We've discussed it before but have not come up with a satisfactory resolution, IIRC.
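One possible shape for the abstraction discussed above, sketched in plain Python under stated assumptions. None of these names exist in cuDF: `MaskMixin`, `_columns_to_pass`, `_rebuild`, `ToyIndex`, and `ToyIndexedFrame` are hypothetical, and lists stand in for columns. The idea is that the shared method lives in one place, and each class only declares which columns it hands over and how it reassembles the result. This is an illustration of the problem, not a proposed resolution.

```python
from typing import Dict, List, Sequence

Column = List[int]  # hypothetical stand-in for a cuDF column


def apply_boolean_mask(columns: Sequence[Column], mask: Sequence[bool]) -> List[Column]:
    """Toy stand-in for the Cython API: filter every column with the same mask."""
    return [[value for value, keep in zip(col, mask) if keep] for col in columns]


class MaskMixin:
    """Shared logic; subclasses only choose the columns and rebuild the result."""

    def _columns_to_pass(self) -> List[Column]:
        raise NotImplementedError

    def _rebuild(self, columns: List[Column]):
        raise NotImplementedError

    def apply_mask(self, mask: Sequence[bool]):
        return self._rebuild(apply_boolean_mask(self._columns_to_pass(), mask))


class ToyIndex(MaskMixin):
    def __init__(self, values: Sequence[int]):
        self.values = list(values)

    def _columns_to_pass(self) -> List[Column]:
        return [self.values]

    def _rebuild(self, columns: List[Column]) -> "ToyIndex":
        return ToyIndex(columns[0])


class ToyIndexedFrame(MaskMixin):
    def __init__(self, index: Sequence[int], data: Dict[str, Sequence[int]]):
        self.index = list(index)
        self.data = {name: list(col) for name, col in data.items()}

    def _columns_to_pass(self) -> List[Column]:
        # The index column travels with the data columns so both stay aligned.
        return [self.index, *self.data.values()]

    def _rebuild(self, columns: List[Column]) -> "ToyIndexedFrame":
        index, *data_columns = columns
        return ToyIndexedFrame(index, dict(zip(self.data, data_columns)))
```

In this toy version the only per-class differences are the two small hooks, which mirrors the observation that the real methods differ only in how the index columns are handled.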

isVoid · 2022-01-11T23:16:22Z

@gpucibot merge

initial

012bae5

isVoid requested a review from a team as a code owner December 2, 2021 22:57

isVoid requested review from galipremsagar and brandon-b-miller December 2, 2021 22:57

isVoid added the "non-breaking" label Dec 2, 2021

github-actions bot added the "Python" label Dec 2, 2021

isVoid added the "3 - Ready for Review" and "improvement" labels and removed the "Python" label Dec 2, 2021

isVoid self-assigned this Dec 2, 2021

ttnghia added the "Python" label Dec 3, 2021

shwina reviewed Dec 3, 2021

python/cudf/cudf/core/_base_index.py

galipremsagar approved these changes Dec 7, 2021

isVoid requested a review from shwina December 15, 2021 00:35

isVoid added 2 commits December 14, 2021 16:37

Merge branch 'branch-22.02' of github.com:rapidsai/cudf into improvem…

5f65803

…ent/use_list_of_columns_apply_boolean_mask

Merge branch 'branch-22.02' of github.com:rapidsai/cudf into improvem…

2730d47

…ent/use_list_of_columns_apply_boolean_mask

vyasr approved these changes Jan 6, 2022

isVoid added 2 commits January 7, 2022 20:17

Merge branch 'branch-22.02' of github.com:rapidsai/cudf into improvem…

7d30dc8

…ent/use_list_of_columns_apply_boolean_mask

Merge branch 'branch-22.02' of github.com:rapidsai/cudf into improvem…

33be855

…ent/use_list_of_columns_apply_boolean_mask

isVoid added the "5 - Ready to Merge" label and removed the "3 - Ready for Review" label Jan 11, 2022

rapids-bot bot merged commit 813ac97 into rapidsai:branch-22.02 Jan 11, 2022

This was referenced Jan 14, 2022

Move drop_duplicates, drop_na, _gather, take to IndexFrame and create their _base_index counterparts #9807

Merged

BaseIndex and IndexedFrame has overlapping logics #10068

Open

wence- mentioned this pull request Oct 26, 2022

Fix bug where df.loc resulting in single row could give wrong index #11998

Merged

3 tasks
