Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Add Series.loc and DataFrame.loc #1622

Merged
merged 69 commits into from
May 21, 2019

Conversation

shwina
Copy link
Contributor

@shwina shwina commented May 3, 2019

🔥 😤 MAKE INDEXING IN CUDF GREAT AGAIN 😤 🔥

This PR brings changes and improvements to make indexing in cuDF more performant, and feel more natural and Pandas-like.

cuDF (latest):

In [4]: print(cudf.__version__)                                                                                                                                                         
0.7.1+0.g7ee0348.dirty

In [5]: SIZE = 100_000_000                                                                                                                                                              
In [6]: df = pd.DataFrame(np.random.rand(SIZE, 2), index=(SIZE+np.arange(SIZE)))                                                                                                       
In [7]: gdf = cudf.from_pandas(df)                                                                                                                                                      
In [8]: %timeit gdf.loc[SIZE:SIZE+5000]                                                                                                                                                 
1.27 s +- 46.9 ms per loop (mean +- std. dev. of 7 runs, 1 loop each)

0.8 (this PR):

In [4]: print(cudf.__version__)                                                            
0.8.0a+1044.g61898e6.dirty

In [5]: SIZE = 100_000_000                                                                 
In [6]: df = pd.DataFrame(np.random.rand(SIZE, 2), index=(SIZE+np.arange(SIZE)))           
In [7]: gdf = cudf.from_pandas(df)                                                         
In [8]: %timeit gdf.loc[SIZE:SIZE+5000]                                                    
77.9 ms ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Fixes:

#1494
#1513
#1731
#1459
#1444

@shwina shwina requested a review from a team as a code owner May 3, 2019 22:10
@shwina
Copy link
Contributor Author

shwina commented May 6, 2019

Note that for the scalar index case it's possible to just implement loc as a combination of compare+gather:

In [2]: a = cudf.Series(['one', 'two', 'three'], index=[1, 2, 3])                                                                                                                       

In [3]: print(a[a.index == 1])                                                                                                                                                          
1    one
dtype: object

However, this will need to wait for the Cythonized gather (#1604) to work for all index types. Until then, I have implemented the placeholder find_first_value functions.

Copy link
Collaborator

@kkraus14 kkraus14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly needs to handle the null cases, otherwise additional changes can be done in a follow up PR

python/cudf/dataframe/numerical.py Show resolved Hide resolved
python/cudf/dataframe/numerical.py Show resolved Hide resolved
python/cudf/dataframe/series.py Outdated Show resolved Hide resolved
python/cudf/dataframe/string.py Show resolved Hide resolved
if len(arg) == 0:
arg = Series(np.array([], dtype='int32'))
else:
arg = Series(arg)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could likely operate better against a column than a Series here since I assume this input argument is converted to a Series just to move it to device. That way there isn't any unnecessary index handling when creating these Series objects.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment below

python/cudf/indexing.py Show resolved Hide resolved
python/cudf/indexing.py Show resolved Hide resolved
or is_datetime_or_timedelta_dtype(val)
or isinstance(val, pd.Timestamp)
or isinstance(val, pd.Categorical)
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May want to move this to utils so it can be reused elsewhere

@shwina shwina requested a review from kkraus14 May 16, 2019 20:23
Copy link
Collaborator

@kkraus14 kkraus14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will need some follow ups for cleanup / optimization, but vastly improves the current state. Great work @shwina!

@kkraus14 kkraus14 added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels May 17, 2019
@harrism
Copy link
Member

harrism commented May 21, 2019

@shwina I can approve but you need to do a merge to resolve the conflicts because I don't want to mess it up.

Copy link
Member

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No C++ changes...

…prove-series-indexing

# Conflicts:
#	CHANGELOG.md
#	python/cudf/dataframe/column.py
@shwina
Copy link
Contributor Author

shwina commented May 21, 2019

@kkraus14 @harrism

Conflicts resolved and tests passing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants