Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dispatch frame methods to series versions instead of re-implementing masking etc #19611

Merged
merged 13 commits into from
Feb 18, 2018

Conversation

jbrockmendel
Copy link
Member

This moves some of DataFrame's dispatching logic into ops, mostly in ways that don't change any logic.

The place that logic is changed is in DataFrame._combine_frame where instead of defining arith_op to mask func and wrapping {col: arith_op(this[col].values, other[col].values) for col in this.columns}, we dispatch directly to the Series methods and wrap {col: func(this[col], other[col]) for col in this.columns}

def _combine_match_index(self, other, func, level=None):
left, right = self.align(other, join='outer', axis=0, level=level,
copy=False)
return self._constructor(func(left.values.T, right.values).T,
new_data = func(left.values.T, right.values).T
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can do better than operating on left.values.T and right.values. Especially with mixed-dtypes this seems like low-hanging fruit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

most of these types of operations need to be pushed down to internals via block operations. this should dispatch to ._data.eval to do this (like the other match functions)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yah, but I'm not there yet. These particular cases are tricky because SparseDataFrame implements _combine_match_foo differently

result = self._constructor(result, index=new_index, copy=False)
result.columns = new_columns
return result
return result

else:
result = _arith_op(this.values, other.values)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to dispatch to Series here, but presumably doing the operation once instead of per-column has a perf benefit. One idea that didn't work on the first try was to ravel() this.values and other.values, wrap them in Series, then operate and re-shape.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is pretty tricky, yes that would work for most ufuncs, but is not generally applicable e.g. concat is a ufunc too.
so hold off on this as other low hanging fruit w/o opening up this box.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more to the point is to push this down to the block manager for execution which handles the mixed dtypes case, this is just a special case of that.

@codecov
Copy link

codecov bot commented Feb 9, 2018

Codecov Report

Merging #19611 into master will decrease coverage by <.01%.
The diff coverage is 94.23%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #19611      +/-   ##
==========================================
- Coverage   91.58%   91.57%   -0.01%     
==========================================
  Files         150      150              
  Lines       48867    48856      -11     
==========================================
- Hits        44755    44741      -14     
- Misses       4112     4115       +3
Flag Coverage Δ
#multiple 89.95% <94.23%> (-0.01%) ⬇️
#single 41.75% <40.38%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/frame.py 97.22% <100%> (+0.06%) ⬆️
pandas/core/indexes/base.py 96.45% <93.33%> (-0.02%) ⬇️
pandas/core/ops.py 96.44% <93.33%> (-0.39%) ⬇️
pandas/util/testing.py 83.64% <0%> (-0.21%) ⬇️
pandas/core/series.py 94.56% <0%> (-0.01%) ⬇️
pandas/core/indexes/api.py 98.78% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2fdf1e2...118cd5d. Read the comment docs.

def _combine_match_index(self, other, func, level=None):
left, right = self.align(other, join='outer', axis=0, level=level,
copy=False)
return self._constructor(func(left.values.T, right.values).T,
new_data = func(left.values.T, right.values).T
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

most of these types of operations need to be pushed down to internals via block operations. this should dispatch to ._data.eval to do this (like the other match functions)

@jreback
Copy link
Contributor

jreback commented Feb 10, 2018

The place that logic is changed is in DataFrame._combine_frame where instead of defining arith_op to mask func and wrapping {col: arith_op(this[col].values, other[col].values) for col in this.columns}, we dispatch directly to the Series methods and wrap {col: func(this[col], other[col]) for col in this.columns}

does this change anything from a user pov?

@jreback jreback added Numeric Operations Arithmetic, Comparison, and Logical operations Clean labels Feb 10, 2018
@jbrockmendel
Copy link
Member Author

does this change anything from a user pov?

No, but I may have made a mistake here. Pls put a pin in this until I double-check.

@jbrockmendel
Copy link
Member Author

I did in fact make a mistake, but there are no existing tests that catch it. Next push will fix both of these.

@jbrockmendel
Copy link
Member Author

Will hold off on this until #19613 is closed.

@@ -998,6 +998,33 @@ def flex_wrapper(self, other, level=None, fill_value=None, axis=0):
# -----------------------------------------------------------------------------
# DataFrame

def _combine_series_frame(self, other, func, fill_value=None, axis=None,
level=None, try_cast=True):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a doc-string

@jreback
Copy link
Contributor

jreback commented Feb 13, 2018

rebase

@jreback
Copy link
Contributor

jreback commented Feb 15, 2018

did you have another PR which touched this code and we wanted to merge first? or is this one ok? if so rebase

@jbrockmendel
Copy link
Member Author

There was, and it was merged a few days ago. #19613. This should be good to go (after rebase)

@jbrockmendel
Copy link
Member Author

jbrockmendel commented Feb 15, 2018

#19649 and #19582 touch nearby code, but distinct enough that it shouldn't cause a problem.

@jbrockmendel jbrockmendel mentioned this pull request Feb 16, 2018
4 tasks
invalid_op : function
"""
def invalid_op(self, other=None):
raise TypeError("cannot perform {name} with this index type: "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you might need to parameterize this on the work 'index' but not sure (IOW can it be a Series)? or maybe just remove the word index ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ATM it is only used for indexes. I could add an assertion that self is an Index so that we know to change it if/when the wording ceases to be accurate.

self : DataFrame
other : Series
func : binary operator
fill_value : object (default None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use the format
fill_value : object, default None

no parens

.format(fill=fill_value))

if axis is not None:
axis = self._get_axis_name(axis)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we normally use axis numbers internally (and not names). I agree its a bit easier on the eyes to do this, but we should be consistent (so revert here, ok to change it all over the base in another PR).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. This isn't a change from the current implementation, just moving it from frame. Will change.

@@ -72,6 +72,23 @@ def test_tz_aware_scalar_comparison(self, timestamps):
# -------------------------------------------------------------------
# Arithmetic

class TestFrameFlexArithmetic(object):
def test_df_add_flex_filled_mixed_dtypes(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this represent a change in user facing API?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, just a case that is not currently tested. I made a mistake in an early commit in this PR; this test would have caught it.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small comments for the future

result = self._constructor(result, index=new_index, copy=False)
result.columns = new_columns
return result
return result

else:
result = _arith_op(this.values, other.values)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is pretty tricky, yes that would work for most ufuncs, but is not generally applicable e.g. concat is a ufunc too.
so hold off on this as other low hanging fruit w/o opening up this box.

result = self._constructor(result, index=new_index, copy=False)
result.columns = new_columns
return result
return result

else:
result = _arith_op(this.values, other.values)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more to the point is to push this down to the block manager for execution which handles the mixed dtypes case, this is just a special case of that.

@jreback jreback added this to the 0.23.0 milestone Feb 18, 2018
@jreback jreback merged commit 64e155c into pandas-dev:master Feb 18, 2018
@jreback
Copy link
Contributor

jreback commented Feb 18, 2018

thanks

@jbrockmendel jbrockmendel deleted the ops-frames branch February 18, 2018 18:23
harisbal pushed a commit to harisbal/pandas that referenced this pull request Feb 28, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Clean Numeric Operations Arithmetic, Comparison, and Logical operations
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants