-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
dispatch frame methods to series versions instead of re-implementing masking etc #19611
Conversation
def _combine_match_index(self, other, func, level=None): | ||
left, right = self.align(other, join='outer', axis=0, level=level, | ||
copy=False) | ||
return self._constructor(func(left.values.T, right.values).T, | ||
new_data = func(left.values.T, right.values).T |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can do better than operating on left.values.T
and right.values
. Especially with mixed-dtypes this seems like low-hanging fruit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
most of these types of operations need to be pushed down to internals via block operations. this should dispatch to ._data.eval to do this (like the other match functions)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yah, but I'm not there yet. These particular cases are tricky because SparseDataFrame implements _combine_match_foo
differently
result = self._constructor(result, index=new_index, copy=False) | ||
result.columns = new_columns | ||
return result | ||
return result | ||
|
||
else: | ||
result = _arith_op(this.values, other.values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to dispatch to Series here, but presumably doing the operation once instead of per-column has a perf benefit. One idea that didn't work on the first try was to ravel() this.values
and other.values
, wrap them in Series
, then operate and re-shape.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is pretty tricky, yes that would work for most ufuncs, but is not generally applicable e.g. concat is a ufunc too.
so hold off on this as other low hanging fruit w/o opening up this box.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
more to the point is to push this down to the block manager for execution which handles the mixed dtypes case, this is just a special case of that.
Codecov Report
@@ Coverage Diff @@
## master #19611 +/- ##
==========================================
- Coverage 91.58% 91.57% -0.01%
==========================================
Files 150 150
Lines 48867 48856 -11
==========================================
- Hits 44755 44741 -14
- Misses 4112 4115 +3
Continue to review full report at Codecov.
|
def _combine_match_index(self, other, func, level=None): | ||
left, right = self.align(other, join='outer', axis=0, level=level, | ||
copy=False) | ||
return self._constructor(func(left.values.T, right.values).T, | ||
new_data = func(left.values.T, right.values).T |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
most of these types of operations need to be pushed down to internals via block operations. this should dispatch to ._data.eval to do this (like the other match functions)
does this change anything from a user pov? |
No, but I may have made a mistake here. Pls put a pin in this until I double-check. |
I did in fact make a mistake, but there are no existing tests that catch it. Next push will fix both of these. |
Will hold off on this until #19613 is closed. |
@@ -998,6 +998,33 @@ def flex_wrapper(self, other, level=None, fill_value=None, axis=0): | |||
# ----------------------------------------------------------------------------- | |||
# DataFrame | |||
|
|||
def _combine_series_frame(self, other, func, fill_value=None, axis=None, | |||
level=None, try_cast=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a doc-string
rebase |
did you have another PR which touched this code and we wanted to merge first? or is this one ok? if so rebase |
There was, and it was merged a few days ago. #19613. This should be good to go (after rebase) |
invalid_op : function | ||
""" | ||
def invalid_op(self, other=None): | ||
raise TypeError("cannot perform {name} with this index type: " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you might need to parameterize this on the work 'index' but not sure (IOW can it be a Series)? or maybe just remove the word index ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ATM it is only used for indexes. I could add an assertion that self is an Index so that we know to change it if/when the wording ceases to be accurate.
pandas/core/ops.py
Outdated
self : DataFrame | ||
other : Series | ||
func : binary operator | ||
fill_value : object (default None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use the format
fill_value : object, default None
no parens
pandas/core/ops.py
Outdated
.format(fill=fill_value)) | ||
|
||
if axis is not None: | ||
axis = self._get_axis_name(axis) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we normally use axis numbers internally (and not names). I agree its a bit easier on the eyes to do this, but we should be consistent (so revert here, ok to change it all over the base in another PR).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. This isn't a change from the current implementation, just moving it from frame. Will change.
@@ -72,6 +72,23 @@ def test_tz_aware_scalar_comparison(self, timestamps): | |||
# ------------------------------------------------------------------- | |||
# Arithmetic | |||
|
|||
class TestFrameFlexArithmetic(object): | |||
def test_df_add_flex_filled_mixed_dtypes(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this represent a change in user facing API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, just a case that is not currently tested. I made a mistake in an early commit in this PR; this test would have caught it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small comments for the future
result = self._constructor(result, index=new_index, copy=False) | ||
result.columns = new_columns | ||
return result | ||
return result | ||
|
||
else: | ||
result = _arith_op(this.values, other.values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is pretty tricky, yes that would work for most ufuncs, but is not generally applicable e.g. concat is a ufunc too.
so hold off on this as other low hanging fruit w/o opening up this box.
result = self._constructor(result, index=new_index, copy=False) | ||
result.columns = new_columns | ||
return result | ||
return result | ||
|
||
else: | ||
result = _arith_op(this.values, other.values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
more to the point is to push this down to the block manager for execution which handles the mixed dtypes case, this is just a special case of that.
thanks |
This moves some of DataFrame's dispatching logic into
ops
, mostly in ways that don't change any logic.The place that logic is changed is in
DataFrame._combine_frame
where instead of definingarith_op
to maskfunc
and wrapping{col: arith_op(this[col].values, other[col].values) for col in this.columns}
, we dispatch directly to the Series methods and wrap{col: func(this[col], other[col]) for col in this.columns}