BUG: DataFrame.(any|all) inconsistency #34918
-        return self._combine([b for b in self.blocks if b.is_bool], copy)
+        # Note: use is_bool_dtype instead of blk.is_bool to exclude
+        # object-dtype blocks containing all-bool entries.
+        return self._combine([b for b in self.blocks if is_bool_dtype(b.dtype)], copy)
this may actually be a bug, since it implies the following inconsistent behavior in master:
>>> ser = pd.Series([True, False, True], dtype=object)
>>> ser2 = pd.Series(["A", "B", "C"])
>>> df = ser.to_frame("A")
>>> df._get_bool_data()
       A
0   True
1  False
2   True
>>> df["B"] = ser2
>>> df._get_bool_data()
Empty DataFrame
Columns: []
Index: [0, 1, 2]
Adding columns shouldn't make get_bool_data smaller.
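To make the mechanism concrete, here is a toy model in plain Python. It is not pandas internals: the `(dtype name, values)` tuples and the helper names `consolidate`, `bool_columns_by_values`, and `bool_columns_by_dtype` are all hypothetical stand-ins. It assumes that object columns end up in one shared block and that the old check looks at the values of the whole block, which is what the example above suggests.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for a column: (dtype name, values).
Column = Tuple[str, list]


def consolidate(columns: Dict[str, Column]) -> Dict[str, List[str]]:
    """Group column names by dtype, mimicking one block per dtype."""
    blocks: Dict[str, List[str]] = {}
    for name, (dtype, _values) in columns.items():
        blocks.setdefault(dtype, []).append(name)
    return blocks


def bool_columns_by_values(columns: Dict[str, Column]) -> List[str]:
    """Value-based selection: keep a block only if all its values are bools."""
    keep: List[str] = []
    for _dtype, names in consolidate(columns).items():
        values = [v for name in names for v in columns[name][1]]
        if all(isinstance(v, bool) for v in values):
            keep.extend(names)
    return keep


def bool_columns_by_dtype(columns: Dict[str, Column]) -> List[str]:
    """Dtype-based selection: keep a column only if its dtype is bool."""
    return [name for name, (dtype, _v) in columns.items() if dtype == "bool"]


frame: Dict[str, Column] = {"A": ("object", [True, False, True])}
before = bool_columns_by_values(frame)

# Adding a string column puts non-bool values into the shared object block.
frame["B"] = ("object", ["A", "B", "C"])
after = bool_columns_by_values(frame)

print(before, after)                 # selection shrinks after adding "B"
print(bool_columns_by_dtype(frame))  # dtype-based answer ignores neighbours
```

In this model the value-based check drops "A" as soon as "B" joins its block, while the dtype-based check gives the same answer for "A" with or without "B".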
Can you add a test for this?
test added
…f-consolidate-equals
@@ -688,8 +689,9 @@ def get_bool_data(self, copy: bool = False) -> "BlockManager":
        copy : bool, default False
            Whether to copy the blocks
        """
        self._consolidate_inplace()
I think consolidating here might be the better option, as it at least ensures consistent behaviour when you have multiple columns, independent of consolidation status. The inconsistency between a single column and multiple columns is still present of course, but IMO there is nothing to do about this given our consolidated blocks (well, we could deprecate object dtype being regarded as bool ..)
> well, we could deprecate object dtype being regarded as bool
This PR does that (well, changes it outright rather than deprecating). AFAICT that's the only way to make the behavior independent of the presence of another object-dtype-but-not-bool-like column.
IMO we should deprecate this first
The only non-test place where get_bool_data is called is DataFrame._reduce, which I know you've been working on recently. The topic of how to handle object-dtype blocks has come up there, too. If we end up handling object-dtype blocks column-wise there, that would render this distinction irrelevant.
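A rough sketch of what "column-wise" handling could mean, again as a plain-Python toy rather than anything from pandas: each object column is inspected on its own, so its neighbours cannot change the outcome. The `(dtype name, values)` representation and the function name `bool_columns_columnwise` are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for a column: (dtype name, values).
Column = Tuple[str, list]


def bool_columns_columnwise(columns: Dict[str, Column]) -> List[str]:
    """Decide per column, never per block, whether it counts as boolean."""
    keep: List[str] = []
    for name, (dtype, values) in columns.items():
        if dtype == "bool":
            keep.append(name)
        elif dtype == "object" and all(isinstance(v, bool) for v in values):
            # Object column whose entries all happen to be bools.
            keep.append(name)
    return keep


frame: Dict[str, Column] = {
    "A": ("object", [True, False, True]),
    "B": ("object", ["A", "B", "C"]),
    "C": ("bool", [True, True, False]),
}
selected = bool_columns_columnwise(frame)

# Removing "B" does not change whether "A" is selected.
without_b = {k: v for k, v in frame.items() if k != "B"}
selected_without_b = bool_columns_columnwise(without_b)

print(selected, selected_without_b)
```

The trade-off in this sketch is that it still infers bool-ness from values, so it stays consistent across columns but keeps treating object dtype as potentially boolean.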
> IMO we should deprecate this first
What exactly are you suggesting we deprecate? Do you have an example of something that breaks with this change?
@jorisvandenbossche can you respond here re what specific behavior you'd like to deprecate?
pandas/core/internals/managers.py
Outdated
@@ -698,7 +700,6 @@ def get_numeric_data(self, copy: bool = False) -> "BlockManager":
        copy : bool, default False
            Whether to copy the blocks
        """
        self._consolidate_inplace()
Did you check the impact of this?
df["B"] = ser2

bd2 = df._get_bool_data()
tm.assert_frame_equal(bd1, bd2)
Can you also assert against an expected result that is constructed manually, instead of only ensuring that both _get_bool_data calls give the same result?
AFAICT there are 3 concerns to be balanced here:
I see two possible ways to get 1) consistent with 2): a) drop object-dtype blocks in
I lean towards a) because it is simpler to implement and describe/document. Since it is fixing a bug in a corner case, I don't think a deprecation cycle is needed.
I am also in favor of your option a). Although b) is technically possible (and something we would also get with 1D blocks), I think long term we should simply not regard object dtype columns as boolean or infer if they might be boolean (the same goes for indexing with object dtype bools). But I still think we can deprecate this first instead of directly changing. Yes, it's quite a specific case, but it doesn't seem that complicated to deprecate? (the logic is only in
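For reference, a deprecation along these lines might look roughly like the sketch below. This is a plain-Python toy, not the pandas code path: the function `select_bool_columns`, the `(dtype name, values)` column representation, and the warning text are all hypothetical. The only real API used is the standard `warnings` module.

```python
import warnings
from typing import Dict, List, Tuple

# Hypothetical stand-in for a column: (dtype name, values).
Column = Tuple[str, list]


def select_bool_columns(columns: Dict[str, Column]) -> List[str]:
    """Keep the old value-based behavior for now, but warn when it is used."""
    keep: List[str] = []
    for name, (dtype, values) in columns.items():
        if dtype == "bool":
            keep.append(name)
        elif dtype == "object" and all(isinstance(v, bool) for v in values):
            # Old behavior retained during the deprecation period.
            warnings.warn(
                f"Column {name!r} has object dtype but is treated as boolean; "
                "in a future version only bool-dtype columns will be included.",
                FutureWarning,
                stacklevel=2,
            )
            keep.append(name)
    return keep


frame: Dict[str, Column] = {
    "A": ("object", [True, False]),
    "B": ("bool", [True, True]),
}
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = select_bool_columns(frame)

print(result, [w.category.__name__ for w in caught])
```

Whether this is worth a release cycle for such a corner case is exactly the judgment call being debated here; the sketch only shows that the mechanics are small.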
AFAIK the only things affected are
The corner-ness of it makes me not care that much about deprecate vs change. I marginally lean towards getting it over with because there is a bug fix involved and it will make it easier to simplify
Went through the issues and added Reduction and Nuisance Column labels where appropriate. Reinforced my belief that numeric_only needs to be thrown into the sun.
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.
Rebased. I maintain this bug merits ripping off the bandaid.