Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: DataFrame.(any|all) inconsistency #34918

Closed

Conversation

jbrockmendel
Copy link
Member

No description provided.

return self._combine([b for b in self.blocks if b.is_bool], copy)
# Note: use is_bool_dtype instead of blk.is_bool to exclude
# object-dtype blocks containing all-bool entries.
return self._combine([b for b in self.blocks if is_bool_dtype(b.dtype)], copy)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this may actually be a bug, since it implies the following inconsistent behavior in master:

ser = pd.Series([True, False, True], dtype=object)
ser2 = pd.Series(["A", "B", "C"])

df = ser.to_frame("A")
>>> df._get_bool_data()
       A
0   True
1  False
2   True

df["B"] = ser2
>>> df._get_bool_data()
Empty DataFrame
Columns: []
Index: [0, 1, 2]

adding columns shouldnt make get_bool_data smaller.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a test for this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test added

@jreback jreback added Clean Internals Related to non-user accessible pandas implementation labels Jun 23, 2020
@@ -688,8 +689,9 @@ def get_bool_data(self, copy: bool = False) -> "BlockManager":
copy : bool, default False
Whether to copy the blocks
"""
self._consolidate_inplace()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think consolidating here might be the better option, as it at least ensures consistent behaviour when you have multiple columns independent of consolidation status. The inconsistency between a single column and multiple column is still present of course, but IMO there is nothing to do about this giving our consolidated blocks (well, we could deprecate object dtype being regarded as bool ..)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, we could deprecate object dtype being regarded as bool

This PR does that (well changes outright, not deprecates). AFAICT thats the only way to make the behavior independent of whether the presence of another object-dtype-but-not-bool-like column.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we should deprecate this first

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only non-test place where get_bool_data is called from is in DataFrame._reduce which I know you've been working on recently. The topic of how to handle object-dtype blocks has come up there, too. If we end up handling object-dtype blocks column-wise there, that would render this distinction irrelevant.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we should deprecate this first

what exactly are you suggesting we deprecate? do you have an example of something that breaks on this change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche can you respond here re what specific behavior you'd like to deprecate?

@@ -698,7 +700,6 @@ def get_numeric_data(self, copy: bool = False) -> "BlockManager":
copy : bool, default False
Whether to copy the blocks
"""
self._consolidate_inplace()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you check the impact of this?

df["B"] = ser2

bd2 = df._get_bool_data()
tm.assert_frame_equal(bd1, bd2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also assert the actual expected result that is constructed manually instead of only ensuring both _get_bool_data calls give the same?

@@ -688,8 +689,9 @@ def get_bool_data(self, copy: bool = False) -> "BlockManager":
copy : bool, default False
Whether to copy the blocks
"""
self._consolidate_inplace()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we should deprecate this first

@jbrockmendel
Copy link
Member Author

AFAICT there are 3 concerns to be balanced here:

  1. internal consistency w/r/t sub-frames, i.e df[subset].all(bool_only=True) should never be larger than df.all(bool_only=True) (bug in master)
  2. independence of consolidation status (in the status quo we consolidate, so this is not an issue)
  3. backwards compatibility

I see two possible ways to accomplish get 1) consistent with 2):

a) drop object-dtype blocks in get_bool_data (i.e. what this PR does ATM). This will produce some cases where df.all(bool_only=True) is smaller than it is in master.
b) operate column-wise on object-dtype blocks. This will produce some cases where df.all(bool_only=True) is larger than it is in master.

I lean towards a) because it is simpler to implement and describe/document. Since it is fixing a bug in a corner case, I don't think a deprecation cycle is needed

@jorisvandenbossche
Copy link
Member

I am also in favor of your option a). Although b) is technically possible (and something we would also get with 1D blocks), I think long term we should simply not regard object dtype columns as boolean or infer if they might be boolean (the same goes for indexing with object dtype bools).

But, I still think we can deprecate this first instead of directly changing. Yes, it's quite a specific case, but it doesn't seem that complicated to deprecate? (the logic is only in _get_bool_data?)

@jbrockmendel
Copy link
Member Author

(the logic is only in _get_bool_data?)

AFAIK the only things affected are DataFrame.all(bool_only=True) and DataFrame.any(bool_only=True)

But, I still think we can deprecate this first instead of directly changing. Yes, it's quite a specific case, but it doesn't seem that complicated to deprecate?

The corner-ness of it makes me not care that much about deprecate vs change. I marginally lean towards getting it over with because there is a bug fix involved and it will make it easier to simplify DataFrame._reduce.

@jbrockmendel jbrockmendel changed the title CLN: avoid unnecessary consolidate_inplace calls BUG: DataFrame.(any|all) inconsistency Sep 8, 2020
@jbrockmendel
Copy link
Member Author

Went through the issues and added Reduction and Nuisance Column labels where appropriate. Reinforced my belief that numeric_only needs to be thrown into the sun.

@github-actions
Copy link
Contributor

github-actions bot commented Nov 5, 2020

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Nov 5, 2020
@jbrockmendel
Copy link
Member Author

rebased. i maintain this bug merits ripping off the bandaid.

@jbrockmendel jbrockmendel deleted the ref-consolidate-equals branch November 14, 2020 02:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Clean Internals Related to non-user accessible pandas implementation Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants