BUG: pandas 2.1.2 changes how copy works #55763

Closed
3 tasks done
flying-sheep opened this issue Oct 30, 2023 · 3 comments · Fixed by #55764
Labels
Regression Functionality that used to work in a prior pandas version Subclassing Subclassing pandas objects
Milestone

Comments

@flying-sheep
Contributor

flying-sheep commented Oct 30, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandas.
  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

    import pandas as pd

    class Sub(pd.DataFrame):
        pass

    assert not isinstance(Sub().copy(), Sub)

Issue Description

For a very long time, calling .copy() on a DataFrame subclass has returned a plain DataFrame object. In pandas 2.1.2 it returns an instance of the subclass instead. A change like this belongs in a major release, not a feature release, and definitely not a patch release.

Expected Behavior

The assert statement above should succeed.

See also

Installed Versions

pandas 2.1.2 or 2.2.0.dev0+447

@flying-sheep flying-sheep added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 30, 2023
@flying-sheep
Contributor Author

We’re pinning Pandas to !=2.1.2.

Can we rely on Pandas 2.2.0 to fix this?

@ivirshup
Contributor

I believe this was specifically caused by the changes to _constructor_from_mgr in #54922. Previously (2.1.1), this looked like:

    def _constructor_from_mgr(self, mgr, axes):
        df = self._from_mgr(mgr, axes=axes)

        if type(self) is DataFrame:
            # fastpath avoiding constructor call
            return df
        else:
            assert axes is mgr.axes
            return self._constructor(df, copy=False)

But now it looks like:

    def _constructor_from_mgr(self, mgr, axes):
        if self._constructor is DataFrame:
            # we are pandas.DataFrame (or a subclass that doesn't override _constructor)
            return self._from_mgr(mgr, axes=axes)
        else:
            assert axes is mgr.axes
            return self._constructor(mgr)

Behaviour changed for other methods, like .groupby, as well.

I believe we were relying on the constructor being called for our subclass, since we want to coerce to a pd.DataFrame here. However, it's unclear whether there is any mechanism for doing that now, because we would want ._constructor to be pd.DataFrame, which is exactly the case the new code branches on.
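To make the difference concrete, here is a minimal toy model of the two branches quoted above. `Frame`, `copy_old`, and `copy_new` are hypothetical stand-ins, not pandas APIs, and the model assumes that `_from_mgr` builds an instance of whatever class it is called on, which is what the behaviour observed in 2.1.2 suggests.

```python
class Frame:
    """Toy stand-in for pandas.DataFrame (hypothetical, not the real class)."""

    def __init__(self, data=None):
        # Accept another Frame (like DataFrame(df)) or raw data.
        self.data = data.data if isinstance(data, Frame) else data

    @property
    def _constructor(self):
        # Subclasses that do not override this hand back the base class.
        return Frame

    @classmethod
    def _from_mgr(cls, mgr):
        # Assumption: bypasses __init__ and builds an instance of cls.
        obj = cls.__new__(cls)
        obj.data = mgr
        return obj

    def copy_old(self):
        # 2.1.1-style branching: test the type of self.
        df = self._from_mgr(self.data)
        if type(self) is Frame:
            return df
        return self._constructor(df)

    def copy_new(self):
        # 2.1.2-style branching: test what _constructor returns.
        if self._constructor is Frame:
            return self._from_mgr(self.data)
        return self._constructor(self.data)


class Sub(Frame):
    pass


sub = Sub([1, 2])
print(type(sub.copy_old()).__name__)  # Frame
print(type(sub.copy_new()).__name__)  # Sub
```

In this sketch the old branch always routes a subclass through `_constructor`, so `Sub` is coerced to the base class, while the new branch takes the fast path and keeps `Sub`.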

I've opened a draft PR which may fix this:

This PR reverts the branching logic to check the class of the object being passed in instead of the class returned by _constructor. It seems to work locally, but I'm not sure I fully understand why the bug-inducing change was made.
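As a rough illustration of what branching on the class of the object means for different kinds of subclasses, here is a toy sketch. `Frame`, `Coercing`, `Preserving`, and `_new_like_self` are hypothetical names for illustration only; this is not the code in the PR.

```python
class Frame:
    """Toy stand-in for pandas.DataFrame; not the real class."""

    def __init__(self, values):
        self.values = list(values)

    @property
    def _constructor(self):
        return Frame

    def _new_like_self(self, values):
        # Branch on the class of self rather than on _constructor.
        if type(self) is Frame:
            # Fast path: plain base class, no constructor hook needed.
            return Frame(values)
        # Every subclass goes through its _constructor hook.
        return self._constructor(values)

    def copy(self):
        return self._new_like_self(self.values)


class Coercing(Frame):
    """Relies on the default hook, so results degrade to Frame."""


class Preserving(Frame):
    """Overrides the hook to keep its own type."""

    @property
    def _constructor(self):
        return Preserving


print(type(Coercing([1]).copy()).__name__)    # Frame
print(type(Preserving([1]).copy()).__name__)  # Preserving
```

With this check, a subclass that wants plain results and a subclass that wants to be preserved can both get what they ask for through `_constructor`.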

Would appreciate a look from @jorisvandenbossche.

@mroeschke mroeschke added this to the 2.1.3 milestone Oct 30, 2023
@lithomas1 lithomas1 added Copy / view semantics Subclassing Subclassing pandas objects and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 31, 2023
@rhshadrach rhshadrach added the Regression Functionality that used to work in a prior pandas version label Oct 31, 2023
@jorisvandenbossche
Member

Thanks for the detailed report and analysis, @flying-sheep and @ivirshup!
It's unfortunate that the fix for a recursion error in one subclass caused a recursion error in a different subclass.

But yes, this is something we definitely should fix. I didn't consider the case of subclasses that don't get preserved but always return a plain pandas object for each operation.

dongjoon-hyun pushed a commit to apache/spark that referenced this issue Nov 15, 2023
### What changes were proposed in this pull request?
Upgrade pandas from 2.1.2 to 2.1.3

### Why are the changes needed?
Fixed infinite recursion from operations that return a new object on some DataFrame subclasses ([GH 55763](pandas-dev/pandas#55763))
and Fix [read_parquet()](https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html#pandas.read_parquet) and [read_feather()](https://pandas.pydata.org/docs/reference/api/pandas.read_feather.html#pandas.read_feather) for [CVE-2023-47248](https://www.cve.org/CVERecord?id=CVE-2023-47248) ([GH 55894](pandas-dev/pandas#55894))

[Release notes for 2.1.3](https://pandas.pydata.org/docs/whatsnew/v2.1.3.html)

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #43822 from bjornjorgensen/pandas-2_1_3.

Authored-by: Bjørn Jørgensen <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

6 participants