Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REGR: setting column with setitem should not modify existing array inplace #33457

Open
Tracked by #1 ...
jorisvandenbossche opened this issue Apr 10, 2020 · 34 comments
Open
Tracked by #1 ...
Labels
Closing Candidate May be closeable, needs more eyeballs Copy / view semantics Indexing Related to indexing on series/frames, not to indexes themselves Needs Tests Unit test(s) needed to prevent regressions Regression Functionality that used to work in a prior pandas version

Comments

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Apr 10, 2020

So consider this example of a small dataframe with a nullable integer column:

def recreate_df():
    return pd.DataFrame({'int': [1, 2, 3], 'int2': [3, 4, 5],
                         'float': [.1, .2, .3],
                         'EA': pd.array([1, 2, None], dtype="Int64")
                        })

Assigning a new column with __setitem__ (df[col] = ...) normally does not even preserve the dtype:

In [2]: df = recreate_df() 
   ...: df['EA'] = np.array([1., 2., 3.]) 
   ...: df['EA'].dtype 
Out[2]: dtype('float64')

In [3]: df = recreate_df() 
   ...: df['EA'] = np.array([1, 2, 3]) 
   ...: df['EA'].dtype
Out[3]: dtype('int64')

When assigning a new nullable integer array, it of course keeps the dtype of the assigned values:

In [4]: df = recreate_df() 
   ...: df['EA'] = pd.array([1, 2, 3], dtype="Int64") 
   ...: df['EA'].dtype 
Out[4]: Int64Dtype()

However, in this case you now also have the tricky side-effect of being in place:

In [5]: df = recreate_df() 
   ...: original_arr = df.EA.array 
   ...: df['EA'] = pd.array([1, 2, 3], dtype="Int64") 
   ...: original_arr is df.EA.array  
Out[5]: True

I don't think this behaviour should depend on the values being set, and setitem should always replace the array of the ExtensionBlock.

Because with the above way, you can unexpectedly alter the data with which you created the dataframe. See also a different example using Categorical of this at the original PR that introduced this: #32831 (comment)

cc @jbrockmendel

@jorisvandenbossche jorisvandenbossche added Indexing Related to indexing on series/frames, not to indexes themselves Regression Functionality that used to work in a prior pandas version labels Apr 10, 2020
@jorisvandenbossche jorisvandenbossche added this to the 1.1 milestone Apr 10, 2020
@jbrockmendel
Copy link
Member

setitem should always replace the array of the ExtensionBlock

I'd be OK with this creating a new EB with a new EA, not wild about having a Block get a new array.

xref #33198, in both cases AFAICT the issue involves _can_hold_element being too permissive.

@jorisvandenbossche
Copy link
Member Author

I don't care about the Block (for an end user / from the EA perspective, the Block does not have any state you might want to preserve, while the actual array does), so creating a new block is certainly fine. Probably the cleanest API anyway (assigning new column = new Block, regardless of the data type)

@jorisvandenbossche
Copy link
Member Author

@jbrockmendel would you be able to look into this?

@jbrockmendel
Copy link
Member

probably not before 1.1; im expecting my pandas time to diminish and am focused on wrapping uo frequencies and ops pushes ATM, will return to indexing after those

@jorisvandenbossche
Copy link
Member Author

OK. Alternatively, could you take a minimal look at your original PR that caused this regression (#32831) to see if you have an idea how (broadly speaking) it could be solved there? That could help me getting started on trying to fix this myself.

This is a regression in master, so a blocker for 1.1, IMO

@jbrockmendel
Copy link
Member

ill take a look

@jorisvandenbossche jorisvandenbossche added the Blocker Blocking issue or pull request for an upcoming release label Jun 13, 2020
@jorisvandenbossche
Copy link
Member Author

@jbrockmendel gentle ping for #33457 (comment)

@jbrockmendel
Copy link
Member

Alternatively, could you take a minimal look at your original PR that caused this regression (#32831) to see if you have an idea how (broadly speaking) it could be solved there? That could help me getting started on trying to fix this myself.

In #32831 the behavior being addressed was a new array was being pinned with ExtensionBlock.values = values, which didn't match the behavior of other Block subclasses that do Block.values[locs] = values

I think the "make a new ExtensionBlock instead of calling eb.set" idea discussed above is likely the best bet in the short-run. Longer-run the base class implementation block.values[locs] = values becomes viable with 2D EAs, haven't looked into whether that solves the issue at hand.

@jreback jreback modified the milestones: 1.1, 1.1.1 Jul 11, 2020
@jorisvandenbossche jorisvandenbossche modified the milestones: 1.1.1, 1.1 Jul 11, 2020
@jorisvandenbossche
Copy link
Member Author

@jreback please don't move issues from the 1.1 that other people have labeled as "blocker" without discussing it (or at least commenting about it that you changed it, changing a milestone doesn't give a notification)

@jreback
Copy link
Contributor

jreback commented Jul 11, 2020

@jorisvandenbossche this release is way behind if u want to move to 1.1.1 pls do
but we are moving forward and releasing the rc

@jreback
Copy link
Contributor

jreback commented Jul 11, 2020

there are way too many blockers that don’t have PRs if you want to put some up great
but if they don’t have anything now then they rn it blockers at all

@TomAugspurger
Copy link
Contributor

Things that are labeled as regressions / blockers need to be discussed. I personally rely on milestones to track what needs to be closed out before the release.

w.r.t. this specific issue, I think I'm OK with releasing the RC without it.

@jreback
Copy link
Contributor

jreback commented Jul 13, 2020

Things that are labeled as regressions / blockers need to be discussed. I personally rely on milestones to track what needs to be closed out before the release.

w.r.t. this specific issue, I think I'm OK with releasing the RC without it.

maybe so, but these need to be done ASAP. we cannot keep delaying things. so either remove the blocker or put a comment on WHY this is a blocker AND WHY it needs to be fixed for 1.1 Just because something is a regression does not mean it absolutely needs fixing for 1.1, there is 1.1.x of course, and blocking the entire release is silly. . @pandas-dev/pandas-core

@jreback
Copy link
Contributor

jreback commented Jul 13, 2020

if there is NO PR up for an issue I will remove the blocker labels this wednesday.

@TomAugspurger
Copy link
Contributor

@jreback can you comment on the issues when you're removing them?

@jreback
Copy link
Contributor

jreback commented Jul 13, 2020

sure

@jbrockmendel
Copy link
Member

jbrockmendel commented Jul 16, 2020

I need to clarify the expected/desired behavior. Using the example from the OP:

df = recreate_df()
ea_orig = df["EA"]
fa_orig = df["float"]

Doing either df['EA'] = np.array([1., 2., 3.]) or df['EA'] = np.array([.1, .2, .3]) does not alter ea_orig, while doing df['EA'] = pd.array([1, 2, 3], dtype="Int64") does.

The OP focuses on the EA column, but we get the same behavior if we set df["float"] = fa_orig * 2

fa_orig_copy = fa_orig.copy()
df["float"] = fa_orig * 2

>>> (fa_orig == fa_orig_copy).all()
False

@jorisvandenbossche the OP focuses on the EA column, but would you want to change the behavior for non-EA columns too? (Changing the behavior for all columns is a 1-line edit, haven't run the full test suite yet though)

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jul 27, 2020

I think our three options right now are

Of these I think we should go with #35271 for 1.1.0. It's the smallest change from 1.0.x.

Long-term, I think we want something like Brock's #35417 gets us consistency. But that probably should wait for 2.x

@TomAugspurger
Copy link
Contributor

But that probably should wait for 2.x

Thinking through this a bit more. Hopefully the whatsnew over at https://github.com/pandas-dev/pandas/pull/35417/files is clarifying, but this is probably OK to do in 1.2.

The bit about consistently assigning a new array regardless of dtype is the important part. I hope that not too many people are relying on the current behavior one way or another, given the inconsistency.

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Aug 18, 2020

The bit about consistently assigning a new array regardless of dtype is the important part. I hope that not too many people are relying on the current behavior one way or another, given the inconsistency.

It has more consequences than just the overwriting of the column in question or not, though.

One aspect I am thinking of / checking now is how this impacts consolidated blocks. Normally, assigning to an existing column (for a consolidated dtype) leaves the block structure intact:

In [1]: df = pd.DataFrame(np.random.randn(10, 4), columns=['a', 'b', 'c', 'd'])

In [2]: df._mgr
Out[2]: 
BlockManager
Items: Index(['a', 'b', 'c', 'd'], dtype='object')
Axis 1: RangeIndex(start=0, stop=10, step=1)
FloatBlock: slice(0, 4, 1), 4 x 10, dtype: float64

In [3]: block_values = df._mgr.blocks[0].values

In [4]: df['b'] = 0.0

In [5]: df._mgr
Out[5]: 
BlockManager
Items: Index(['a', 'b', 'c', 'd'], dtype='object')
Axis 1: RangeIndex(start=0, stop=10, step=1)
FloatBlock: slice(0, 4, 1), 4 x 10, dtype: float64

In [6]: df._mgr.blocks[0].values is block_values
Out[6]: True

In [7]: pd.__version__
Out[7]: '1.1.0'

While using #35417 branch (current state of time of posting):

In [1]: df = pd.DataFrame(np.random.randn(10, 4), columns=['a', 'b', 'c', 'd'])    

In [2]: df._mgr      
Out[2]: 
BlockManager
Items: Index(['a', 'b', 'c', 'd'], dtype='object')
Axis 1: RangeIndex(start=0, stop=10, step=1)
FloatBlock: slice(0, 4, 1), 4 x 10, dtype: float64

In [3]: df['b'] = 0.0   

In [4]: df._mgr  
Out[4]: 
BlockManager
Items: Index(['a', 'b', 'c', 'd'], dtype='object')
Axis 1: RangeIndex(start=0, stop=10, step=1)
FloatBlock: [0 2 3], 3 x 10, dtype: float64
FloatBlock: slice(1, 2, 1), 1 x 10, dtype: float64

So creating different blocks, and thus the assignment of one float column triggered a copy of all float columns (in this case it actually already copied due to the block layout, in some cases it might also be a slice, but once a next step performs consolidation, this will become a copy anyway).

@jbrockmendel
Copy link
Member

So creating different blocks, and thus the assignment of one float column triggered a copy of all float columns

One option to avoid this copy would be to end up with three blocks, corresponding to df[["a"]], df[["b"]], and df[["c", "d"]], respecetively. The first and third could contain views on the original array.

@jorisvandenbossche
Copy link
Member Author

That certainly avoids the copy initially, but as also mentioned above, once a next step in your analysis performs consolidation, this will still result in a full copy due to the assignment.

@jorisvandenbossche
Copy link
Member Author

But that probably should wait for 2.x

Thinking through this a bit more. Hopefully the whatsnew over at https://github.com/pandas-dev/pandas/pull/35417/files is clarifying, but this is probably OK to do in 1.2.

I am personally not yet convinced that we should do this for 1.2:

  • It has API breaking changes. It is of course difficult to assess the potential impact, but I think we should at least do a bit more effort to think this through before saying that this is OK for 1.2, and shouldn't wait until 2.0.
  • It introduces additional copies (and pandas is already copy heavy) when assigning to a single column with consolidated blocks.

As I understand, a large part of the motivation is the inconsistency in behaviour between different dtypes?
Personally, I would be fine with accepting this difference (if the difference is between consolidated blocks vs non-consolidated extension blocks). There are several things we are doing different for the new extension dtypes, and assuming they will become the default in 2.0, I think it is fine to change the default behaviour that way, instead of already wanting to align the behaviour for consolidated vs extension blocks right now.

@jbrockmendel
Copy link
Member

It has API breaking changes.

I consider the internal inconsistency to be a bug, plus reported bugs eg #35731 caused by the current behavior (no doubt that could be fixed with some other patch, but better to get at the root of the problem)

@simonjayhawkins
Copy link
Member

removing milestone and blocker label

@simonjayhawkins simonjayhawkins removed the Blocker Blocking issue or pull request for an upcoming release label Jun 11, 2021
@simonjayhawkins simonjayhawkins removed this from the 1.3 milestone Jun 11, 2021
@jbrockmendel
Copy link
Member

@jorisvandenbossche can you see if there is anything left to do here? AFAICT DataFrame.__setitem__ should now never modify the existing array inplace.

@jbrockmendel jbrockmendel added the Closing Candidate May be closeable, needs more eyeballs label Jan 12, 2022
@jreback jreback added this to the 1.5 milestone Jan 16, 2022
@mroeschke mroeschke removed this from the 1.5 milestone Aug 15, 2022
@phofl phofl added Needs Tests Unit test(s) needed to prevent regressions and removed Bug labels Apr 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Closing Candidate May be closeable, needs more eyeballs Copy / view semantics Indexing Related to indexing on series/frames, not to indexes themselves Needs Tests Unit test(s) needed to prevent regressions Regression Functionality that used to work in a prior pandas version
Projects
None yet
7 participants