REGR: setting column with setitem should not modify existing array inplace #33457
I'd be OK with this creating a new EB with a new EA, not wild about having a Block get a new array. xref #33198, in both cases AFAICT the issue involves …
I don't care about the Block (for an end user / from the EA perspective, the Block does not have any state you might want to preserve, while the actual array does), so creating a new block is certainly fine. Probably the cleanest API anyway (assigning a new column = new Block, regardless of the data type).
@jbrockmendel would you be able to look into this?
Probably not before 1.1; I'm expecting my pandas time to diminish and am focused on wrapping up frequencies and ops pushes ATM; will return to indexing after those.
OK. Alternatively, could you take a minimal look at your original PR that caused this regression (#32831) to see if you have an idea how (broadly speaking) it could be solved there? That could help me get started on trying to fix this myself. This is a regression in master, so a blocker for 1.1, IMO.
I'll take a look
@jbrockmendel gentle ping for #33457 (comment)
In #32831 the behavior being addressed was that a new array was being pinned with … I think the "make a new …"
@jreback please don't move issues off the 1.1 milestone that other people have labeled as "blocker" without discussing it (or at least commenting that you changed it; changing a milestone doesn't send a notification)
@jorisvandenbossche this release is way behind; if you want to move it to 1.1.1, please do
there are way too many blockers that don't have PRs; if you want to put some up, great
Things that are labeled as regressions / blockers need to be discussed. I personally rely on milestones to track what needs to be closed out before the release. w.r.t. this specific issue, I think I'm OK with releasing the RC without it.
Maybe so, but these need to be done ASAP. We cannot keep delaying things. So either remove the blocker label or put a comment on WHY this is a blocker AND WHY it needs to be fixed for 1.1. Just because something is a regression does not mean it absolutely needs fixing for 1.1; there is 1.1.x of course, and blocking the entire release is silly. @pandas-dev/pandas-core
If there is NO PR up for an issue, I will remove the blocker labels this Wednesday.
@jreback can you comment on the issues when you're removing them?
sure |
I need to clarify the expected/desired behavior. Using the example from the OP:
Doing either … The OP focuses on the EA column, but we get the same behavior if we set …
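A hypothetical reconstruction of the kind of setitem call under discussion (the thread's original code blocks did not survive this page; names and values here are illustrative):

```python
import numpy as np
import pandas as pd

# A small DataFrame with a nullable integer column, as in the OP.
df = pd.DataFrame({"a": pd.array([1, 2, None], dtype="Int64")})

# Assigning plain float values replaces the column's dtype entirely:
df["a"] = np.array([0.1, 0.2, 0.3])
print(df["a"].dtype)  # float64 -- the assigned values' dtype wins
```

The question in the thread is whether this "replace the array" behavior should also apply when the assigned values happen to match the existing EA dtype.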
@jorisvandenbossche the OP focuses on the EA column, but would you want to change the behavior for non-EA columns too? (Changing the behavior for all columns is a 1-line edit, haven't run the full test suite yet though)
I think our three options right now are
Of these I think we should go with #35271 for 1.1.0. It's the smallest change from 1.0.x. Long-term, I think we want something like Brock's #35417, which gets us consistency. But that should probably wait for 2.x.
Thinking through this a bit more. Hopefully the whatsnew over at https://github.com/pandas-dev/pandas/pull/35417/files is clarifying, but this is probably OK to do in 1.2. The bit about consistently assigning a new array regardless of dtype is the important part. I hope that not too many people are relying on the current behavior one way or another, given the inconsistency.
It has more consequences than just the overwriting of the column in question or not, though. One aspect I am thinking of / checking now is how this impacts consolidated blocks. Normally, assigning to an existing column (for a consolidated dtype) leaves the block structure intact:
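A sketch of the point about consolidated blocks, using private internals (`DataFrame._mgr` and its `.blocks`, so the exact output depends on the pandas version; the column names are illustrative):

```python
import numpy as np
import pandas as pd

# Three float64 columns get consolidated into a single 2D block:
df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0], "c": [5.0, 6.0]})
print(len(df._mgr.blocks))  # 1 (one consolidated float64 block)

# Assigning into an existing column; whether this keeps the single
# consolidated block or splits it into several depends on the
# pandas version / the proposed change being discussed here:
df["b"] = np.array([7.0, 8.0])
print(len(df._mgr.blocks))
```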
While using the #35417 branch (current state at time of posting):
So it creates different blocks, and thus the assignment of one float column triggered a copy of all float columns (in this case it actually already copied due to the block layout; in some cases it might instead be a slice, but once a next step performs consolidation, this will become a copy anyway).
One option to avoid this copy would be to end up with three blocks, corresponding to …
That certainly avoids the copy initially, but as also mentioned above, once a next step in your analysis performs consolidation, this will still result in a full copy due to the assignment. |
I am personally not yet convinced that we should do this for 1.2:
As I understand, a large part of the motivation is the inconsistency in behaviour between different dtypes? |
I consider the internal inconsistency to be a bug, plus reported bugs eg #35731 caused by the current behavior (no doubt that could be fixed with some other patch, but better to get at the root of the problem) |
removing milestone and blocker label |
@jorisvandenbossche can you see if there is anything left to do here? AFAICT |
So consider this example of a small dataframe with a nullable integer column:
Assigning a new column with `__setitem__` (`df[col] = ...`) normally does not even preserve the dtype. When assigning a new nullable integer array, it of course keeps the dtype of the assigned values. However, in this case you now also have the tricky side-effect of the operation being in place:
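A minimal sketch of that side-effect (the original code block was lost from this page; the behavior shown in the comments varies across pandas versions — in the regressed version, the original array itself got overwritten):

```python
import pandas as pd

# Create a DataFrame from a user-held nullable integer array.
arr = pd.array([1, 2, 3], dtype="Int64")
df = pd.DataFrame({"a": arr})

# Assign a new nullable integer array to the same column.
df["a"] = pd.array([10, 20, 30], dtype="Int64")

print(df["a"].tolist())  # [10, 20, 30]
# In the regressed behavior, `arr` was mutated in place here;
# after the fix, the DataFrame gets a new array and `arr` is untouched.
print(arr)
```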
I don't think this behaviour should depend on the values being set, and setitem should always replace the array of the ExtensionBlock.
Because with the above behaviour, you can unexpectedly alter the data with which you created the dataframe. See also a different example of this using Categorical at the original PR that introduced it: #32831 (comment)
cc @jbrockmendel