REGR: setting column with setitem should not modify existing array inplace #33457
I'd be OK with this creating a new EB with a new EA, not wild about having a Block get a new array. xref #33198, in both cases AFAICT the issue involves …
I don't care about the Block (for an end user / from the EA perspective, the Block does not have any state you might want to preserve, while the actual array does), so creating a new block is certainly fine. Probably the cleanest API anyway (assigning a new column = new Block, regardless of the data type).
@jbrockmendel would you be able to look into this?
Probably not before 1.1; I'm expecting my pandas time to diminish and am focused on wrapping up frequencies and ops pushes ATM; will return to indexing after those.
OK. Alternatively, could you take a minimal look at your original PR that caused this regression (#32831) to see if you have an idea how (broadly speaking) it could be solved there? That could help me get started on trying to fix this myself. This is a regression in master, so a blocker for 1.1, IMO.
I'll take a look
@jbrockmendel gentle ping for #33457 (comment)
In #32831 the behavior being addressed was that a new array was being pinned with … I think the "make a new …"
@jreback please don't move issues off the 1.1 milestone that other people have labeled as "blocker" without discussing it (or at least commenting that you changed it; changing a milestone doesn't send a notification)
@jorisvandenbossche this release is way behind; if you want to move it to 1.1.1, please do
there are way too many blockers that don't have PRs; if you want to put some up, great
Things that are labeled as regressions / blockers need to be discussed. I personally rely on milestones to track what needs to be closed out before the release. w.r.t. this specific issue, I think I'm OK with releasing the RC without it.
Maybe so, but these need to be done ASAP. We cannot keep delaying things. So either remove the blocker label or put a comment on WHY this is a blocker AND WHY it needs to be fixed for 1.1. Just because something is a regression does not mean it absolutely needs fixing for 1.1; there is 1.1.x of course, and blocking the entire release is silly. @pandas-dev/pandas-core
If there is NO PR up for an issue, I will remove the blocker labels this Wednesday.
@jreback can you comment on the issues when you're removing them?
sure |
I need to clarify the expected/desired behavior. Using the example from the OP:
Doing either … The OP focuses on the EA column, but we get the same behavior if we set …
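A hypothetical reconstruction of the kind of setitem call under discussion (the thread's original code blocks did not survive this page; names and values here are illustrative):

```python
import numpy as np
import pandas as pd

# A small DataFrame with a nullable integer column, as in the OP.
df = pd.DataFrame({"a": pd.array([1, 2, None], dtype="Int64")})

# Assigning plain float values replaces the column's dtype entirely:
df["a"] = np.array([0.1, 0.2, 0.3])
print(df["a"].dtype)  # float64 -- the assigned values' dtype wins
```

The question in the thread is whether this "replace the array" behavior should also apply when the assigned values happen to match the existing EA dtype.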
@jorisvandenbossche the OP focuses on the EA column, but would you want to change the behavior for non-EA columns too? (Changing the behavior for all columns is a 1-line edit, haven't run the full test suite yet though)
I think our three options right now are
Of these I think we should go with #35271 for 1.1.0. It's the smallest change from 1.0.x. Long-term, I think we want something like Brock's #35417, which gets us consistency. But that should probably wait for 2.x.
Thinking through this a bit more. Hopefully the whatsnew over at https://github.com/pandas-dev/pandas/pull/35417/files is clarifying, but this is probably OK to do in 1.2. The bit about consistently assigning a new array regardless of dtype is the important part. I hope that not too many people are relying on the current behavior one way or another, given the inconsistency.
It has more consequences than just the overwriting of the column in question or not, though. One aspect I am thinking of / checking now is how this impacts consolidated blocks. Normally, assigning to an existing column (for a consolidated dtype) leaves the block structure intact:
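A sketch of the point about consolidated blocks, using private internals (`DataFrame._mgr` and its `.blocks`, so the exact output depends on the pandas version; the column names are illustrative):

```python
import numpy as np
import pandas as pd

# Three float64 columns get consolidated into a single 2D block:
df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0], "c": [5.0, 6.0]})
print(len(df._mgr.blocks))  # 1 (one consolidated float64 block)

# Assigning into an existing column; whether this keeps the single
# consolidated block or splits it into several depends on the
# pandas version / the proposed change being discussed here:
df["b"] = np.array([7.0, 8.0])
print(len(df._mgr.blocks))
```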
While using the #35417 branch (current state at time of posting):
So it creates different blocks, and thus the assignment of one float column triggered a copy of all float columns (in this case it actually already copied due to the block layout; in some cases it might instead be a slice, but once a next step performs consolidation, this will become a copy anyway).
One option to avoid this copy would be to end up with three blocks, corresponding to …
That certainly avoids the copy initially, but as also mentioned above, once a next step in your analysis performs consolidation, this will still result in a full copy due to the assignment. |
I am personally not yet convinced that we should do this for 1.2:
As I understand, a large part of the motivation is the inconsistency in behaviour between different dtypes? |
I consider the internal inconsistency to be a bug, plus reported bugs eg #35731 caused by the current behavior (no doubt that could be fixed with some other patch, but better to get at the root of the problem) |
removing milestone and blocker label |
@jorisvandenbossche can you see if there is anything left to do here? AFAICT |
So consider this example of a small dataframe with a nullable integer column:
Assigning a new column with `__setitem__` (`df[col] = ...`) normally does not even preserve the dtype. When assigning a new nullable integer array, it of course keeps the dtype of the assigned values. However, in this case you now also have the tricky side-effect of the operation being in place:
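A minimal sketch of that side-effect (the original code block was lost from this page; the behavior shown in the comments varies across pandas versions — in the regressed version, the original array itself got overwritten):

```python
import pandas as pd

# Create a DataFrame from a user-held nullable integer array.
arr = pd.array([1, 2, 3], dtype="Int64")
df = pd.DataFrame({"a": arr})

# Assign a new nullable integer array to the same column.
df["a"] = pd.array([10, 20, 30], dtype="Int64")

print(df["a"].tolist())  # [10, 20, 30]
# In the regressed behavior, `arr` was mutated in place here;
# after the fix, the DataFrame gets a new array and `arr` is untouched.
print(arr)
```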
I don't think this behaviour should depend on the values being set, and setitem should always replace the array of the ExtensionBlock.
Because with the above behaviour, you can unexpectedly alter the data with which you created the dataframe. See also a different example of this using Categorical at the original PR that introduced it: #32831 (comment)
cc @jbrockmendel