DISCUSS/API: setitem-like operations should only update inplace and never fallback with upcast (i.e never change the dtype) #39584
Comments
Sidenote: the extension arrays are actually already more strict on this (which is also needed, otherwise setitem could change the class of the object). But the upcasting logic lives a level higher on the Series/DataFrame, where the underlying array gets swapped when an upcast happens. So in other words, the proposal is to propagate that stricter behaviour of the arrays also to Series/DataFrame. |
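To illustrate that stricter array-level behaviour, a minimal sketch (the exact exception type and message vary by pandas version):

```python
import pandas as pd

arr = pd.array([1, 2, 3], dtype="Int64")
arr[0] = "a"  # the extension array validates the value and raises; it never upcasts
```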
I tend to agree this is a step to take sooner or later. I don't think I ever met a case in which implicitly upcasting via a setitem on individual elements was a desired feature and not a bug. Clearly (?), this should *not* apply to replacing an entire column of a DataFrame (or multiple columns) with new ones. Or if we want to state it more generally: the dtype should not change *unless* the smallest dtype-bearing block (which is the column) is entirely replaced. And just for completeness: it would apply to `df.loc[:, 'col'] = s` but not to `df['col'] = s` (notice that the former currently replaces the dtype, e.g. if col previously had int dtype and s contains Timestamps, something that I suspect should not happen). |
I am generally positively disposed towards this idea. Some things that are
not obvious:
1) Does this apply to setitem-like ops on Index? In particular, if I add a
new column to a DataFrame and doing so would require casting the existing
columns, does that raise?
2) Can we get a complete-ish list of what constitutes setitem-like? (e.g.
above I'm assuming Index.append counts, but that _can't_ be inplace, so it
might reasonably be excluded)
3) Some of the casting is a result of fallback-on-failure, but other pieces
are due to downcasting-on-success. (see Block._maybe_downcast and
Block.convert; affected methods include where, fillna, interpolate,
replace). Is the proposal to change these behaviors too?
4) Because the fallback-on-failure only occurs after a failure, it would be
fairly cheap to do a pd.get_option lookup (or an obj.flags lookup) to
decide on cast-vs-raise. The major downside would be having to
test/support two variants, the upside is that the It Just Works behavior is
often really convenient.
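As a rough sketch of the cast-vs-raise idea in point 4 (illustrative pseudologic only; the helper and its `strict` flag are hypothetical, not actual pandas internals):

```python
import numpy as np

def setitem_with_fallback(values, indexer, value, strict=False):
    """Hypothetical helper: try an in-place set, fall back to upcasting."""
    try:
        values[indexer] = value          # in-place attempt, dtype preserved
        return values
    except (TypeError, ValueError):
        if strict:                       # opt-in strict mode: re-raise instead of casting
            raise
        upcast = values.astype(object)   # current fallback: upcast, then set
        upcast[indexer] = value
        return upcast

a = np.array([1, 2, 3])
a = setitem_with_fallback(a, 0, "x")     # returns an object-dtype copy
```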
|
@toobaz Yes, thanks for explicitly stating that, as I forgot to mention it. Indeed, the proposal is about the cases where (a subset of) the values are changed in place, and not where we are replacing a full column.
Indeed, that's probably the distinction we want to make. There is some discussion about this in #38896 (comment)
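To make the distinction concrete, a minimal sketch of the behaviour discussed (as it stands pre-deprecation):

```python
import pandas as pd

df = pd.DataFrame({"col": [1, 2, 3]})

# setting a subset of values in place -- the case covered by the proposal:
df.loc[0, "col"] = "a"        # currently upcasts "col" to object silently

# replacing the full column -- not covered, always allowed:
df["col"] = ["x", "y", "z"]   # the new values simply bring their own dtype
```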
Good point to explicitly call out. I would say: yes, Index and Series should generally be consistent. But note that we don't allow direct (by the user) setitem on an Index anyway, since Index is immutable. To be explicit: this proposal does not cover concat/append methods or set operations (union, intersection, etc), as those classes of functions inherently create new objects (with a potentially different shape or order) and follow the upcasting rules (…). The case you mention of adding a new column to a DataFrame (…)
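For reference, direct setitem on an `Index` already raises today:

```python
import pandas as pd

idx = pd.Index([1, 2, 3])
idx[0] = 10  # TypeError: Index does not support mutable operations
```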
Let's leave that for a separate issue to discuss, as in theory that's orthogonal in implementation (although certainly related). AFAIK we never do that for actual setitem, but only in a few specific methods in the internals (fillna, interpolate, where, replace). |
Not sure I follow: … |
But indeed good to think about a more complete list. After going through the namespace of Series/DataFrame, I think there are basically two groups: actual setitem operations (`__setitem__`, `.loc`/`.iloc` setitem, etc), and methods that can update values in place (fillna, where, replace, interpolate, …).
So any other method that potentially changes the shape is not included (eg append). Those can also typically not be expressed as a setitem equivalent (setitem with arrays cannot expand). For the above list, I would say that direct … |
@toobaz does my last comment clarify that? (we could also keep the two groups (actual setitem vs methods) as two separate discussions, if that helps) |
@jorisvandenbossche it does clarify, although as you pointed out, members of your second group of methods are more heterogeneous in their behavior, and hence a general rule of (not) upcasting might be hard to enforce (harder than just saying "please be aware that every time you use …"). To the extent that we (at least in a first stage) focus on the first group, then, … (In general, as long as … |
This might be heresy, but maybe this proposal should be considered in conjunction with an idea of getting rid of `object` dtype. |
I stated my opinion above: if we keep … |
I also don't think that the discussion on keeping the … A method like … |
This was discussed on the call on Wednesday. @jorisvandenbossche would you like to summarize the conclusion and we'll make sure we're all on the same page (again we forgot to write it down in real-time) |
An attempt at enumerating the various setitem-like methods, with some granularity into degree of setitem-likeness.

* For this discussion, "in-place" refers to writing to an underlying array, which is not equivalent to the `inplace` keyword.

Fully setitem-like methods are those that currently may operate in place: …

Methods that take a …
|
(We actually did take some notes about it this time (although a bit terse ;)), and it was on my TODO list to report back here. Thanks for the ping!)

So while briefly discussing this at the meeting last week, there was the useful feedback that also for other (not purely setitem related) operations, pandas will often liberally upcast dtypes, and for those operations too it might be useful to be (or have the option to be) more strict about dtypes. In addition to the setitem-like cases discussed here (see the two bullet points at #39584 (comment) for an overview), two other groups of cases were mentioned: …
The "concat" case is quite different and certainly deserves a separate discussion (it's on my TODO list to open one, but if someone is interested in this, feel free to do it before me). |
@jbrockmendel a few specific things about your list above (#39584 (comment)):
It doesn't take
There is a Series.mask, but you're right about putmask in particular.
I'll defer to you on how to keep the discussion narrow. My focus ATM is on trying to go from many implementations of casting logic to just one. |
I don't think we can/want to have only one. For example, at least the casting logic for setitem and for concat could be different.
The same is true for … We can have a lot of discussions about which methods to include in which group, so I would propose to start with the actual setitem group. |
@jorisvandenbossche this was on the agenda for the Feb 2021 call but I don't see any notes about the discussion. I have a vague recollection that we discussed a less-strict (not mutually-exclusive) restriction that would prevent silent casting to object, but allow e.g. int->float. Do you have any memory or opinion on this? |
The notes are basically above (#39584 (comment), and your and my comment below that). Now, I don't really remember anything about a less-strict casting (except that we basically have this right now for concat (through "common_dtype"), although there you also still get a silent cast to object dtype at the moment if there is no common dtype). Do you remember for what kind of operation we might have talked about this? |
I think there was discussion about … For the most part … The distinction: 1) upcasting within a kind (e.g. int -> float) and 2) silent casting to object. The OP suggestion is to deprecate allowing both 1) and 2). The less-strict suggestion is to deprecate only 2), without making a decision about deprecating 1). The benefits as I see them are a) object-casting cases are the most likely to be accidental, e.g. with dt64 a typo "2016-01-01" -> "2p16-01-01". |
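The accidental object-casting case in a) might look like this (a sketch of the pre-deprecation behaviour):

```python
import pandas as pd

s = pd.Series(pd.to_datetime(["2016-01-01", "2016-01-02"]))
s[0] = "2p16-01-01"  # typo: no longer parseable as a datetime
print(s.dtype)       # object -- the datetime64 dtype is silently lost
```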
I find the rules to be a bit inconsistent. For example

```python
import pandas as pd

a = pd.Series([1, 2, 3], dtype="i4")
a.iloc[0] = 2**33 + 1.1  # OK
b = pd.Series([1, 2, 3], dtype="i4")
b.iloc[0] = 2**33 + 1.0  # raises because it is cast to int before setting
```

Similarly …
|
@bashtage what version of pandas are you using? For me, all those examples silently pass (and in the case of … (which seems like a bug))
@jorisvandenbossche This was run against master on Windows. It is probably unrelated but I wanted to use master since I had a number of problems with future changes like … |
Marking as blocker for 1.4rc; we should make a decision about deprecation before then. |
@jorisvandenbossche for …: if not, then I think the scope would also need expanding to the casting logic in `pandas/core/internals/blocks.py`, lines 1429 to 1450 (at e38daf0),
e.g.:

```python
import numpy as np
import pandas as pd
from pandas.core.arrays import period_array

ser = pd.Series(period_array(["2000", "2001", "NaT"], freq="D"))
cond = np.array([True, False, True])
other = pd.Period("2000", freq="H")
print(ser.where(cond, other))
```

This would be a quite noticeable change, and from the last dev call, there wasn't a consensus on whether this should be done everywhere or whether upcasting should be allowed in certain cases, like … |
I recently upgraded to pandas 2.1.1 and am now trying to solve the hundreds of FutureWarnings in my code. However, I find this new behavior of not changing automatically from integer to float pretty hard to internalize and not very consistent. In general, when dealing with all-integer DataFrames I find it confusing that changing the column will work, while changing the row will not. So far, I have seen it as a main advantage of pandas (also compared to data.frame in R) that I usually don't have to worry about datatypes too much.

```python
import pandas as pd

df = pd.DataFrame({"X": [1, 2], "Y": [3, 4]}, index=["A", "B"])
df.loc[:, "Y"] += 0.1   # allowed
df.loc["A", :] += 0.1   # raises FutureWarning
```

What I find especially counterintuitive is that now the order of execution matters more often.

```python
df.loc[:, "Y"] += 0.1    # allowed, and changes "Y" to float64
df.loc["A", "Y"] += 0.5  # allowed
```

If I do it the other way round, it will fail:

```python
df.loc["A", "Y"] += 0.5  # raises FutureWarning
df.loc[:, "Y"] += 0.1
```

I guess the discussion on this topic has ended, but personally I would very much like to get the old behavior back, where integer columns automatically change to float when necessary. |
thanks @UlrichKreidenweis for the report

The general rule is that the deprecation doesn't apply if you're replacing the entire column. In the case of `df.loc[:, "Y"] += 0.1`, the entire column "Y" is replaced. In the case of `df.loc["A", :] += 0.1`, values are set into the existing columns, so the deprecation applies.

So, to me this looks like it's working as expected.

As for avoiding the warning, you could not do the operation in place, e.g.:

```python
In [56]: pd.concat([df[col].where(df.index!='A', df[col]+.1) for col in df.columns], axis=1)
Out[56]:
     X    Y
A  1.1  3.1
B  2.0  4.0
```

Or just cast everything to float to begin with |
@MarcoGorelli many thanks for the quick reply. I think I understand the current behavior and agree that it's probably working as intended. All I was trying to say was that this change has not improved my personal user experience. The only slight inconsistency that I have found so far is that when changing the whole column, but by naming all the indexes, this also raises a warning:

```python
import pandas as pd

df = pd.DataFrame({"X": [1, 2], "Y": [3, 4]}, index=["A", "B"])
df.loc[["A", "B"], "Y"] += 0.1
# or like this
df.loc[df.index, "Y"] += 0.1
```
|
From some discussion with @phofl and @jbrockmendel, it seems that `df.loc[:, "Y"] += 0.1` should indeed be warning as well, as it is also inplace. Will take a look, and thanks again for the report |
Discussion today with @phofl: in theory, with copy-on-write enabled, `df.loc[:, 'a'] += .1` should be a no-op on the original `df`. But, even enabling copy-on-write, at the moment it does modify `df`:

```
(.venv) marcogorelli@DESKTOP-U8OKFP3:~/tmp$ cat t.py
import pandas as pd

pd.options.mode.copy_on_write = True
df = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6]})
df.loc[:, 'a'] += .1
print(df)
(.venv) marcogorelli@DESKTOP-U8OKFP3:~/tmp$ python t.py
     a  b
0  1.1  4
1  2.1  5
2  3.1  6
```

so this needs further investigation in any case |
It seems quite odd to me that `df.loc[:, 'a'] += .1` would be a no-op. In any case, is most of this issue handled by PDEP-6? I would suggest creating new issues for the remaining implementation problems if that is the case. |
yeah I don't think the comment that it should be a no-op is correct anyway, agree - let's close and discuss the issue in #55791, which seems to be the same thing, thanks all! |
Currently, setitem-like operations (i.e. operations that change values in an existing Series or DataFrame, such as `__setitem__` with `.loc`/`.iloc`, or filling methods like `fillna`) first try to update in place, but if there is a dtype mismatch, pandas will upcast to a common dtype (typically object dtype).

For example, setting a string into an integer Series upcasts to object:
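A minimal sketch of that behaviour (assuming pre-deprecation pandas; exact output may vary by version):

```python
import pandas as pd

s = pd.Series([1, 2, 3])
print(s.dtype)  # int64
s[0] = "a"      # no error: the whole Series is silently upcast
print(s.dtype)  # object
```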
or doing a `fillna` with an invalid fill value also upcasts instead of raising an error:
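And a sketch of the `fillna` case (same assumptions):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
out = s.fillna("missing")  # a string is not a valid float64 fill value
print(out.dtype)           # object -- upcast instead of raising
```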
My general proposal would be that in some future version (e.g. pandas 2.0, after a deprecation), such inherently inplace operations are guaranteed to either happen in place or raise an error, and thus never change the dtype of the original Series/DataFrame.

This is similar to e.g. numpy's behaviour, where setitem never changes the dtype. Showing the first example from above in equivalent numpy code:
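A sketch of the numpy comparison:

```python
import numpy as np

a = np.array([1, 2, 3])
a[0] = "a"  # ValueError: invalid literal for int() with base 10: 'a'
```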
Apart from that, I also think this is the cleaner behaviour with fewer surprises. If a user specifically wants to allow mixed types in a column, they can manually cast to `object` dtype first.

On the other hand, this is quite a big change from how permissive we generally are right now with easy upcasting, and such a change will certainly impact quite some user code (but it's perfectly possible to do this with proper deprecation warnings in advance for the specific cases where it will error in the future, AFAIK).

There are certainly some more details that need to be discussed as well if we want this (which exact values are regarded as compatible with the dtype? e.g. setting a float in an integer column, should that error or silently round the float?). But what are people's thoughts on the general idea?
cc @pandas-dev/pandas-core