-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: NumericIndex for any numpy int/uint/float dtype #41153
Conversation
Probably stupid (and maybe already discussed) question, but I see this as two steps:
... point 1) requires no deprecation. Do we actually need 2)? I mean: I do see we declutter the namespace; I'm just not sure this is the best thing for our users. |
Hey @toobaz. I haven't discussed this with anyone, wasn't sure if I'd end up making this work or not :-) Wrt. your first point, I'm not sure I understand, but I've seperated out fixes to bugs in the code base (in #41063, #41073 (already merged) + #41132 and #41121, (not yet merged, present in this PR, so this PR will (hopefully ) pass the CI)). To you means any other part that should be seperated out? The changes in Wrt. your second point, do you mean we should hide |
Sorry, I was rather suggesting the opposite: not deprecating My point is just: it's great to 1) have all numeric types supported by a same class and 2) supporting new types. But we can do this in a fully backwards compatible (and yes, I understand this PR is backwards compatible, I'm referring more to the longer term plan), and maybe simpler for users, way. By the way, I think that what I'm proposing is mostly compatible with the current PR, with the possible exception of the docs and of the places where you instantiate a (But it's an important API decision, so before getting into these details it's probably good to collect other devs' advice) |
Ok, thanks for clarifying. My own opinion is that I like better a common container type for content that is similar, and more specific differentiation would be done by looking at the dtype attribute. That is also mostly what pandas does, e.g. >>> pd.array([1,2,3], dtype="float64")
<PandasArray>
[1.0, 2.0, 3.0]
Length: 3, dtype: float64 Though we also have a I think I will wait with further development on this PR until other devs chime in. |
Sidenote: you shouldn't look too much at |
In general, I am fully +1 of supporting all numeric dtypes as Index, instead of limiting it to the 64-bit version. For how to expose this to the user, is it correct to summarize the different options as:
And in both the first and third option, we can initially ensure that the existing Personally, and speaking from a user perspective, and specifically for the numeric indexes, the subclasses don't add that much value. For example, the |
That is a good summary of the options, @jorisvandenbossche. Wrt. option 3, I agee that would be the best interface, thought I have two reservations:
What are your thoughts on backwards compatability, if we join the numeric index dtypes into |
I think option 3, while if we started over might be fine is almost impossible from a back compat perspective. Option 2 is thus appropriate. We could possibly move to option 3 in a future version, but way better to do this in multiple steps. |
I also think having less classes is not necessarily good for users - while I accept that my proposal of creating new ones looks inelegant, I do think that at least generic EDIT: unless the choice was to really have only one class, |
I think the same issue arises when using There might also be a somewhat "in-between" option: already implement it in the
@toobaz how are they different for a user? The only difference would be the value of (it's true that it would be somewhat inconsistent with still having dtype-specific sublcasses for DatetimeIndex, IntervalIndex etc. But having a single NumericIndex that covers several dtypes is already inconsistent with that pattern in itself as well .. so personally I don't see that as a problem) |
True, but anyway, the consequence is that By the way, an So for me seeing and
True, both of these things could be conveyed by the |
I've made a updated version, where
I'll look into the architecture issue later tonight. |
Thanks for the comments. My views:
Yes, that was my idea. That would keep this addition 100 % backwards compatible. Users who want e.g. int16, would do the explicit
I agree, but that is a trade-off between keeping compatability and improving the architecture. Not easy, but I think this is a relatively large change, so my preference is keeping backward compat untill pandas 2.0.
This would be easy to implement. Most people wouldn't be affected, but there will always be someone who do checks by doing e.g. So if we change how BTW, I like the suggestion from @jreback to do this stepwise. That would also make working on this more "relaxed", (The exact public API wouldn't have to be finished in this one PR). For example, the
No matter which choice is made, this PR will be a step in the right direction. I could open an issue, where we further discuss the choice for public API. |
BTW, I still need to look at some tests for Int64Index etc., if they should also be used for |
+1 |
I’ve updated and rebased. |
Ping. |
thanks @topper-123 really nice, sorry for the delay. if you can rebase the doc one we can merge that too. |
This PR presents a unified index type called
NumericIndex
for numerical indexing, supporting int64/32/16/8, uint64/32/16/8 and float64/32/16 numpy dtypes. This means that indexes with other numeric numpy dtypes than int64, uint64 and float64 will become possible. This index type is therefore functionally a super type ofInt64Index
,UInt64Index
andFloat64Index
and is intended to eventually replace those.Comments very welcome, of course, as this is probably quite a fundamental change to Pandas. @pandas-dev/pandas-core.
Status
Missing is especially:
NumericIndex
as experimental in 1.3?Deprecation cycle
ThIs PR is intended to be fully backwards compatible.
This means that public operations that currently return e.g. a
Int64Index
, will continue working unchanged. My view is thatNumericIndex
should replaceInt64Index
,UInt64Index
andFloat64Index
in Pandas 2.0 and those index types should therefore be removed by 2.0. They should therefore get a deprecation warning in timely manner, before 2.0.Internally there are some use of especially
Int64Index
, for example in groupby and join ops. That could be replaced withNumricIndex
straight away, giving a performance benefit in cases where casting to higher dtypes can be avoided.Outside of scope of this PR
I do not intend to add support for nullable numeric dtypes in this PR. This is partly for technical reasons (there isn't any indexing engines for those types currently) but also to limit the scope of this PR. I'm not opposed to add support for those dtypes later, but I think adding them here would be too much for this PR. Also, it might be more sensible to have extension types in a different index class than numpy types, so probably better to look into that seperately.
Actual deprecation of
Int64Index
,UInt64Index
andFloat64Index
should also be in a later PR IMO. If Pandas 2.0 is far in the future, it may be better holding off deprecating them until we see a v2.0 in the horizon instead of having a very long deprecation cycle for a relatively often used part of Pandas.