-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DEPR: Remove int64 uint64 float64 index part 1 #49560
DEPR: Remove int64 uint64 float64 index part 1 #49560
Conversation
844ab1b
to
5ff5584
Compare
0dc2660
to
b332214
Compare
527be72
to
df7afae
Compare
Out of curiosity: the approach I would have taken would have been to start with changing |
Yes, possibly. My main problem is that on my computer intp means int64, while on some machines intp means int32. Also there have been some smaller bugs that I've found working on this that slowed me down. I try again tonight or tomorrow and I will try the |
73bde44
to
99c82ae
Compare
b421767
to
7e11517
Compare
f81f0df
to
25c71c7
Compare
9a64c7f
to
8020af4
Compare
Alright, it's all green finally! The failure in This is quite a big PR, with lot of dtypes changes stemming from the fact that in this PR |
6655dfe
to
4c337f6
Compare
Updated after #50195 was merged + rebased. The remaining failure looks unrelated. |
if self._is_backward_compat_public_numeric_index: | ||
# this block is needed so e.g. NumericIndex[int8].astype("int32") returns | ||
# NumericIndex[int32] and not Int64Index with dtype int64. | ||
if not self._is_backward_compat_public_numeric_index and not isinstance( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so this chunk of code is going to be removed in a follow-up right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, It will be removed in the followup #50052.
@@ -212,7 +217,15 @@ def test_series_from_coo(self, dtype, dense_index): | |||
|
|||
A = scipy.sparse.eye(3, format="coo", dtype=dtype) | |||
result = pd.Series.sparse.from_coo(A, dense_index=dense_index) | |||
index = pd.MultiIndex.from_tuples([(0, 0), (1, 1), (2, 2)]) | |||
|
|||
index_dtype = np.int64 if dense_index else np.int32 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it clear we want these to be different? if not, can you add a TODO comment to that effect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, I think they should be the same, but we need to decide which dtype should be used.
We have for both 32-bit and 64-bit systems:
>>> import scipy.sparse
>>> A = scipy.sparse.eye(3, format="coo", dtype=dtype)
>>> A.row, A.col
(array([0, 1, 2], dtype=int32), array([0, 1, 2], dtype=int32))
I.e. scipy.sparse always uses int32 for index, also on 64-bit systems. The pandas way OTOH is to default to 64-bit dtypes, if an index dtype hasn't been specified, also on 32-bit systems.
So the question is if we should use the pandas or the scipy convention for index dtype when calling pd.Series.sparse.from_coo
.
My intuition is to follow the scipy way is this instance (i.e. use 32-bit), as pd.Series.sparse.from_coo
is about converting an existing scipy data structure, so we should follow that data structure's conventions, where possible. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
id add a TODO comment saying roughly what you said here and then punt on it
@@ -445,7 +445,8 @@ def DecimalArray__my_sum(self): | |||
result = df.groupby("id")["decimals"].agg(lambda x: x.values.my_sum()) | |||
tm.assert_series_equal(result, expected, check_names=False) | |||
s = pd.Series(DecimalArray(data)) | |||
result = s.groupby(np.array([0, 0, 0, 1, 1])).agg(lambda x: x.values.my_sum()) | |||
grouper = np.array([0, 0, 0, 1, 1], dtype=np.int64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think i asked elsewhere: specifying this is benign, just affects result.index.dtype or something like that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this just affects result.index.dtype
.
@@ -388,7 +388,7 @@ def test_to_csv_dup_cols(self, nrows): | |||
|
|||
@pytest.mark.slow | |||
def test_to_csv_empty(self): | |||
df = DataFrame(index=np.arange(10)) | |||
df = DataFrame(index=np.arange(10, dtype=np.int64)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is bc when we do to_csv followed by read_csv it doesnt roundtrip int32?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, read_csv can't know the dtypes of its read-in columns, so it gives dtype int64 to integer columns and indexes.
# add / sum | ||
result = df.groupby(pd.cut(df["a"], grps), observed=observed)._cython_agg_general( | ||
"sum", alt=None, numeric_only=True | ||
) | ||
intervals = pd.interval_range(0, 20, freq=5) | ||
intervals = pd.IntervalIndex.from_breaks(np.arange(0, 21, 5)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i thought IntervalArray converted 32bit ints to 64?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is true. This changed line is from before IntervalIndex was ensured to be 64-bit, so this change is not needed anymore; the old and the new line are equivalent now. I'll revert this.
pandas/tests/window/test_groupby.py
Outdated
# https://github.com/twosigma/pandas/issues/53 | ||
dtype = np.dtype(any_int_numpy_dtype).type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we call this something other than dtype, since it isnt one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I can call it typ
.
pandas/tests/groupby/test_nunique.py
Outdated
@@ -172,7 +172,7 @@ def test_nunique_preserves_column_level_names(): | |||
# GH 23222 | |||
test = DataFrame([1, 2, 2], columns=pd.Index(["A"], name="level_0")) | |||
result = test.groupby([0, 0, 0]).nunique() | |||
expected = DataFrame([2], columns=test.columns) | |||
expected = DataFrame([2], index=np.array([0], dtype=np.int_), columns=test.columns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the np.int_ here needed? ive been assuming that the index=np.array([0])
ive been seeing has been there to ensure the indexes get np.int_ dtype.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, not needed anymore. This is from before #49815 was merged. I'll update it.
I'll wait for your response to the question about the |
pandas/tests/reshape/test_cut.py
Outdated
@@ -87,7 +87,9 @@ def test_bins_from_interval_index_doc_example(): | |||
# Make sure we preserve the bins. | |||
ages = np.array([10, 15, 13, 12, 23, 25, 28, 59, 60]) | |||
c = cut(ages, bins=[0, 18, 35, 70]) | |||
expected = IntervalIndex.from_tuples([(0, 18), (18, 35), (35, 70)]) | |||
expected = IntervalIndex.from_arrays( | |||
left=np.array([0, 18, 35]), right=np.array([18, 35, 70]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not a big deal but why is this necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, not needed after #50195. I'll revert.
[ | ||
([0, 1, 2, 3], [0, 0, 1, 1]), | ||
([0], [0]), | ||
*[ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i find this pattern difficult to read. is there a viable alternative?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe add a new param @pytest.mark.parametrize("dtype", [None] + tm.ALL_INT_NUMPY_DTYPES)
and then
if dtype is not None:
data = np.array(data, dtype=dtype)
groups = np.array(groups, dtype=dtype)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I can do that.
(timedelta_range("1 day", periods=10), "<m8[ns]"), | ||
] | ||
) | ||
def breaks_and_expected_subtype(self, request): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this the same as the one above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be removed entirely now. I'll update.
@@ -29,7 +29,8 @@ class TestIntervalRange: | |||
@pytest.mark.parametrize("freq, periods", [(1, 100), (2.5, 40), (5, 20), (25, 4)]) | |||
def test_constructor_numeric(self, closed, name, freq, periods): | |||
start, end = 0, 100 | |||
breaks = np.arange(101, step=freq) | |||
dtype = np.int64 if is_integer(freq) else np.float64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as elsewhere, isnt this done automatically now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, after #50195... I'll revert.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some questions, no blockers. LGTM cc @mroeschke
Updated. All comments have been addressed AFAIKT. |
Awesome, thanks @topper-123 |
this is huge, thanks @topper-123 |
This PR makes progress towards removing Int64Index, UInt64Index & FloatIndex from the code base, see #42717.
In this PR I make instantiation of
Index
with numpy numeric dtypes return aNumericIndex
with the given dtype rather than converting to the old 64 bit only numeric index types, making all numeric dtypes available for indexes.This is just the first part of the changes planned. In follow-ups PRs I will:
NumericIndex
into the baseIndex
and removeNumericIndex
related PRs
A part of the problems I had in #49494 was a few corner case bugs. They're pulled out into seperate PRs (#49536 & #49540). They're present in this PR at present, but will be removed after those seperate PRs have been merged.
Notable changes from this PR:
As mentioned in #49494, one notable change from the PR is that because we now have all the numeric dtypes available for indexes, indexes created from DateTimeIndex (day, year etc.) will now be in int32, where previously they were forced to int64 (because that was the only index integer dtype available).
Using int32 is the more correct way, because the underlying
DateTimeArray
returns 32bit arrays. An example of this:The above will be included in the doc changes, when I do the docs
Progress towards #42717. See also #41272