-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API: ensure IntervalIndex.left/right are 64bit if numeric, part II #50195
Conversation
76e7938
to
0f96b05
Compare
0f96b05
to
d38b66c
Compare
pandas/core/arrays/interval.py
Outdated
@@ -284,7 +306,10 @@ def _simple_new( | |||
from pandas.core.indexes.base import ensure_index | |||
|
|||
left = ensure_index(left, copy=copy) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to handle this somehow in the creation of the IntervalIndex?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to remove ensure_index
in various ways, but failed. It looks like there is some dtype issues that need to passed through an Index to be solved, but I didn't manage to untangle it, unfortunately.
There is a lot going on in IntervalArray._simple_new
. I've looked into moving all the validation/dtype wrangling there into a separate function. That would mean that _simple_new
would become much more simple and it would much simpler to instantiate an IntervalArray
, when we can be sure the input data is correct. I'll push a PR about this shortly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did change this in the newest version, could you take a look?
assert result is key | ||
if not isinstance(result, NumericIndex): | ||
assert result is key | ||
else: | ||
expected = NumericIndex(key) | ||
tm.assert_index_equal(result, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rather than adding logic to the test (which can hide bugs), is it possible to either:
- include the expected result in the parametrisation
- OR split the test out into two separate ones, one of which uses
assert result is key
and the othertm.assert_index_equal(result, expected)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, doesn't this test also pass on upstream/main
? Is there a way to write it such that it fails there, but passes here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't get time to address this question right now, sorry. I'll get back to this tonight.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I`ve updated the PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for updating - any chance you could address the logic in the test comment too please?
IIRC when I tried executing this, it was just one of the make_key
inputs which required a different assertion
If so, then can the assertion either be included in the parametrisation, or the test be split into two?
For reference, this advice comes from: https://testing.googleblog.com/2014/07/testing-on-toilet-dont-put-logic-in.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah good points in the article, very nice to have it articulated.
I've changed the PR. I started by separating the test into two tests split by type of make_key
. However, I didn't like having two very similar tests and I didn't like having lambdas in parametrization, (because lambdas are difficult to introspect, and having several lambdas means it's difficult to see which test you're looking at when debugging). so I've made a new version.
I prefer the newest version (avoiding lambdas, clear inputs into the test function), but will await your comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, looks better, thanks for updating
pandas/core/arrays/interval.py
Outdated
@@ -284,7 +306,10 @@ def _simple_new( | |||
from pandas.core.indexes.base import ensure_index | |||
|
|||
left = ensure_index(left, copy=copy) | |||
left = maybe_convert_numeric_to_64bit(left) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just for my understanding, what's an example of where this makes a difference? the test you've modifed passes even without this change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I took the wrong approach here originally.
The issue that this was supposed to solve is that on 32-bit systems e.g. IntervalArray.from_breaks([1, 2, 3])
should give an array with dtype interval[int64, right]
to align with the convention in pandas that lists in constructors should interpreted as 64-bit (e.g. Series([1, 2, 3])
and Index([1, 2, 3])
both give int64 dtype even on 32-bit systems). Previously, (after #49560) giving lists to IntervalArray
gave interval[int32, right]
. This affected some tests in #49560 which is the reason I have taken this up.
In the newest version I moved this logic to a _maybe_convert_platform_interval
, which IMO should be the better location for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, this issue on 32-bit systems is only with integer dtypes as e.g. np.asarray([1.5])
will always have float64 dtype. So I've made this simpler in the newest version by just checking for integer dtype and converting to int64
if needed.
pandas/core/arrays/interval.py
Outdated
if not is_array_like(arr): | ||
return arr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just for my understanding, what's an example of where this makes a difference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was just a short-circuit, so the functions breaks early, if the value can't possibly be array-like, it made no funcional difference. This will overall probably not be an improvement as the arr
in the current version now can't be non-array plus the function is typed, so I can remove it again.
I moved the function to core.dtypes.cast
and renamed it maybe_upcast_numeric_to_64bit
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah would remove this if not necessary. This would get me guessing how to get here if I would want to make a change in a couple of weeks
d38b66c
to
4117441
Compare
Sorry for the late response, I had some things I had to attend to in the weekend. I respond to you comments individually above. I did look into this again and agree that some of the suggestions in my original PR could be improved upon (especially the changes to |
The failed check is unrelated. |
5123e34
to
f58478b
Compare
Rebased to make the CI run again. No other changes have been made. |
Ping. As far as I see all comments have been addressed? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for sticking with this
My only comment is about
if not is_array_like(arr):
return arr
, if this isn't covered by any tests, then TBH I'd prefer to keep it out
Other than that, I don't have any objections, but I'm not familiar enough with this part of the codebase to merge, so I'll hand over to @phofl
👍 I've remove that code section. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for updating! No objections - approving to remove my 'requested changes', but handing over to others with more expertise in this before merging
bd90f1b
to
872c199
Compare
pandas/core/arrays/interval.py
Outdated
def maybe_convert_numeric_to_64bit(arr: NumpyIndexT) -> NumpyIndexT: | ||
# IntervalTree only supports 64 bit numpy array | ||
dtype = arr.dtype | ||
if not np.issubclass_(dtype.type, np.number): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this different from is_numeric_dtype?
872c199
to
880d51f
Compare
Ping. I've rebased because this has been standing still for a bit. |
thx @topper-123 |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.Follow-up to #50130. It turned out that
IntervalArray.from_array
could get around the 64bit requirement, so we fix that by movingmaybe_convert_numeric_to_64bit
and and using it in theIntervalArray
constructor also.Also return the 64bit index in
IntervalIndex._maybe_convert_i8
, previously we returnedoriginal
, which was the not-64bit-converted one...(this changes a test, but it’s just for an internal method).