-
-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: numexpr 2.85 changed integer overflow handling, failing a test #54546
Comments
This is plausibly the result of numexpr fixing pydata/numexpr#434 : int ** int in numexpr used to explicitly fail, making us fall back to the non-numexpr method, but no longer does. It might be possible to work around this by not using numexpr for **, but presumably with a performance cost. |
pandas doesn't have an obviously documented policy on integer overflow, and numpy's policy appears to be the same as C's (i.e. signed integer overflow is undefined). This suggests treating this as undefined / not-a-bug and changing the test's input to something that doesn't overflow (suggestion, not tested: However, this test was originally added in #31296 and referencing a similar non-numexpr one in #31271 testing that column-by-column and whole-DataFrame operations gave the same result, specifically for division by zero. Is that an implied policy that column-by-column and whole-DataFrame results must always be the same (and hence that numexpr and non-numexpr results must be the same, as it is easily possible for the whole frame to be big enough to default to numexpr while a single column is not)? No reason was stated there for the power test using almost-all-overflowing input; it may have been copied from the division test. |
Yes, these should always match.
No idea off the top of my head.
I don't think this is non-obviously documented either, so "policy" is probably too strong a word. In general we'd rather raise than silently give incorrect results. What constitutes "incorrect" depends on if the user is expected modular arithmetic. My rule of thumb is that with signed int64 they are not, and with other integer dtypes they plausibly could be. That may just be me though. |
The C (and numpy?) policy is that unsigned ints must wrap (modular arithmetic) but signed ints (of any width) can do anything.
Does that mean that while signed ints can do anything on overflow, they must do the same thing in both cases, and hence that we can't use numexpr where its overflow behaviour doesn't match plain pandas? If so, we can't use numexpr for int ** int. (This didn't come up before because numexpr int ** int used to explicitly fail (in all cases) due to bug pydata/numexpr#434.) I don't know whether the other operators currently follow this rule: this test only tests overflowing input for pow, and I don't know what either numexpr or plain pandas do on overflow of other operations. |
The non-pow operators (at least add, sub, mul) have wraparound overflow in numexpr and hence don't have this issue - it seems only pow gives -2**63 on overflow. |
It looks like the reason that pow is different from the other numexpr operators is that it's implemented by casting to double and back. The first part of this is an attempt to stop using it, but it does both more and less than we actually want: it disables numexpr pow in all dtypes (not just ints), but only in plain operators (not .eval()). |
@mroeschke i dont think we have a documented policy on anything discussed in this issue, do we? The easiest solution on our end would be to not use numexpr for any of the affected methods, but that seems like leaving a lot of speedup on the table |
Yeah I think we have a hidden heuristic (#53781) when numexpr is used, but if integer overflow changed in theory it would affect all methods? |
@mroeschke @jbrockmendel as noted above, the non-pow methods (that I've tested) have the same overflow behaviour in numexpr and plain pandas, so aren't affected. Also, what changed isn't numexpr's overflow behaviour, but whether numexpr int**int works at all - it used to always explicitly fail (due to pydata/numexpr#434), causing pandas to fall back to the non-numexpr method. Given the lack of a policy, I consider deciding to ignore/accept this (i.e. change the test to a non-overflowing input) a reasonable option, but others might disagree. |
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
With numexpr 2.8.5 it instead returns -2**63, and hence the test fails.
=================================== FAILURES ===================================
483s _____ TestFrameFlexArithmetic.test_floordiv_axis0_numexpr_path[python-pow] _____
483s
483s self = <pandas.tests.frame.test_arithmetic.TestFrameFlexArithmetic object at 0x7fa71aebc1d0>
483s opname = 'pow'
483s
483s @pytest.mark.skipif(not NUMEXPR_INSTALLED, reason="numexpr not installed")
483s @pytest.mark.parametrize("opname", ["floordiv", "pow"])
483s def test_floordiv_axis0_numexpr_path(self, opname):
483s # case that goes through numexpr and has to fall back to masked_arith_op
483s op = getattr(operator, opname)
483s
483s arr = np.arange(_MIN_ELEMENTS + 100).reshape(_MIN_ELEMENTS // 100 + 1, -1) * 100
483s df = DataFrame(arr)
483s df["C"] = 1.0
483s
483s ser = df[0]
483s result = getattr(df, opname)(ser, axis=0)
483s
483s expected = DataFrame({col: op(df[col], ser) for col in df.columns})
483s > tm.assert_frame_equal(result, expected)
483s
483s /usr/lib/python3/dist-packages/pandas/tests/frame/test_arithmetic.py:510:
483s _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
483s
483s > ???
483s
483s pandas/_libs/testing.pyx:52:
483s _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
483s
483s > ???
483s E AssertionError: DataFrame.iloc[:, 0] (column name="0") are different
483s E
483s E DataFrame.iloc[:, 0] (column name="0") values are different (99.99 %)
483s E [index]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...]
483s E [left]: [1, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, -9223372036854775808, ...]
483s E [right]: [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]
483s
Expected Behavior
The test should pass.
I don't know whether pandas has documented integer overflow behaviour, but if it does it should follow it.
Installed Versions
Happens with numexpr 2.8.5 and not with 2.8.4. (This is not #54449, though I do also see that bug - that's an explicit exception, this is a changed answer.)
Seen in both pandas 1.5.3 and pandas 2.0.3.
The text was updated successfully, but these errors were encountered: