-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[C++][Python] utf8_slice_codeunits produces invalid unicode sequence #36311
Comments
It gets even better in debug mode: >>> import pyarrow.compute as pc
>>> import pyarrow as pa
>>>
>>> pa.compute.utf8_slice_codeunits(f"AB{chr(127917)}C{chr(127917)}ㇱD", start=2, stop=None, step=4)
/home/antoine/arrow/dev/cpp/src/arrow/compute/kernels/scalar_string_internal.h:109: Check failed: (output_ncodeunits) <= (max_output_ncodeunits) @rok Would you like to take a look? |
Even a trivial identity slice fails: >>> s = f"AB{chr(127917)}C{chr(127917)}ㇱD"
>>> a = pa.array([s])
>>> pc.utf8_slice_codeunits(a, start=0, stop=None)
Traceback (most recent call last):
Cell In[11], line 1
pc.utf8_slice_codeunits(a, start=0, stop=None)
File ~/arrow/dev/python/pyarrow/compute.py:261 in wrapper
return func.call(args, options, memory_pool)
File pyarrow/_compute.pyx:367 in pyarrow._compute.Function.call
result = GetResultValue(
File pyarrow/error.pxi:144 in pyarrow.lib.pyarrow_internal_check_status
return check_status(status)
File pyarrow/error.pxi:100 in pyarrow.lib.check_status
raise ArrowInvalid(message)
ArrowInvalid: Negative buffer resize: -4
/home/antoine/arrow/dev/cpp/src/arrow/memory_pool.cc:931 buffer->Resize(size)
/home/antoine/arrow/dev/cpp/src/arrow/compute/kernels/scalar_string_internal.h:88 ctx->Allocate(max_output_ncodeunits)
/home/antoine/arrow/dev/cpp/src/arrow/compute/exec.cc:920 kernel_->exec(kernel_ctx_, input, &output)
/home/antoine/arrow/dev/cpp/src/arrow/compute/function.cc:276 executor->Execute(input, &listener) Note that the input is valid: >>> a.validate(full=True)
>>> |
@felipecrv @benibus Perhaps one of you is interested as well. |
I can try taking a look ...might be nice to do some unicode debugging again |
I can take a look but need to complete something else first. Will ping if I start work. |
@pitrou should this be a blocker? |
I don't think it needs to. |
Fix for this is in progress, but it seems the default value for Seems kinda crazy that this hasn't come up before actually... since the specific uinicode sequence doesn't appear to be relevant. |
### Rationale for this change The default value for the `SliceOptions::stop` is `INT64_MAX`, which isn't considered in several internal calculations - resulting in integer overflows and unexpected behavior when `stop` isn't provided. Also note that running the included tests without the fixes should result in ubsan errors (it did for me, at least). ### What changes are included in this PR? - Adds some logic to `SliceCodunitsTransform` that handles potential overflows - Adds tests for cases where the `start` param is positive/negative and `stop` is the maximum value **Update** Discovered that `utf8_slice_codeunits` deviates from Python array behavior when `stop=None` and `step < 0`, so further changes were made: - Handles `INT64_MIN` for `SliceOptions::stop` on C++ side, adds more tests. - Updates Python bindings for `SliceOptions` so that the default value when `stop=None` (`sys.maxsize`) is negated when `step < 0` - Adds `None` as a possible `stop` value in Python tests ### Are these changes tested? Yes (tests are included) ### Are there any user-facing changes? In theory, altering the behavior of `utf8_slice_codepoints` when `stop=None` and `step < 0` could be considered a breaking change. That being said, the current implementation produces incorrect results whenever `None` is even used, so it probably isn't one in practice... * Closes: #36311 Authored-by: benibus <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
### Rationale for this change The default value for the `SliceOptions::stop` is `INT64_MAX`, which isn't considered in several internal calculations - resulting in integer overflows and unexpected behavior when `stop` isn't provided. Also note that running the included tests without the fixes should result in ubsan errors (it did for me, at least). ### What changes are included in this PR? - Adds some logic to `SliceCodunitsTransform` that handles potential overflows - Adds tests for cases where the `start` param is positive/negative and `stop` is the maximum value **Update** Discovered that `utf8_slice_codeunits` deviates from Python array behavior when `stop=None` and `step < 0`, so further changes were made: - Handles `INT64_MIN` for `SliceOptions::stop` on C++ side, adds more tests. - Updates Python bindings for `SliceOptions` so that the default value when `stop=None` (`sys.maxsize`) is negated when `step < 0` - Adds `None` as a possible `stop` value in Python tests ### Are these changes tested? Yes (tests are included) ### Are there any user-facing changes? In theory, altering the behavior of `utf8_slice_codepoints` when `stop=None` and `step < 0` could be considered a breaking change. That being said, the current implementation produces incorrect results whenever `None` is even used, so it probably isn't one in practice... * Closes: #36311 Authored-by: benibus <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
Describe the bug, including details regarding any error messages, version, and platform.
The result of utf8_slice_codeunits produced an invalid unicode sequence.
Component(s)
Python
The text was updated successfully, but these errors were encountered: