[C++][Python] utf8_slice_codeunits produces invalid unicode sequence #36311

wirable23 · 2023-06-26T21:54:40Z

Describe the bug, including details regarding any error messages, version, and platform.

pa.compute.utf8_slice_codeunits(f"AB{chr(127917)}C{chr(127917)}ㇱD", start=2, stop=None, step=4).as_py()

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow\scalar.pxi", line 632, in pyarrow.lib.StringScalar.as_py
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 4: invalid start byte
>>>

The result of utf8_slice_codeunits produced an invalid unicode sequence.

Component(s)

Python

The text was updated successfully, but these errors were encountered:

pitrou · 2023-06-27T18:01:01Z

It gets even better in debug mode:

>>> import pyarrow.compute as pc
>>> import pyarrow as pa
>>> 
>>> pa.compute.utf8_slice_codeunits(f"AB{chr(127917)}C{chr(127917)}ㇱD", start=2, stop=None, step=4)
/home/antoine/arrow/dev/cpp/src/arrow/compute/kernels/scalar_string_internal.h:109:  Check failed: (output_ncodeunits) <= (max_output_ncodeunits)

@rok Would you like to take a look?

pitrou · 2023-06-27T18:03:29Z

Even a trivial identity slice fails:

>>> s = f"AB{chr(127917)}C{chr(127917)}ㇱD"
>>> a = pa.array([s])
>>> pc.utf8_slice_codeunits(a, start=0, stop=None)
Traceback (most recent call last):
  Cell In[11], line 1
    pc.utf8_slice_codeunits(a, start=0, stop=None)
  File ~/arrow/dev/python/pyarrow/compute.py:261 in wrapper
    return func.call(args, options, memory_pool)
  File pyarrow/_compute.pyx:367 in pyarrow._compute.Function.call
    result = GetResultValue(
  File pyarrow/error.pxi:144 in pyarrow.lib.pyarrow_internal_check_status
    return check_status(status)
  File pyarrow/error.pxi:100 in pyarrow.lib.check_status
    raise ArrowInvalid(message)
ArrowInvalid: Negative buffer resize: -4
/home/antoine/arrow/dev/cpp/src/arrow/memory_pool.cc:931  buffer->Resize(size)
/home/antoine/arrow/dev/cpp/src/arrow/compute/kernels/scalar_string_internal.h:88  ctx->Allocate(max_output_ncodeunits)
/home/antoine/arrow/dev/cpp/src/arrow/compute/exec.cc:920  kernel_->exec(kernel_ctx_, input, &output)
/home/antoine/arrow/dev/cpp/src/arrow/compute/function.cc:276  executor->Execute(input, &listener)

Note that the input is valid:

>>> a.validate(full=True)
>>>

pitrou · 2023-06-27T18:04:04Z

@felipecrv @benibus Perhaps one of you is interested as well.

benibus · 2023-06-27T19:11:03Z

I can try taking a look ...might be nice to do some unicode debugging again

rok · 2023-06-30T11:51:26Z

I can take a look but need to complete something else first. Will ping if I start work.

raulcd · 2023-07-07T09:13:36Z

@pitrou should this be a blocker?

pitrou · 2023-07-07T14:05:18Z

I don't think it needs to.

benibus · 2023-07-07T17:21:41Z

Fix for this is in progress, but it seems the default value for SliceOptions::stop (which is INT64_MAX) isn't handled in a few calculations, resulting in overflows.

Seems kinda crazy that this hasn't come up before actually... since the specific uinicode sequence doesn't appear to be relevant.

### Rationale for this change The default value for the `SliceOptions::stop` is `INT64_MAX`, which isn't considered in several internal calculations - resulting in integer overflows and unexpected behavior when `stop` isn't provided. Also note that running the included tests without the fixes should result in ubsan errors (it did for me, at least). ### What changes are included in this PR? - Adds some logic to `SliceCodunitsTransform` that handles potential overflows - Adds tests for cases where the `start` param is positive/negative and `stop` is the maximum value **Update** Discovered that `utf8_slice_codeunits` deviates from Python array behavior when `stop=None` and `step < 0`, so further changes were made: - Handles `INT64_MIN` for `SliceOptions::stop` on C++ side, adds more tests. - Updates Python bindings for `SliceOptions` so that the default value when `stop=None` (`sys.maxsize`) is negated when `step < 0` - Adds `None` as a possible `stop` value in Python tests ### Are these changes tested? Yes (tests are included) ### Are there any user-facing changes? In theory, altering the behavior of `utf8_slice_codepoints` when `stop=None` and `step < 0` could be considered a breaking change. That being said, the current implementation produces incorrect results whenever `None` is even used, so it probably isn't one in practice... * Closes: #36311 Authored-by: benibus <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>

wirable23 added the Type: bug label Jun 26, 2023

github-actions bot added the Component: Python label Jun 26, 2023

wirable23 mentioned this issue Jun 26, 2023

BUG: Arrow backed series can't be displayed after string slicing it pandas-dev/pandas#53833

Closed

3 tasks

pitrou added the Component: C++ label Jun 27, 2023

rtpsw added a commit to rtpsw/arrow that referenced this issue Jun 28, 2023

apacheGH-36311: [C++][CI] Sporadic errors in

e072ba6

github-actions bot mentioned this issue Jun 28, 2023

GH-36331: [C++][CI] Sporadic errors in AsofJoinTest #36356

Merged

pitrou assigned benibus Jun 28, 2023

pitrou added this to the 13.0.0 milestone Jun 29, 2023

raulcd changed the title ~~utf8_slice_codeunits produces invalid unicode sequence~~ [C++][Python] utf8_slice_codeunits produces invalid unicode sequence Jul 4, 2023

github-actions bot mentioned this issue Jul 9, 2023

GH-36311: [C++] Fix integer overflows in utf8_slice_codeunits #36575

Merged

pitrou closed this as completed in #36575 Jul 11, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[C++][Python] utf8_slice_codeunits produces invalid unicode sequence #36311

[C++][Python] utf8_slice_codeunits produces invalid unicode sequence #36311

wirable23 commented Jun 26, 2023

pitrou commented Jun 27, 2023

pitrou commented Jun 27, 2023 •

edited

Loading

pitrou commented Jun 27, 2023

benibus commented Jun 27, 2023

rok commented Jun 30, 2023

raulcd commented Jul 7, 2023

pitrou commented Jul 7, 2023

benibus commented Jul 7, 2023

[C++][Python] utf8_slice_codeunits produces invalid unicode sequence #36311

[C++][Python] utf8_slice_codeunits produces invalid unicode sequence #36311

Comments

wirable23 commented Jun 26, 2023

Describe the bug, including details regarding any error messages, version, and platform.

Component(s)

pitrou commented Jun 27, 2023

pitrou commented Jun 27, 2023 • edited Loading

pitrou commented Jun 27, 2023

benibus commented Jun 27, 2023

rok commented Jun 30, 2023

raulcd commented Jul 7, 2023

pitrou commented Jul 7, 2023

benibus commented Jul 7, 2023

pitrou commented Jun 27, 2023 •

edited

Loading