Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Python] utf8_slice_codeunits produces invalid unicode sequence #36311

Closed
wirable23 opened this issue Jun 26, 2023 · 8 comments · Fixed by #36575
Closed

[C++][Python] utf8_slice_codeunits produces invalid unicode sequence #36311

wirable23 opened this issue Jun 26, 2023 · 8 comments · Fixed by #36575

Comments

@wirable23
Copy link

Describe the bug, including details regarding any error messages, version, and platform.

pa.compute.utf8_slice_codeunits(f"AB{chr(127917)}C{chr(127917)}ㇱD", start=2, stop=None, step=4).as_py()

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow\scalar.pxi", line 632, in pyarrow.lib.StringScalar.as_py
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 4: invalid start byte
>>>

The result of utf8_slice_codeunits produced an invalid unicode sequence.

Component(s)

Python

@pitrou
Copy link
Member

pitrou commented Jun 27, 2023

It gets even better in debug mode:

>>> import pyarrow.compute as pc
>>> import pyarrow as pa
>>> 
>>> pa.compute.utf8_slice_codeunits(f"AB{chr(127917)}C{chr(127917)}ㇱD", start=2, stop=None, step=4)
/home/antoine/arrow/dev/cpp/src/arrow/compute/kernels/scalar_string_internal.h:109:  Check failed: (output_ncodeunits) <= (max_output_ncodeunits) 

@rok Would you like to take a look?

@pitrou
Copy link
Member

pitrou commented Jun 27, 2023

Even a trivial identity slice fails:

>>> s = f"AB{chr(127917)}C{chr(127917)}ㇱD"
>>> a = pa.array([s])
>>> pc.utf8_slice_codeunits(a, start=0, stop=None)
Traceback (most recent call last):
  Cell In[11], line 1
    pc.utf8_slice_codeunits(a, start=0, stop=None)
  File ~/arrow/dev/python/pyarrow/compute.py:261 in wrapper
    return func.call(args, options, memory_pool)
  File pyarrow/_compute.pyx:367 in pyarrow._compute.Function.call
    result = GetResultValue(
  File pyarrow/error.pxi:144 in pyarrow.lib.pyarrow_internal_check_status
    return check_status(status)
  File pyarrow/error.pxi:100 in pyarrow.lib.check_status
    raise ArrowInvalid(message)
ArrowInvalid: Negative buffer resize: -4
/home/antoine/arrow/dev/cpp/src/arrow/memory_pool.cc:931  buffer->Resize(size)
/home/antoine/arrow/dev/cpp/src/arrow/compute/kernels/scalar_string_internal.h:88  ctx->Allocate(max_output_ncodeunits)
/home/antoine/arrow/dev/cpp/src/arrow/compute/exec.cc:920  kernel_->exec(kernel_ctx_, input, &output)
/home/antoine/arrow/dev/cpp/src/arrow/compute/function.cc:276  executor->Execute(input, &listener)

Note that the input is valid:

>>> a.validate(full=True)
>>>

@pitrou
Copy link
Member

pitrou commented Jun 27, 2023

@felipecrv @benibus Perhaps one of you is interested as well.

@benibus
Copy link
Collaborator

benibus commented Jun 27, 2023

I can try taking a look ...might be nice to do some unicode debugging again

@rok
Copy link
Member

rok commented Jun 30, 2023

I can take a look but need to complete something else first. Will ping if I start work.

@raulcd raulcd changed the title utf8_slice_codeunits produces invalid unicode sequence [C++][Python] utf8_slice_codeunits produces invalid unicode sequence Jul 4, 2023
@raulcd
Copy link
Member

raulcd commented Jul 7, 2023

@pitrou should this be a blocker?

@pitrou
Copy link
Member

pitrou commented Jul 7, 2023

I don't think it needs to.

@benibus
Copy link
Collaborator

benibus commented Jul 7, 2023

Fix for this is in progress, but it seems the default value for SliceOptions::stop (which is INT64_MAX) isn't handled in a few calculations, resulting in overflows.

Seems kinda crazy that this hasn't come up before actually... since the specific uinicode sequence doesn't appear to be relevant.

pitrou pushed a commit that referenced this issue Jul 11, 2023
### Rationale for this change

The default value for the `SliceOptions::stop` is `INT64_MAX`, which isn't considered in several internal calculations - resulting in integer overflows and unexpected behavior when `stop` isn't provided.

Also note that running the included tests without the fixes should result in ubsan errors (it did for me, at least).

### What changes are included in this PR?

- Adds some logic to `SliceCodunitsTransform` that handles potential overflows
- Adds tests for cases where the `start` param is positive/negative and `stop` is the maximum value

**Update**
Discovered that `utf8_slice_codeunits` deviates from Python array behavior when `stop=None` and `step < 0`, so further changes were made:
- Handles `INT64_MIN` for `SliceOptions::stop` on C++ side, adds more tests.
- Updates Python bindings for `SliceOptions` so that the default value when `stop=None` (`sys.maxsize`) is negated when `step < 0`
- Adds `None` as a possible `stop` value in Python tests

### Are these changes tested?

Yes (tests are included)

### Are there any user-facing changes?

In theory, altering the behavior of `utf8_slice_codepoints` when `stop=None` and `step < 0` could be considered a breaking change. That being said, the current implementation produces incorrect results whenever `None` is even used, so it probably isn't one in practice...

* Closes: #36311

Authored-by: benibus <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
raulcd pushed a commit that referenced this issue Jul 11, 2023
### Rationale for this change

The default value for the `SliceOptions::stop` is `INT64_MAX`, which isn't considered in several internal calculations - resulting in integer overflows and unexpected behavior when `stop` isn't provided.

Also note that running the included tests without the fixes should result in ubsan errors (it did for me, at least).

### What changes are included in this PR?

- Adds some logic to `SliceCodunitsTransform` that handles potential overflows
- Adds tests for cases where the `start` param is positive/negative and `stop` is the maximum value

**Update**
Discovered that `utf8_slice_codeunits` deviates from Python array behavior when `stop=None` and `step < 0`, so further changes were made:
- Handles `INT64_MIN` for `SliceOptions::stop` on C++ side, adds more tests.
- Updates Python bindings for `SliceOptions` so that the default value when `stop=None` (`sys.maxsize`) is negated when `step < 0`
- Adds `None` as a possible `stop` value in Python tests

### Are these changes tested?

Yes (tests are included)

### Are there any user-facing changes?

In theory, altering the behavior of `utf8_slice_codepoints` when `stop=None` and `step < 0` could be considered a breaking change. That being said, the current implementation produces incorrect results whenever `None` is even used, so it probably isn't one in practice...

* Closes: #36311

Authored-by: benibus <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants