[Python] pyarrow.compute.utf8_slice_codeunits fails when stop=None #14991

Closed
mroeschke opened this issue Dec 16, 2022 · 2 comments

@mroeschke
Contributor

Describe the bug, including details regarding any error messages, version, and platform.

In [8]: import pyarrow as pa; import pyarrow.compute as pc

In [9]: pa.__version__
Out[9]: '10.0.1'

In [10]: arr = pa.array(["abcd"])

In [11]: pc.utf8_slice_codeunits(arr, 0)
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Cell In[11], line 1
----> 1 pc.utf8_slice_codeunits(arr, 0)

File ~/opt/miniconda3/envs/pyarrow_10/lib/python3.10/site-packages/pyarrow/compute.py:255, in _make_generic_wrapper.<locals>.wrapper(memory_pool, options, *args, **kwargs)
    253 if args and isinstance(args[0], Expression):
    254     return Expression._call(func_name, list(args), options)
--> 255 return func.call(args, options, memory_pool)

File ~/opt/miniconda3/envs/pyarrow_10/lib/python3.10/site-packages/pyarrow/_compute.pyx:355, in pyarrow._compute.Function.call()

File ~/opt/miniconda3/envs/pyarrow_10/lib/python3.10/site-packages/pyarrow/error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

File ~/opt/miniconda3/envs/pyarrow_10/lib/python3.10/site-packages/pyarrow/error.pxi:100, in pyarrow.lib.check_status()

ArrowInvalid: Negative buffer resize: -4

In [12]: "abcd"[0:]
Out[12]: 'abcd'

The docs for the stop parameter say:

If given, index to stop slicing at (exclusive). If not given, slicing will stop at the end.

So I would have expected In [11] to be equivalent to In [12].
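
For illustration, the expected behavior can be written as a small pure-Python reference model. This is only a sketch of the documented semantics for ASCII input such as the example above (where code units and code points coincide), not how pyarrow implements the kernel; `slice_reference` is a hypothetical name.

```python
def slice_reference(values, start, stop=None, step=1):
    """Slice each string like Python does; None entries stay None.

    With stop=None the slice runs to the end of each string, which is
    what the stop parameter docs describe.
    """
    return [None if v is None else v[start:stop:step] for v in values]


# Omitting stop should behave like "abcd"[0:]
print(slice_reference(["abcd"], 0))
```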

Component(s)

Python

@AlenkaF AlenkaF changed the title BUG: pyarrow.compute.utf8_slice_codeunits fails when stop=None [Python] pyarrow.compute.utf8_slice_codeunits fails when stop=None Dec 19, 2022
@jorisvandenbossche
Member

Using my development version to get a bit more informative traceback:

ArrowInvalid: Negative buffer resize: -4
/home/joris/scipy/repos/arrow/cpp/src/arrow/memory_pool.cc:931  buffer->Resize(size)
/home/joris/scipy/repos/arrow/cpp/src/arrow/compute/kernels/scalar_string_internal.h:88  ctx->Allocate(max_output_ncodeunits)
/home/joris/scipy/repos/arrow/cpp/src/arrow/compute/exec.cc:920  kernel_->exec(kernel_ctx_, input, &output)
/home/joris/scipy/repos/arrow/cpp/src/arrow/compute/function.cc:276  executor->Execute(input, &listener)

So if max_output_ncodeunits is -4, we might have run into some integer overflow while calculating that value:

struct SliceCodeunitsTransform : StringSliceTransformBase {
  int64_t MaxCodeunits(int64_t ninputs, int64_t input_ncodeunits) override {
    const SliceOptions& opt = *this->options;
    if ((opt.start >= 0) != (opt.stop >= 0)) {
      // If start and stop don't have the same sign, we can't guess an upper bound
      // on the resulting slice lengths, so return a worst case estimate.
      return input_ncodeunits;
    }
    int64_t max_slice_codepoints = (opt.stop - opt.start + opt.step - 1) / opt.step;
    // The maximum UTF8 byte size of a codepoint is 4
    return std::min(input_ncodeunits,
                    4 * ninputs * std::max<int64_t>(0, max_slice_codepoints));
  }

Reproducing that logic in python:

In [11]: import sys; import numpy as np

In [12]: stop = np.int64(sys.maxsize)

In [13]: start = np.int64(0)

In [14]: step = np.int64(1)

In [19]: max_slice_codepoints = (stop - start + step - 1) // step
<ipython-input-19-0fd4a0c6e713>:1: RuntimeWarning: overflow encountered in scalar add
  max_slice_codepoints = (stop - start + step - 1) // step
<ipython-input-19-0fd4a0c6e713>:1: RuntimeWarning: overflow encountered in scalar subtract
  max_slice_codepoints = (stop - start + step - 1) // step

In [20]: max_slice_codepoints
Out[20]: 9223372036854775807

In [21]: 4 * max_slice_codepoints
<ipython-input-21-240e76cab6f7>:1: RuntimeWarning: overflow encountered in scalar multiply
  4 * max_slice_codepoints
Out[21]: -4

So indeed, multiple steps of this calculation overflow: both the addition/subtraction in the numerator and the final multiplication by 4. We will need to refactor this calculation a bit. There are utilities like MultiplyWithOverflow for overflow-safe arithmetic that could be used here.
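
As a rough sketch of what such a refactoring could look like, here is the same bound modeled in pure Python, once with 64-bit wrap-around and once with range checks. This is not the change made in Arrow: `wrap_int64`, `checked`, `trunc_div` and both `max_codeunits_*` functions are hypothetical names, the wrap-around only models what was observed in this report (signed overflow is undefined behavior in C++), and `step` is assumed to be nonzero.

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def wrap_int64(x):
    """Value a wrapping signed 64-bit integer would hold for x."""
    return (x + 2**63) % 2**64 - 2**63


def checked(x):
    """Return x if it fits in int64, otherwise raise OverflowError."""
    if not INT64_MIN <= x <= INT64_MAX:
        raise OverflowError(f"{x} does not fit in int64")
    return x


def trunc_div(a, b):
    """Integer division truncating toward zero, as in C++."""
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q


def max_codeunits_wrapping(ninputs, input_ncodeunits, start, stop, step):
    """Model of the original calculation with wrap-around at every step."""
    if (start >= 0) != (stop >= 0):
        return input_ncodeunits
    n = wrap_int64(stop - start)
    n = wrap_int64(n + step)
    n = wrap_int64(n - 1)
    max_slice_codepoints = trunc_div(n, step)
    bound = wrap_int64(4 * ninputs)
    bound = wrap_int64(bound * max(0, max_slice_codepoints))
    return min(input_ncodeunits, bound)


def max_codeunits_checked(ninputs, input_ncodeunits, start, stop, step):
    """Same bound, but fall back to the worst case if any step overflows."""
    if (start >= 0) != (stop >= 0):
        return input_ncodeunits
    try:
        numerator = checked(checked(stop - start) + (step - 1))
        max_slice_codepoints = max(0, trunc_div(numerator, step))
        bound = checked(checked(4 * ninputs) * max_slice_codepoints)
    except OverflowError:
        # A slice is never longer than its input, so this bound is always valid.
        return input_ncodeunits
    return min(input_ncodeunits, bound)


# One 4-byte input ("abcd"), start=0, stop defaulting to the int64 maximum
print(max_codeunits_wrapping(1, 4, 0, INT64_MAX, 1))
print(max_codeunits_checked(1, 4, 0, INT64_MAX, 1))
```

The fallback reuses the worst-case estimate the existing code already returns when start and stop have different signs, so an overflow costs at most some extra allocation.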

@pitrou
Member

pitrou commented Jul 11, 2023

This should have been fixed by #36575

@pitrou pitrou closed this as completed Jul 11, 2023