Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python] Passing table as argument to pyarrow.Table.filter causes segfault #37650

Closed
WillAyd opened this issue Sep 11, 2023 · 6 comments · Fixed by #38075
Closed

[Python] Passing table as argument to pyarrow.Table.filter causes segfault #37650

WillAyd opened this issue Sep 11, 2023 · 6 comments · Fixed by #38075

Comments

@WillAyd
Copy link
Contributor

WillAyd commented Sep 11, 2023

Describe the bug, including details regarding any error messages, version, and platform.

This is not correct code, but I did actually make this mistake and was puzzled for a while by the segfault (I thought I was passing a pyarrow.compute expression to filter)

>>> import pyarrow as pa
>>> n_legs = pa.array([2, 4, 5, 100])
>>> animals = pa.array(["Flamingo", "Horse", "Brittle stars", "Centipede"])
>>> names = ["n_legs", "animals"]
>>> tbl = pa.Table.from_arrays([n_legs, animals], names=names)
>>> tbl.filter(tbl)  # meant to pass an pyarrow.compute.filter here
Segmentation fault (core dumped)

A gdb backtrace shows the following steps:

#0  0x00007ffff5e5cfde in arrow::compute::internal::(anonymous namespace)::FilterMetaFunction::ExecuteImpl(std::vector<arrow::Datum, std::allocator<arrow::Datum> > const&, arrow::compute::FunctionOptions const*, arrow::compute::ExecContext*) const ()
   from /home/willayd/mambaforge/envs/scratchpad/lib/python3.11/site-packages/pyarrow/../../../libarrow.so.1200
#1  0x00007ffff5d45d22 in arrow::compute::MetaFunction::Execute(std::vector<arrow::Datum, std::allocator<arrow::Datum> > const&, arrow::compute::FunctionOptions const*, arrow::compute::ExecContext*) const ()
   from /home/willayd/mambaforge/envs/scratchpad/lib/python3.11/site-packages/pyarrow/../../../libarrow.so.1200
#2  0x00007fff8c91ea26 in __pyx_pf_7pyarrow_8_compute_8Function_6call(__pyx_obj_7pyarrow_8_compute_Function*, _object*, __pyx_obj_7pyarrow_8_compute_FunctionOptions*, __pyx_obj_7pyarrow_3lib_MemoryPool*, _object*) ()
   from /home/willayd/mambaforge/envs/scratchpad/lib/python3.11/site-packages/pyarrow/_compute.cpython-311-x86_64-linux-gnu.so
#3  0x00005555557614c6 in method_vectorcall_VARARGS_KEYWORDS (func=<optimized out>, args=<optimized out>, nargsf=<optimized out>, 
    kwnames=<optimized out>) at /usr/local/src/conda/python-3.11.0/Objects/descrobject.c:364
#4  0x000055555574bec1 in _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=<optimized out>, args=<optimized out>, 
    callable=0x7fff8ca66fc0, tstate=0x555555ae0d98 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.0/Include/internal/pycore_call.h:92
#5  PyObject_Vectorcall (callable=0x7fff8ca66fc0, args=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
    at /usr/local/src/conda/python-3.11.0/Objects/call.c:299
#6  0x000055555574169e in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>)
    at /usr/local/src/conda/python-3.11.0/Python/ceval.c:4772
#7  0x0000555555767e81 in _PyEval_EvalFrame (throwflag=0, frame=0x7ffff7fb5080, tstate=0x555555ae0d98 <_PyRuntime+166328>)
    at /usr/local/src/conda/python-3.11.0/Include/internal/pycore_ceval.h:73
#8  _PyEval_Vector (kwnames=<optimized out>, argcount=3, args=<optimized out>, locals=0x0, func=0x7fff8c824540, 
    tstate=0x555555ae0d98 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.0/Python/ceval.c:6428
#9  _PyFunction_Vectorcall (func=0x7fff8c824540, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
    at /usr/local/src/conda/python-3.11.0/Objects/call.c:393
#10 0x00007ffff6ea0411 in __Pyx_PyObject_Call(_object*, _object*, _object*) ()
   from /home/willayd/mambaforge/envs/scratchpad/lib/python3.11/site-packages/pyarrow/lib.cpython-311-x86_64-linux-gnu.so
#11 0x00007ffff6f34c0d in __pyx_pw_7pyarrow_3lib_5Table_19filter(_object*, _object*, _object*) ()
   from /home/willayd/mambaforge/envs/scratchpad/lib/python3.11/site-packages/pyarrow/lib.cpython-311-x86_64-linux-gnu.so

Component(s)

Python

@mapleFU
Copy link
Member

mapleFU commented Sep 11, 2023

So seems that it's arrow-12.0? In 13.0 the code below will also core dump. Don't know if thats a bug or expected.

>>> tbl.filter(tbl)

Firstly, seems that the filter should be used as below:

>>> filter = pa.array([True, False, True, False])
>>> tbl.filter(filter)

@WillAyd
Copy link
Contributor Author

WillAyd commented Sep 11, 2023

Yea sorry this was arrow-12.0 . And noted that the filter was used incorrectly here to begin with - I would have just expected a TypeError instead of a segfault

@jorisvandenbossche jorisvandenbossche changed the title Passing table as argument to pyarrow.Table.filter causes segfault [Python] Passing table as argument to pyarrow.Table.filter causes segfault Sep 11, 2023
@WillAyd
Copy link
Contributor Author

WillAyd commented Sep 18, 2023

From a source build I was able to trace the segfault to this line:

const bool filter_is_plain_bool = filter_type.id() == Type::BOOL;

It looks like filter_type on the preceding line is nullptr. It looks like the ! and == operators are overloaded on the DataType in a way that you can't simply compare this to filter_type; I'm not sure if that is a problem with the construction of the DataType or a bad argument making its way into this call in the first place, but hope that context is helpful

@jorisvandenbossche
Copy link
Member

We are passing a generic Datum there as filter (args[1]), and then later assuming this will be an (CHUNKED_)ARRAY kind of Datum, but without actually checking this first (eg in FilterTable we actually check it later).
So I think we should add a check on the kind of Datum and raise a error if it's not an array-like.

pitrou added a commit that referenced this issue Oct 10, 2023
### Rationale for this change
Prevent a segfault upon passing a non-(chunked_)array object into `Table.filter`. See #37650.

### What changes are included in this PR?
1. Check filter Datum kind to make sure that it is an array or a chunked array
2. test that attempting to filter a table with another table raises a not implemented error

### Are these changes tested?
In PyArrow, yes

### Are there any user-facing changes?
Raises an error if a non-array or non-chunked_array object is passed into `Table.filter`

* Closes: #37650

Lead-authored-by: Patrick Clarke <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
@pitrou pitrou added this to the 14.0.0 milestone Oct 10, 2023
@pitrou
Copy link
Member

pitrou commented Oct 10, 2023

@p-a-a-a-trick Can you post a comment here so that we can assign the issue to you?

@p-a-a-a-trick
Copy link
Contributor

@pitrou sure thing

(I'll take this one (: )

JerAguilon pushed a commit to JerAguilon/arrow that referenced this issue Oct 23, 2023
…pache#38075)

### Rationale for this change
Prevent a segfault upon passing a non-(chunked_)array object into `Table.filter`. See apache#37650.

### What changes are included in this PR?
1. Check filter Datum kind to make sure that it is an array or a chunked array
2. test that attempting to filter a table with another table raises a not implemented error

### Are these changes tested?
In PyArrow, yes

### Are there any user-facing changes?
Raises an error if a non-array or non-chunked_array object is passed into `Table.filter`

* Closes: apache#37650

Lead-authored-by: Patrick Clarke <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…pache#38075)

### Rationale for this change
Prevent a segfault upon passing a non-(chunked_)array object into `Table.filter`. See apache#37650.

### What changes are included in this PR?
1. Check filter Datum kind to make sure that it is an array or a chunked array
2. test that attempting to filter a table with another table raises a not implemented error

### Are these changes tested?
In PyArrow, yes

### Are there any user-facing changes?
Raises an error if a non-array or non-chunked_array object is passed into `Table.filter`

* Closes: apache#37650

Lead-authored-by: Patrick Clarke <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…pache#38075)

### Rationale for this change
Prevent a segfault upon passing a non-(chunked_)array object into `Table.filter`. See apache#37650.

### What changes are included in this PR?
1. Check filter Datum kind to make sure that it is an array or a chunked array
2. test that attempting to filter a table with another table raises a not implemented error

### Are these changes tested?
In PyArrow, yes

### Are there any user-facing changes?
Raises an error if a non-array or non-chunked_array object is passed into `Table.filter`

* Closes: apache#37650

Lead-authored-by: Patrick Clarke <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants