[C++][Python] write_dataset segfaults for reasonably large tables containing FixedSizeListArrays #43696

Open
sjperkins opened this issue Aug 14, 2024 · 3 comments

@sjperkins
Contributor

sjperkins commented Aug 14, 2024

Describe the bug, including details regarding any error messages, version, and platform.

platform: Ubuntu 22.04, x86_64
python: 3.11
pyarrow: 18.0.0

import os
import numpy as np
import pyarrow as pa
import pyarrow.dataset as pad
import pyarrow.parquet as pq
import tempfile

id1 = np.arange(40)[:, None, None]
id2 = np.arange(50)[None, :, None]
id3 = np.arange(100)[None, None, :]
cell_shape = (6, 15)

id1, id2, id3 = map(np.ravel, np.broadcast_arrays(id1, id2, id3))
nrow, = id3.shape
data = pa.array(np.arange(nrow * np.prod(cell_shape)))
data = pa.FixedSizeListArray.from_arrays(data, cell_shape[-1])
data = pa.FixedSizeListArray.from_arrays(data, cell_shape[-2])
assert len(data) == nrow
T = pa.Table.from_pydict({"ID1": id1, "ID2": id2, "ID3": id3, "DATA": data})
print(f"{T.nbytes / (1024.**2)}MB")

with tempfile.TemporaryDirectory() as dir:
    # Succeeds
    pq.write_table(T, dir + os.path.sep + "test.parquet")
    print("Wrote parquet file")

with tempfile.TemporaryDirectory() as dir:
    partition_fields = [T.schema.field(c) for c in ("ID1", "ID2")]
    partition = pad.partitioning(pa.schema(partition_fields), flavor="hive")
    # Segfaults
    pad.write_dataset(T, dir, partitioning=partition,
                      format="parquet")

This produces a core dump with a backtrace of the following form:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  __pthread_kill_implementation (no_tid=0, signo=11, threadid=139680878245440)
    at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
[Current thread is 1 (Thread 0x7f09fd213640 (LWP 211153))]
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=11, threadid=139680878245440)
    at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=11, threadid=139680878245440)
    at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=139680878245440, signo=signo@entry=11)
    at ./nptl/pthread_kill.c:89
#3  0x00007f0a34442476 in __GI_raise (sig=11) at ../sysdeps/posix/raise.c:26
#4  <signal handler called>
#5  __memmove_avx_unaligned_erms ()
    at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:513
#6  0x00007f0a2fd0c0b9 in arrow::compute::internal::FixedWidthTakeExec(arrow::compute::KernelContext*, arrow::compute::ExecSpan const&, arrow::compute::ExecResult*) ()
   from /home/simon/venv/arcaedev/lib/python3.11/site-packages/pyarrow/libarrow.so.1700
#7  0x00007f0a2fd20b2e in arrow::compute::internal::FSLTakeExec(arrow::compute::KernelContext*, arrow::compute::ExecSpan const&, arrow::compute::ExecResult*) ()
   from /home/simon/venv/arcaedev/lib/python3.11/site-packages/pyarrow/libarrow.so.1700
#8  0x00007f0a2fe0e8a3 in arrow::compute::detail::(anonymous namespace)::VectorExecutor::Exec(arrow::compute::ExecSpan const&, arrow::compute::detail::ExecListener*) ()
   from /home/simon/venv/arcaedev/lib/python3.11/site-packages/pyarrow/libarrow.so.1700
#9  0x00007f0a2fe0ec42 in arrow::compute::detail::(anonymous namespace)::VectorExecutor::Execute(arrow::compute::ExecBatch const&, arrow::compute::detail::ExecListener*) ()
   from /home/simon/venv/arcaedev/lib/python3.11/site-packages/pyarrow/libarrow.so.1700
#10 0x00007f0a2fe2597b in arrow::compute::detail::FunctionExecutorImpl::Execute(std::vector<arrow::Datum, std::allocator<arrow::Datum> > const&, long) ()

The __memmove_avx_unaligned_erms call in frame #5, reached via FSLTakeExec and FixedWidthTakeExec, looks like it could be the trigger. Note also that calling pyarrow.parquet.write_table on the same table succeeds.
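
A hedged sketch of a narrower check (not part of the original report): since the backtrace shows FSLTakeExec delegating to FixedWidthTakeExec, calling take directly on the DATA column with a full-length index array should exercise the same kernels; whether this call crashes in isolation has not been verified here.

import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

# `T` is the table built in the reproducer above.
# A full-length permutation forces the take kernel to shuffle every
# FixedSizeList cell, which appears to be the path the partitioned
# write goes through, judging from the backtrace.
indices = pa.array(np.random.permutation(len(T)))
taken = pc.take(T["DATA"], indices)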

write_dataset seems to succeed if the ID* ranges are made smaller.
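
For reference, a minimal sketch of the smaller-range case, under the assumption that shrinking the ID* ranges is what avoids the crash (the report does not state which sizes start to fail; the values below are illustrative):

import tempfile

import numpy as np
import pyarrow as pa
import pyarrow.dataset as pad

# Same construction as the reproducer above, but with much smaller ID* ranges.
id1 = np.arange(4)[:, None, None]
id2 = np.arange(5)[None, :, None]
id3 = np.arange(10)[None, None, :]
cell_shape = (6, 15)

id1, id2, id3 = map(np.ravel, np.broadcast_arrays(id1, id2, id3))
nrow, = id3.shape
data = pa.array(np.arange(nrow * np.prod(cell_shape)))
data = pa.FixedSizeListArray.from_arrays(data, cell_shape[-1])
data = pa.FixedSizeListArray.from_arrays(data, cell_shape[-2])
T_small = pa.Table.from_pydict({"ID1": id1, "ID2": id2, "ID3": id3, "DATA": data})

with tempfile.TemporaryDirectory() as dir:
    partition_fields = [T_small.schema.field(c) for c in ("ID1", "ID2")]
    partition = pad.partitioning(pa.schema(partition_fields), flavor="hive")
    # Reportedly completes without crashing at smaller sizes like this
    pad.write_dataset(T_small, dir, partitioning=partition, format="parquet")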

Component(s)

C++, Python

@sjperkins sjperkins changed the title write_dataset segfaults for reasonable large tables containing FixedSizeListArrays write_dataset segfaults for reasonably large tables containing FixedSizeListArrays Aug 14, 2024
@sjperkins sjperkins changed the title write_dataset segfaults for reasonably large tables containing FixedSizeListArrays [C++][Python] write_dataset segfaults for reasonably large tables containing FixedSizeListArrays Aug 14, 2024
@vkhodygo

vkhodygo commented Aug 19, 2024

Can't reproduce the error with pyarrow 16.1

P.S. After updating the package, I can't reproduce it with 17.0 either, but I had to terminate the script after 11 minutes of running; it "never" finishes.

@sjperkins
Contributor Author

Can't reproduce the error with pyarrow 16.1

Yes, it succeeds for me on 16.0.0.

@sjperkins
Contributor Author

Updated the issue to reflect that this still occurs on 18.0.0.
