[C++][Python] write_dataset segfaults for reasonably large tables containing FixedSizeListArrays #43696

Open
sjperkins opened this issue Aug 14, 2024 · 3 comments

@sjperkins
Contributor

sjperkins commented Aug 14, 2024

Describe the bug, including details regarding any error messages, version, and platform.

platform: Ubuntu 22.04, x86_64
python: 3.11
pyarrow: 18.0.0

import os
import numpy as np
import pyarrow as pa
import pyarrow.dataset as pad
import pyarrow.parquet as pq
import tempfile

id1 = np.arange(40)[:, None, None]
id2 = np.arange(50)[None, :, None]
id3 = np.arange(100)[None, None, :]
cell_shape = (6, 15)

id1, id2, id3 = map(np.ravel, np.broadcast_arrays(id1, id2, id3))
nrow, = id3.shape
data = pa.array(np.arange(nrow * np.prod(cell_shape)))
data = pa.FixedSizeListArray.from_arrays(data, cell_shape[-1])
data = pa.FixedSizeListArray.from_arrays(data, cell_shape[-2])
assert len(data) == nrow
T = pa.Table.from_pydict({"ID1": id1, "ID2": id2, "ID3": id3, "DATA": data})
print(f"{T.nbytes / (1024.**2)}MB")

with tempfile.TemporaryDirectory() as dir:
    # Succeeds
    pq.write_table(T, dir + os.path.sep + "test.parquet")
    print("Wrote parquet file")

with tempfile.TemporaryDirectory() as dir:
    partition_fields = [T.schema.field(c) for c in ("ID1", "ID2")]
    partition = pad.partitioning(pa.schema(partition_fields), flavor="hive")
    # Segfaults
    pad.write_dataset(T, dir, partitioning=partition,
                      format="parquet")

This produces a core dump with a backtrace of the following form:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  __pthread_kill_implementation (no_tid=0, signo=11, threadid=139680878245440)
    at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
[Current thread is 1 (Thread 0x7f09fd213640 (LWP 211153))]
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=11, threadid=139680878245440)
    at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=11, threadid=139680878245440)
    at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=139680878245440, signo=signo@entry=11)
    at ./nptl/pthread_kill.c:89
#3  0x00007f0a34442476 in __GI_raise (sig=11) at ../sysdeps/posix/raise.c:26
#4  <signal handler called>
#5  __memmove_avx_unaligned_erms ()
    at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:513
#6  0x00007f0a2fd0c0b9 in arrow::compute::internal::FixedWidthTakeExec(arrow::compute::KernelContext*, arrow::compute::ExecSpan const&, arrow::compute::ExecResult*) ()
   from /home/simon/venv/arcaedev/lib/python3.11/site-packages/pyarrow/libarrow.so.1700
#7  0x00007f0a2fd20b2e in arrow::compute::internal::FSLTakeExec(arrow::compute::KernelContext*, arrow::compute::ExecSpan const&, arrow::compute::ExecResult*) ()
   from /home/simon/venv/arcaedev/lib/python3.11/site-packages/pyarrow/libarrow.so.1700
#8  0x00007f0a2fe0e8a3 in arrow::compute::detail::(anonymous namespace)::VectorExecutor::Exec(arrow::compute::ExecSpan const&, arrow::compute::detail::ExecListener*) ()
   from /home/simon/venv/arcaedev/lib/python3.11/site-packages/pyarrow/libarrow.so.1700
#9  0x00007f0a2fe0ec42 in arrow::compute::detail::(anonymous namespace)::VectorExecutor::Execute(arrow::compute::ExecBatch const&, arrow::compute::detail::ExecListener*) ()
   from /home/simon/venv/arcaedev/lib/python3.11/site-packages/pyarrow/libarrow.so.1700
#10 0x00007f0a2fe2597b in arrow::compute::detail::FunctionExecutorImpl::Execute(std::vector<arrow::Datum, std::allocator<arrow::Datum> > const&, long) ()

The __memmove_avx_unaligned_erms call in frame #5, reached via FSLTakeExec and FixedWidthTakeExec, looks like it could be the trigger. Note also that calling pyarrow.parquet.write_table on the same table succeeds.
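
A hedged sketch of a narrower check (not part of the original report): since the backtrace shows FSLTakeExec delegating to FixedWidthTakeExec, calling take directly on the DATA column with a full-length index array should exercise the same kernels; whether this call crashes in isolation has not been verified here.

import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

# `T` is the table built in the reproducer above.
# A full-length permutation forces the take kernel to shuffle every
# FixedSizeList cell, which appears to be the path the partitioned
# write goes through, judging from the backtrace.
indices = pa.array(np.random.permutation(len(T)))
taken = pc.take(T["DATA"], indices)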

write_dataset seems to succeed if the ID* ranges are made smaller.
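
For reference, a minimal sketch of the smaller-range case, under the assumption that shrinking the ID* ranges is what avoids the crash (the report does not state which sizes start to fail; the values below are illustrative):

import tempfile

import numpy as np
import pyarrow as pa
import pyarrow.dataset as pad

# Same construction as the reproducer above, but with much smaller ID* ranges.
id1 = np.arange(4)[:, None, None]
id2 = np.arange(5)[None, :, None]
id3 = np.arange(10)[None, None, :]
cell_shape = (6, 15)

id1, id2, id3 = map(np.ravel, np.broadcast_arrays(id1, id2, id3))
nrow, = id3.shape
data = pa.array(np.arange(nrow * np.prod(cell_shape)))
data = pa.FixedSizeListArray.from_arrays(data, cell_shape[-1])
data = pa.FixedSizeListArray.from_arrays(data, cell_shape[-2])
T_small = pa.Table.from_pydict({"ID1": id1, "ID2": id2, "ID3": id3, "DATA": data})

with tempfile.TemporaryDirectory() as dir:
    partition_fields = [T_small.schema.field(c) for c in ("ID1", "ID2")]
    partition = pad.partitioning(pa.schema(partition_fields), flavor="hive")
    # Reportedly completes without crashing at smaller sizes like this
    pad.write_dataset(T_small, dir, partitioning=partition, format="parquet")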

Component(s)

C++, Python

@sjperkins sjperkins changed the title write_dataset segfaults for reasonable large tables containing FixedSizeListArrays write_dataset segfaults for reasonably large tables containing FixedSizeListArrays Aug 14, 2024
@sjperkins sjperkins changed the title write_dataset segfaults for reasonably large tables containing FixedSizeListArrays [C++][Python] write_dataset segfaults for reasonably large tables containing FixedSizeListArrays Aug 14, 2024
@vkhodygo

vkhodygo commented Aug 19, 2024

Can't reproduce the error with pyarrow 16.1

P.S. After updating the package, I can't reproduce it with 17.0 either, but I had to terminate the script after 11 minutes of running; it "never" finishes.

@sjperkins
Contributor Author

Can't reproduce the error with pyarrow 16.1

Yes, it succeeds for me on 16.0.0.

@sjperkins
Contributor Author

Updated the issue to reflect that this still occurs on 18.0.0.
