Reading chunked MapArray fails for large variables (but works for smaller) #38513

Open

slobodan-ilic (Contributor) opened this issue Oct 30, 2023 · 0 comments

🐞 Describe the Bug

While using PyArrow to handle real-life survey data from our custom database at Crunch.io, we ran into an issue. Writing the data to Parquet files works as expected, but reading it back into a table with `pq.read_table` raises an error. The error depends on the data size but appears to be related to nested types.
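
To confirm the size dependence, the row-group layout of the written file can be inspected. A minimal sketch (assuming `test.parquet` is the file produced by the reproduction code below; with default write settings all 200K rows typically land in a single row group, so each column chunk is large):

import pyarrow.parquet as pq

# Inspect how the writer laid out the file: the number of row groups and
# the size of each one determine how large a single column chunk gets.
meta = pq.ParquetFile("test.parquet").metadata
print("row groups:", meta.num_row_groups)
for i in range(meta.num_row_groups):
    rg = meta.row_group(i)
    print(f"row group {i}: {rg.num_rows} rows, {rg.total_byte_size} bytes")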


Platform & Version Info

  • Library: PyArrow
  • Language: Python
  • Environment: macOS
  • Version: 13.0.0

⚠️ Error Message

Here is the traceback of the encountered error:

Traceback (most recent call last):
  File "/Users/slobodanilic/Documents/Crunch/repos/crunch-lake/testmap.py", line 61, in <module>
    loaded_map_array = pq.read_table("test.parquet").column(0)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/slobodanilic/Documents/Crunch/repos/crunch-lake/venv/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 3002, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/slobodanilic/Documents/Crunch/repos/crunch-lake/venv/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 2630, in read
    table = self._dataset.to_table(
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 556, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3638, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Nested data conversions not implemented for chunked array outputs
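
A back-of-the-envelope estimate (our assumption about the mechanism, not something confirmed against the Arrow source) may explain why 200K rows fail while 100K rows work: if the dictionary-encoded keys are expanded to plain strings when a column chunk is decoded, the key data for 200K rows is roughly 200,000 rows × 300 keys × ~58 bytes ≈ 3.2 GiB, which exceeds the 2 GiB offset limit of a single non-chunked string array and forces the reader to produce a chunked array; at 100K rows the same product is ≈ 1.6 GiB and fits in one chunk.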

📄 Reproducible Code

Run the code as-is with 200K rows to reproduce the error. Reduce the row count to 100K, and it should work.

"""
Test writing/reading large map array in chunks.

This example demonstrates an issue when trying to encode real-life survey data results
into a map-array structure in pyarrow and saving it into a parquet file. Reading it back
raises an error: `Nested data conversions not implemented for chunked array outputs`.
"""

from typing import List
import numpy as np
from numpy import ndarray
import pyarrow as pa
import pyarrow.parquet as pq

# Parameters
N_ROWS: int = 200000  # changing this to 100K will make the example work
N_COLS: int = 600
SPARSITY: float = 0.5
CHUNK_SIZE: int = 10000

# Calculate sparsity-affected column size
N_COLS_W_VALUES: int = int(N_COLS * SPARSITY)

# Generate "column" names (or keys in MapArray context)
subrefs: List[str] = [
    f"really_really_really_long_column_name_for_a_subreference_{i}"
    for i in range(N_COLS)
]

# Generate an index array for column names
all_subrefs_inds: ndarray = np.arange(N_COLS)

# Generate actual data (random indices) for each row/column combination
subvar_indexes: ndarray = np.array(
    [
        np.random.choice(all_subrefs_inds, size=N_COLS_W_VALUES, replace=False)
        for _ in range(N_ROWS)
    ]
).ravel()

# Generate random values between 1 and 10 for each row/column combination
values: ndarray = np.random.randint(1, 11, size=(N_ROWS, N_COLS_W_VALUES)).ravel()

# Generate offsets for each row
offsets: ndarray = np.linspace(0, N_ROWS * N_COLS_W_VALUES, N_ROWS + 1, dtype=int)

# Create DictionaryArray for keys and MapArray for the map structure
keys = pa.DictionaryArray.from_arrays(pa.array(subvar_indexes), subrefs)
map_array = pa.chunked_array(
    [
        pa.MapArray.from_arrays(offsets[i : i + CHUNK_SIZE + 1], keys, pa.array(values))
        for i in range(0, len(offsets) - 1, CHUNK_SIZE)
    ]
)

# Write and read the table
print("Writing table")
tbl = pa.Table.from_arrays([map_array], names=["map_array"])
pq.write_table(tbl, "test.parquet")

print("Reading table")
loaded_map_array = pq.read_table("test.parquet").column(0)

print("Successfully read the table from parquet and loaded into pyarrow.")

🏷 Component(s)

  • Parquet
  • Python