Reading chunked MapArray fails for large variables (but works for smaller) #38513

Open

slobodan-ilic (Contributor) opened this issue Oct 30, 2023 · 0 comments

🐞 Describe the Bug

While using PyArrow to handle real-life survey data from our custom database at Crunch.io, we ran into an issue. Writing the data to Parquet files works as expected, but reading it back into a table with `pq.read_table` raises an error. The error depends on the data size but appears to be related to nested types.
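
To confirm the size dependence, the row-group layout of the written file can be inspected. A minimal sketch (assuming `test.parquet` is the file produced by the reproduction code below; with default write settings all 200K rows typically land in a single row group, so each column chunk is large):

import pyarrow.parquet as pq

# Inspect how the writer laid out the file: the number of row groups and
# the size of each one determine how large a single column chunk gets.
meta = pq.ParquetFile("test.parquet").metadata
print("row groups:", meta.num_row_groups)
for i in range(meta.num_row_groups):
    rg = meta.row_group(i)
    print(f"row group {i}: {rg.num_rows} rows, {rg.total_byte_size} bytes")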


Platform & Version Info

  • Library: PyArrow
  • Language: Python
  • Environment: macOS
  • Version: 13.0.0

⚠️ Error Message

Here is the traceback of the encountered error:

Traceback (most recent call last):
  File "/Users/slobodanilic/Documents/Crunch/repos/crunch-lake/testmap.py", line 61, in <module>
    loaded_map_array = pq.read_table("test.parquet").column(0)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/slobodanilic/Documents/Crunch/repos/crunch-lake/venv/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 3002, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/slobodanilic/Documents/Crunch/repos/crunch-lake/venv/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 2630, in read
    table = self._dataset.to_table(
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 556, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3638, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Nested data conversions not implemented for chunked array outputs
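
A back-of-the-envelope estimate (our assumption about the mechanism, not something confirmed against the Arrow source) may explain why 200K rows fail while 100K rows work: if the dictionary-encoded keys are expanded to plain strings when a column chunk is decoded, the key data for 200K rows is roughly 200,000 rows × 300 keys × ~58 bytes ≈ 3.2 GiB, which exceeds the 2 GiB offset limit of a single non-chunked string array and forces the reader to produce a chunked array; at 100K rows the same product is ≈ 1.6 GiB and fits in one chunk.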

📄 Reproducible Code

Run the code as-is with 200K rows to reproduce the error. Reduce the row count to 100K, and it should work.

"""
Test writing/reading large map array in chunks.

This example demonstrates an issue when trying to encode real-life survey data results
into a map-array structure in pyarrow and saving it into a parquet file. Reading it back
raises an error: `Nested data conversions not implemented for chunked array outputs`.
"""

from typing import List
import numpy as np
from numpy import ndarray
import pyarrow as pa
import pyarrow.parquet as pq

# Parameters
N_ROWS: int = 200000  # changing this to 100K will make the example work
N_COLS: int = 600
SPARSITY: float = 0.5
CHUNK_SIZE: int = 10000

# Calculate sparsity-affected column size
N_COLS_W_VALUES: int = int(N_COLS * SPARSITY)

# Generate "column" names (or keys in MapArray context)
subrefs: List[str] = [
    f"really_really_really_long_column_name_for_a_subreference_{i}"
    for i in range(N_COLS)
]

# Generate an index array for column names
all_subrefs_inds: ndarray = np.arange(N_COLS)

# Generate actual data (random indices) for each row/column combination
subvar_indexes: ndarray = np.array(
    [
        np.random.choice(all_subrefs_inds, size=N_COLS_W_VALUES, replace=False)
        for _ in range(N_ROWS)
    ]
).ravel()

# Generate random values between 1 and 10 for each row/column combination
values: ndarray = np.random.randint(1, 11, size=(N_ROWS, N_COLS_W_VALUES)).ravel()

# Generate offsets for each row
offsets: ndarray = np.linspace(0, N_ROWS * N_COLS_W_VALUES, N_ROWS + 1, dtype=int)

# Create DictionaryArray for keys and MapArray for the map structure
keys = pa.DictionaryArray.from_arrays(pa.array(subvar_indexes), subrefs)
map_array = pa.chunked_array(
    [
        pa.MapArray.from_arrays(offsets[i : i + CHUNK_SIZE + 1], keys, pa.array(values))
        for i in range(0, len(offsets) - 1, CHUNK_SIZE)
    ]
)

# Write and read the table
print("Writing table")
tbl = pa.Table.from_arrays([map_array], names=["map_array"])
pq.write_table(tbl, "test.parquet")

print("Reading table")
loaded_map_array = pq.read_table("test.parquet").column(0)

print("Successfully read the table from parquet and loaded into pyarrow.")

🏷 Component(s)

  • Parquet
  • Python