🐞 Describe the Bug
While using pyarrow to handle real-life survey data from our custom database at Crunch.io, we ran into an issue. Writing the data to parquet files works as expected, but reading the data back into a table using pq.read_table triggers an error. The error depends on the data size but seems related to nested types.
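For context, the nested type involved is a map whose keys are dictionary-encoded strings and whose values are integers. Below is a minimal sketch of that type at toy scale (the key/value names are illustrative only, not from our schema):

import pyarrow as pa

# Two map rows, {"q1": 5, "q2": 3} and {"q1": 7}, with dictionary-encoded keys,
# mirroring the structure used in the full repro script further down.
keys = pa.DictionaryArray.from_arrays(pa.array([0, 1, 0]), ["q1", "q2"])
values = pa.array([5, 3, 7])
small_map = pa.MapArray.from_arrays([0, 2, 3], keys, values)
print(small_map.type)  # a map type with dictionary-encoded string keys and int64 items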
Platform & Version Info
Library: PyArrow
Language: Python
Environment: MacOS
Version: 13.0.0
⚠️ Error Message
Here is the traceback of the encountered error:
Traceback (most recent call last):
  File "/Users/slobodanilic/Documents/Crunch/repos/crunch-lake/testmap.py", line 61, in <module>
    loaded_map_array = pq.read_table("test.parquet").column(0)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/slobodanilic/Documents/Crunch/repos/crunch-lake/venv/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 3002, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/slobodanilic/Documents/Crunch/repos/crunch-lake/venv/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 2630, in read
    table = self._dataset.to_table(
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 556, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3638, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Nested data conversions not implemented for chunked array outputs
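The failure only appears once the reader has to hand the nested column back as a chunked array. One thing that may be worth experimenting with while this is open is reading the row groups individually and concatenating them, rather than asking read_table for the whole column at once. This is a hedged sketch against the test.parquet file written by the repro script below; whether it actually sidesteps the error depends on how large each row group's column chunk ends up being:

import pyarrow as pa
import pyarrow.parquet as pq

# Read each row group on its own and stitch the pieces back together.
pf = pq.ParquetFile("test.parquet")
pieces = [pf.read_row_group(i) for i in range(pf.metadata.num_row_groups)]
tbl = pa.concat_tables(pieces)
print(tbl.num_rows, tbl.column("map_array").num_chunks)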
📄 Reproducible Code
Run the code as-is with 200K rows to reproduce the error. Reduce the row count to 100K, and it should work.
"""Test writing/reading large map array in chunks.This example demonstrates an issue when trying to encode real-life survey data resultsinto a map-array structure in pyarrow and saving it into a parquet file. Reading it backraises an error: `Nested data conversions not implemented for chunked array outputs`."""fromtypingimportListimportnumpyasnpfromnumpyimportndarrayimportpyarrowaspaimportpyarrow.parquetaspq# ParametersN_ROWS: int=200000# changing this to 100K will make the example workN_COLS: int=600SPARSITY: float=0.5CHUNK_SIZE: int=10000# Calculate sparsity-affected column sizeN_COLS_W_VALUES: int=int(N_COLS*SPARSITY)
# Generate "column" names (or keys in MapArray context)subrefs: List[str] = [
f"really_really_really_long_column_name_for_a_subreference_{i}"foriinrange(N_COLS)
]
# Generate an index array for column namesall_subrefs_inds: ndarray=np.arange(N_COLS)
# Generate actual data (random indices) for each row/column combinationsubvar_indexes: ndarray=np.array(
[
np.random.choice(all_subrefs_inds, size=N_COLS_W_VALUES, replace=False)
for_inrange(N_ROWS)
]
).ravel()
# Generate random values between 1 and 10 for each row/column combinationvalues: ndarray=np.random.randint(1, 11, size=(N_ROWS, N_COLS_W_VALUES)).ravel()
# Generate offsets for each rowoffsets: ndarray=np.linspace(0, N_ROWS*N_COLS_W_VALUES, N_ROWS+1, dtype=int)
# Create DictionaryArray for keys and MapArray for the map structurekeys=pa.DictionaryArray.from_arrays(pa.array(subvar_indexes), subrefs)
map_array=pa.chunked_array(
[
pa.MapArray.from_arrays(offsets[i : i+CHUNK_SIZE+1], keys, pa.array(values))
foriinrange(0, len(offsets) -1, CHUNK_SIZE)
]
)
# Write and read the tableprint("Writing table")
tbl=pa.Table.from_arrays([map_array], names=["map_array"])
pq.write_table(tbl, "test.parquet")
print("Reading table")
loaded_map_array=pq.read_table("test.parquet").column(0)
print("Successfully read the table from parquet and loaded into pyarrow.")
🏷 Component(s)
Parquet
Python