
Error when Z Ordering a larger dataset #1459

Closed
MrPowers opened this issue Jun 13, 2023 · 2 comments · Fixed by #1461
Labels
bug Something isn't working

Comments

@MrPowers
Collaborator

Environment

Delta-rs version: 0.10.0

Binding: Python

Environment:

  • Cloud provider: Localhost
  • OS: Macbook
  • Other: Macbook M1 with 64 GB of RAM

Bug

What happened: The Z Order command worked on the 5 GB h2o groupby dataset (1e8 rows), but errors out on the 50 GB dataset (1e9 rows).

What you expected to happen: I expected the Z Ordering to work

How to reproduce it: This notebook shows the computations working well on the 1e8 dataset, but erroring out on the 1e9 dataset.

More details: I'm Z Ordering on a single column. Here's the error message:

thread 'tokio-runtime-worker' panicked at 'overflow', /Users/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow-select-39.0.0/src/interleave.rs:172:56
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
---------------------------------------------------------------------------
DeltaError                                Traceback (most recent call last)
File <timed eval>:1

File ~/opt/miniconda3/envs/deltalake-0100/lib/python3.9/site-packages/deltalake/table.py:697, in TableOptimizer.z_order(self, columns, partition_filters, target_size, max_concurrent_tasks)
    675 def z_order(
    676     self,
    677     columns: Iterable[str],
   (...)
    680     max_concurrent_tasks: Optional[int] = None,
    681 ) -> Dict[str, Any]:
    682     """
    683     Reorders the data using a Z-order curve to improve data skipping.
    684 
   (...)
    695     :return: the metrics from optimize
    696     """
--> 697     metrics = self.table._table.z_order_optimize(
    698         list(columns), partition_filters, target_size, max_concurrent_tasks
    699     )
    700     self.table.update_incremental()
    701     return json.loads(metrics)

DeltaError: Generic error: task 334 panicked
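For readers unfamiliar with the term in the docstring above: a Z-order (Morton) key interleaves the bits of the sort columns, so sorting by it keeps rows that are close in every column near each other on disk. A minimal illustrative sketch (not the delta-rs implementation) for two integer columns:

```python
def morton2(x: int, y: int, bits: int = 8) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order key.

    x's bits land at even positions, y's bits at odd positions.
    """
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key

# Sorting rows by this key clusters rows with similar (x, y) in the
# same files, which improves min/max-based data skipping.
points = sorted([(3, 1), (0, 0), (1, 2)], key=lambda p: morton2(*p))
```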
MrPowers added the `bug` label Jun 13, 2023
@wjones127
Collaborator

Got it. That's likely from this line:

https://github.com/apache/arrow-rs/blob/700bd334ad8be53455f5dd80023b6c8c237559a7/arrow-select/src/interleave.rs#L172

That line indicates we have a binary column that is too large for StringArray. Maybe we can use large types for this? Or else interleave the results in chunks.
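A back-of-envelope check of that theory (my numbers, not from the issue): Arrow's plain StringArray uses signed 32-bit offsets into a single byte buffer, so one array can hold at most 2**31 - 1 bytes of string data. With 1e9 rows of ~12-byte values like `id0000109363`, a fully materialized column blows past that limit:

```python
# i32 offsets cap a StringArray's data buffer at ~2 GiB.
I32_MAX = 2**31 - 1

rows = 10**9                           # h2o 1e9 dataset
bytes_per_value = len("id0000109363")  # 12 bytes per id3-style value
needed = rows * bytes_per_value        # ~12 GB of string bytes

assert needed > I32_MAX  # one giant interleaved batch would overflow
```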

@MrPowers
Collaborator Author

Here are a few rows of the data:

|   | id1   | id2   | id3          | id4 | id5 | id6    | v1 | v2 | v3        |
|---|-------|-------|--------------|-----|-----|--------|----|----|-----------|
| 0 | id016 | id046 | id0000109363 | 88  | 13  | 146094 | 4  | 6  | 18.837686 |
| 1 | id039 | id087 | id0000466766 | 14  | 30  | 111330 | 4  | 14 | 46.797328 |
| 2 | id047 | id098 | id0000307804 | 85  | 23  | 187639 | 3  | 5  | 47.577311 |
| 3 | id043 | id017 | id0000344864 | 87  | 76  | 256509 | 2  | 5  | 80.462924 |
| 4 | id054 | id027 | id0000433679 | 99  | 67  | 32736  | 1  | 7  | 15.796662 |

Here are the dtypes:

id1     object
id2     object
id3     object
id4      int64
id5      int64
id6      int64
v1       int64
v2       int64
v3     float64
dtype: object

Here's the code I ran: `dt.optimize.z_order(["id1"])`

I should have included these details in the initial bug report.

wjones127 added a commit that referenced this issue Jul 3, 2023
…1461)

# Description

Fixes the base implementation so that it doesn't materialize the entire
result in one record batch. It will still require materializing the full
input for each partition in memory. This is mostly a problem for
unpartitioned tables, since that means materializing the entire table in
memory.

Adds a new datafusion-based implementation enabled by the `datafusion`
feature. In theory, this should support spilling to disk.
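A hypothetical sketch of the chunking idea described above (illustrative only, not the delta-rs code): emit the sorted rows in bounded-size batches so no single output array can exceed the 32-bit offset limit:

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def interleave_in_chunks(sorted_rows: Iterable[T],
                         chunk_size: int) -> Iterator[List[T]]:
    """Yield the z-ordered rows as fixed-size batches instead of one
    giant record batch, bounding per-batch memory and buffer sizes."""
    batch: List[T] = []
    for row in sorted_rows:
        batch.append(row)
        if len(batch) == chunk_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch

chunks = list(interleave_in_chunks(range(10), 4))
```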

# Related Issue(s)


- closes #1459
- closes #1460
