
Prune projected nested columns through UNNEST #3925

Open
JamesRTaylor opened this issue Jun 4, 2020 · 11 comments

@JamesRTaylor

When CROSS JOIN UNNEST is used to access some columns within an array of rows, the entire row is projected as opposed to only the subset of columns based on the references. For example:

create table nest_test(id int, a array(row(x varchar, y varchar))) WITH (format='PARQUET')
select x from nest_test cross join unnest(a)

Only a.x should be projected, but currently all columns of a are projected. For very wide rows, this can be expensive.
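The semantics at stake can be sketched in a toy Python model (not Trino code; dicts stand in for ROWs, following the example table's column names). The query above only ever touches x, yet the engine materializes every field of each array element:

```python
# Toy model (not Trino internals) of: select x from nest_test cross join unnest(a)

def unnest_select_x(rows):
    """For each row, unnest the array 'a' of (x, y) structs and project only x."""
    out = []
    for row in rows:
        for element in row["a"]:      # CROSS JOIN UNNEST(a)
            out.append(element["x"])  # only x is referenced; y is dead weight
    return out

rows = [{"id": 1, "a": [{"x": "x1", "y": "wide-unused-1"},
                        {"x": "x2", "y": "wide-unused-2"}]}]

print(unnest_select_x(rows))  # prints ['x1', 'x2']
```

In an ideal plan, the y values (and any other unreferenced fields) would never be read off disk at all.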

@phd3
Member

phd3 commented Jun 5, 2020

@JamesRTaylor I think there are multiple steps to this one:

  1. Prune unnest fields in the UNNEST node. This is similar to pruning symbols for single-level nesting. For multi-level nesting, this is more like pushing down dereferences. For example, convert

	Project ( p1 := f(A.B.C), p2 := m(K) )
		unnest(arr -> {A, E, K})
			S(arr)

to

	Project ( p1 := f(A_B_C), p2 := m(K) )
		unnest(arr' -> {A_B_C, K})
			project(arr' -> transform(arr, el -> row(el.A.B.C, el.K)))
				S(arr)

Here E is completely pruned, and A.B.C is extracted from A.
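The rewrite above can be sketched in a toy Python model (not Trino code; field names A, B, C, E, K follow the plan sketch, and dicts stand in for ROWs). The key invariant is that the pruned array has the same length as the original, so the downstream UNNEST produces the same number of rows, while E disappears and A.B.C is flattened into A_B_C:

```python
# Toy model of: transform(arr, el -> row(el.A.B.C, el.K))

def prune(arr):
    # Keep only the referenced fields of each element; drop E entirely.
    return [{"A_B_C": el["A"]["B"]["C"], "K": el["K"]} for el in arr]

arr = [
    {"A": {"B": {"C": 1}}, "E": "pruned away", "K": 10},
    {"A": {"B": {"C": 2}}, "E": "pruned away", "K": 20},
]

pruned = prune(arr)
assert len(pruned) == len(arr)             # same cardinality after UNNEST
assert pruned[0] == {"A_B_C": 1, "K": 10}  # f(A_B_C) and m(K) see the same values
```

Because p1 and p2 read only A_B_C and K, they compute identical results on the pruned elements.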

  2. Push this array transformation as far down the plan as possible so that it can reach TableScan. We might be able to represent it through a new subclass of LambdaExpression to simplify things.

  3. Add support for generic function calls in ConnectorExpression. (If this is too involved, maybe we can start with creating a connector expression for the new subclass mentioned above.) Then the ORC/Parquet readers would be able to prune the fields from the array rows.

I feel 3) won't be impactful enough for all query shapes without 1) and 2). I have a WIP patch for 1), but it needs some more work. My hunch is that that patch, if checked in alone, may degrade performance, since the UNNEST operator already avoids copying data as much as possible.

@findepi findepi added the enhancement New feature or request label Jun 5, 2020
@phd3
Member

phd3 commented Jun 5, 2020

@martint what are your thoughts on this?

@JamesRTaylor
Author

Thanks for the write-up, @phd3. I was thinking of (1) and (2), plus using a ConnectorExpression that would contain pruned type definitions. I'd guess that the main improvement would come from the Parquet/ORC reader only reading A_B_C instead of all of A (very similar to the recent improvements for other nested-data situations). Would it work to use the Variable class with a type that includes only the referenced subfields?

I'd love to take a look at your patch. If it's far enough along, I could benchmark it on some production queries on our end so we get an idea of the potential impact.

@martint
Member

martint commented Jun 6, 2020

The current plan to support this is the transformation you described in (1), plus the ability to push down functions (e.g., transform) and lambda expressions to the connector. Before we can do that, we need to land some of the refactorings we've been making to the way functions are represented in the plan and add some APIs to be able to communicate and obtain information about them between the connector and the engine.

@phd3
Member

phd3 commented Jun 9, 2020

@JamesRTaylor Sure. I need to understand the case of NULLs in unnest, i.e., make sure that the number of rows output by unnest(a) is exactly the same after pruning a. I'll create a WIP PR by next week.
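The NULL concern can be made concrete with a toy Python model (not Trino code; None stands in for SQL NULL). The pruning transform must map a NULL array element to NULL rather than dropping it or failing on the dereference, so that unnest over the pruned array emits exactly as many rows as unnest over the original:

```python
# Toy model of transform-based pruning that preserves NULL elements.

def prune(arr):
    return [None if el is None
            else {"A_B_C": el["A"]["B"]["C"], "K": el["K"]}
            for el in arr]

arr = [{"A": {"B": {"C": 1}}, "E": "e", "K": 10}, None]

pruned = prune(arr)
assert len(pruned) == len(arr)  # row count preserved, NULLs included
assert pruned[1] is None        # NULL element survives as NULL
```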

@gjhkael

gjhkael commented Sep 13, 2021

Any update?

@mathfool

Any update on this one?

@mixermt

mixermt commented Apr 2, 2024

It's 2024, and even the most advanced query execution engines still can't read nested data efficiently 🤷‍♂️

@Desmeister

Quick update. I had a stable build, but I was continuously refactoring on top of Martin's recent AST/IR changes, so I opted to wait for those to be completed. I'm not expecting large changes after the latest major PR, so I'm fixing up the new classes and unit tests based on the new IR-friendly syntax.

@Desmeister Desmeister reopened this Apr 2, 2024
@rotem-ad
Member

rotem-ad commented Apr 8, 2024

@martint @Desmeister This feature can be a serious game changer for our main use-case.
We have a very large Iceberg table (~40TB of compressed ingested data per day).
The table's schema includes many structured columns, including multi-level nesting (e.g. ARRAY(ROW(col_a, col_b, .., ARRAY(ROW(col_1, col_2, col_3, ...))))).
Some of the ROW structs include hundreds of nested columns, and in total there are a few thousand nested columns in the table.

When running simple analytical queries with UNNEST operations against this table, referencing only a tiny subset of the nested columns, Trino simply can't handle it: queries always fail with a Query exceeded per-node memory limit.. exception.
Enabling fault-tolerant execution for those queries doesn't help either; they still fail after a while.

The only workaround we've come up with so far is 'tricking' Trino into thinking the table has only the subset of columns required by the query. We do this by creating a Hive external table with the modified schema on top of the Iceberg table's data folder.
Running the same queries against the external table is super fast! However, this workaround has too many caveats and obviously cannot be used in production.

Do you happen to know if this feature is planned to be merged any time soon?

@bentzimortaboola

@Desmeister Can you provide an ETA for the PR?
