Prune projected nested columns through UNNEST #3925
Comments
@JamesRTaylor I think there are multiple steps to this one:
to
Here E is completely pruned, and A.B.C is extracted from A.
I feel 3) won't be impactful enough without 1) and 2) for all query shapes. I have a WIP patch for 1), but it needs some more work. My hunch, though, is that the patch, if checked in alone, may degrade performance, since the UNNEST operator already avoids data copies as much as possible.
@martint what are your thoughts on this?
Thanks for the write-up, @phd3. I was thinking (1) and (2) plus using a ConnectorExpression that would contain pruned type definitions. I'd guess that the main improvement would come from the Parquet/ORC reader only reading A_B_C instead of all of A (very similar to the recent improvements for other nested data situations). Would it work to use the Variable class with a type that includes only the referenced subfields? I'd love to take a look at your patch. I could benchmark it on some production queries on our end if it's far enough along so we get an idea of potential impact.
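To make the idea of a type that includes only the referenced subfields concrete, here is a minimal Python sketch. It is a toy model and not Trino's SPI: row types are represented as nested dicts, `prune_type` is a hypothetical helper, and the fields D and G as well as all leaf types are invented for illustration.

```python
from typing import Dict, Iterable, List, Tuple, Union

# Toy type model (an assumption for this sketch): a row type is a dict of
# field name -> type; a leaf type is just its name as a string.
Type = Union[str, Dict[str, "Type"]]
Path = Tuple[str, ...]


def prune_type(row_type: Dict[str, Type], paths: Iterable[Path]) -> Dict[str, Type]:
    """Return a copy of row_type that keeps only the fields reachable via paths.

    Each path is a tuple of field names, e.g. ("A", "B", "C") for A.B.C.
    A path that stops at a row-typed field keeps that whole field.
    """
    # Group the remaining path segments by their first field name.
    by_head: Dict[str, List[Path]] = {}
    for path in paths:
        by_head.setdefault(path[0], []).append(path[1:])

    pruned: Dict[str, Type] = {}
    for name, field_type in row_type.items():  # preserve declared field order
        if name not in by_head:
            continue  # field is never referenced, so it is pruned entirely
        rests = by_head[name]
        if isinstance(field_type, dict) and all(rests):
            # Every reference goes deeper into this row, so recurse.
            pruned[name] = prune_type(field_type, rests)
        else:
            # The field itself is referenced (or it is a leaf): keep it whole.
            pruned[name] = field_type
    return pruned


# Hypothetical table type; only A, B, C and E are names used in this thread.
TABLE_TYPE: Dict[str, Type] = {
    "A": {"B": {"C": "bigint", "D": "varchar"}, "G": "double"},
    "E": "varchar",
}

# Only A.B.C is referenced: E, A.G and A.B.D are dropped from the type.
PRUNED = prune_type(TABLE_TYPE, [("A", "B", "C")])
```

A reader that is handed `PRUNED` instead of `TABLE_TYPE` would, in this model, only need to materialize the single leaf column under A.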
The current plan to support this is the transformation you described in (1), plus the ability to push down functions (e.g., |
@JamesRTaylor sure. I need to understand the case of NULLs in unnest. i.e. make sure that the number of rows output by |
Any update?
Any update on this one?
2024 and most advanced query execution engines still can't read nested data efficiently 🤷‍♂️
Quick update. I had a stable build, but was continuously refactoring on top of Martin's recent AST/IR changes so I opted to wait on completion of those. I'm not expecting large changes after the latest major PR, so I'm fixing up the new classes and unit tests based on the new IR-friendly syntax.
@martint @Desmeister This feature can be a serious game changer for our main use-case. When running simple analytical queries with The only workaround we've come up with so far is 'tricking' Trino into thinking the table only has the subset of the columns required by the query. We do this by creating a Hive external table with the modified schema on top of the Iceberg table's data folder. Do you happen to know if this feature is planned to be merged any time soon?
@Desmeister Can you provide an ETA for the PR?
When CROSS JOIN UNNEST is used to access some columns within an array of rows, the entire row is projected as opposed to only the subset of columns based on the references. For example:
Only a.x should be projected, but currently all columns of a are projected. For very wide rows, this can be expensive.
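The cost described above can be illustrated with a small Python model. This is a sketch under stated assumptions, not Trino's implementation and not the example from the original report: the table contents, the column name `items`, and the helpers `cross_join_unnest` and `prune_array_elements` are all hypothetical. It shows that dropping the unreferenced fields from each array element before unnesting gives the same result as unnesting the full rows and projecting a.x afterwards.

```python
from typing import Any, Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, Any]


def cross_join_unnest(rows: Iterable[Row], array_col: str, alias: str) -> Iterator[Row]:
    """Emit one output row per array element, binding the element to alias.

    In this toy model a NULL (None) or empty array produces no output rows.
    """
    for row in rows:
        for element in row[array_col] or []:
            out = {k: v for k, v in row.items() if k != array_col}
            out[alias] = element
            yield out


def prune_array_elements(rows: Iterable[Row], array_col: str, keep: List[str]) -> Iterator[Row]:
    """Keep only the referenced fields in every element of the array column."""
    for row in rows:
        pruned = dict(row)
        elements = row[array_col]
        if elements is not None:
            pruned[array_col] = [{f: e[f] for f in keep} for e in elements]
        yield pruned


# Hypothetical data: each element of "items" is a wide row; only x is needed.
TABLE: List[Row] = [
    {"id": 1, "items": [{"x": 10, "y": "wide", "z": "wider"},
                        {"x": 20, "y": "wide", "z": "wider"}]},
    {"id": 2, "items": []},
    {"id": 3, "items": None},
]


def project_x_naive(rows: Iterable[Row]) -> List[Tuple[int, int]]:
    # Unnest full elements, then project a.x (all fields are carried along).
    return [(r["id"], r["a"]["x"]) for r in cross_join_unnest(rows, "items", "a")]


def project_x_pruned(rows: Iterable[Row]) -> List[Tuple[int, int]]:
    # Prune elements down to x first, then unnest and project a.x.
    narrow = prune_array_elements(rows, "items", ["x"])
    return [(r["id"], r["a"]["x"]) for r in cross_join_unnest(narrow, "items", "a")]
```

Both functions return the same rows for `TABLE`; the difference is that the pruned variant never carries y and z through the unnest step, which is where the savings for very wide rows would come from.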