Prune Non-referenced Fields from Nested RowTypes #23074

rmarrowstone · 2024-08-19T23:57:11Z

This set of changes prunes nested RowTypes to only the fields that are
actually referenced in the users' projections.
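
To make the idea concrete, here is a minimal sketch of pruning a nested row type down to the dereferenced paths. It is not the code in this PR and it does not use Trino's RowType API: the `Node` class, `prune`, and `describe` are hypothetical names introduced only for illustration, and the sketch assumes that a path ending at a nested row keeps that row's whole subtree.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RowPruningSketch {
    // Hypothetical stand-in for a nested row type (not Trino's RowType).
    // A node with no children is treated as a leaf field.
    static final class Node {
        final Map<String, Node> children = new LinkedHashMap<>();

        Node field(String name) {
            return field(name, new Node());
        }

        Node field(String name, Node child) {
            children.put(name, child);
            return this;
        }
    }

    // Keeps only the fields that lie on one of the dereference paths,
    // preserving the field order of the original schema.
    static Node prune(Node schema, List<List<String>> paths) {
        Node pruned = new Node();
        for (Map.Entry<String, Node> field : schema.children.entrySet()) {
            List<List<String>> remaining = new ArrayList<>();
            boolean selected = false;
            boolean wholeField = false;
            for (List<String> path : paths) {
                if (!path.isEmpty() && path.get(0).equals(field.getKey())) {
                    selected = true;
                    if (path.size() == 1) {
                        wholeField = true; // path ends here: keep the full subtree
                    }
                    else {
                        remaining.add(path.subList(1, path.size()));
                    }
                }
            }
            if (wholeField) {
                pruned.children.put(field.getKey(), field.getValue());
            }
            else if (selected) {
                pruned.children.put(field.getKey(), prune(field.getValue(), remaining));
            }
        }
        return pruned;
    }

    // Renders a row as nested braces, for example {address{city}, age}.
    static String describe(Node node) {
        StringBuilder out = new StringBuilder("{");
        for (Map.Entry<String, Node> field : node.children.entrySet()) {
            if (out.length() > 1) {
                out.append(", ");
            }
            out.append(field.getKey());
            if (!field.getValue().children.isEmpty()) {
                out.append(describe(field.getValue()));
            }
        }
        return out.append("}").toString();
    }
}
```

For a row with fields `name`, `address{city, zip}`, and `age`, pruning with the paths `address.city` and `age` yields `{address{city}, age}` in this sketch, so `name` and `address.zip` would never need to be materialized.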

The Parquet implementation already solves this, but it works on
its own abstractions, so it is not fit for use in the other Hive
formats. I believe this approach could be adopted by the Parquet
PageSource as well, thereby simplifying it, but I don't want to bite that
off now.

I believe the approach will work for Avro as well, but the PageSource
isn't plumbing the inferred reader schema down to the type resolver:
it is just passing the selected columns from the writer schema as both
reader and writer.

I added a test that proves it works well for OpenXJson because it
is simple to mock data for it and it supports position-based
deserialization: a JSON Array into a Row. That, along with the changes
to the SEQUENCE format, suggests that this approach should work for
any implementation's needs.
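
As a rough illustration of why position-based deserialization is a demanding test case: when a row is encoded as a JSON array, each value is identified only by its ordinal in the full schema, so a reader working from a pruned row still has to map original positions to the retained fields. The sketch below is an assumption about that mapping, not the OpenXJson reader's code; `PositionalRowSketch` and `readPruned` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class PositionalRowSketch {
    // Decodes an array-encoded row, keeping only the retained fields.
    // writerFields lists the field names of the full (unpruned) row in order;
    // values holds the array elements in that same order.
    static Map<String, Object> readPruned(List<String> writerFields, Set<String> retained, List<Object> values) {
        Map<String, Object> row = new LinkedHashMap<>();
        for (int ordinal = 0; ordinal < writerFields.size(); ordinal++) {
            String name = writerFields.get(ordinal);
            if (!retained.contains(name)) {
                continue; // pruned field: skip the value without materializing it
            }
            // Assumption: a short array means the trailing fields are null.
            row.put(name, ordinal < values.size() ? values.get(ordinal) : null);
        }
        return row;
    }
}
```

With writer fields `name`, `city`, `zip` and only `zip` retained, the array `["Ann", "Seattle", "98101"]` decodes to a row holding just `zip`, read from its original third position.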

Description

I discovered this while starting an implementation for the Amazon Ion
format. I was curious about projectBaseColumns, what it did and how
it did it. I was surprised to find that the complete structures were being
materialized. I wanted to fix it in a way that would also work for Ion, while
learning about the relevant parts of the Trino codebase.

I found a clustering of prior issues and believe there have been some
related threads in the Trino Slack workspace.

I didn't try to tackle anything that touched the optimizer, just took as a
given what the Hive connectors get today. I realize that means that
pruning based on array position or involving functions is out of scope.

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
(x) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

cla-bot · 2024-08-19T23:57:13Z

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

This set of changes prunes nested RowTypes to only the fields that are actually dereferenced in the users' projections. The Parquet implementation already solves this, but it works on its own abstractions, so it is not fit for use in the other Hive formats. I believe this approach could be adopted by the Parquet PageSource as well, thereby simplifying it, but I don't want to bite that off now. I believe the approach will work for Avro as well, but the PageSource isn't plumbing the inferred reader schema down to the type resolver: it is just passing the selected columns from the writer schema as both reader and writer. I added a test that proves it works well for OpenXJson because it is simple to mock data for it and it supports position-based deserialization: a JSON Array into a Row.

github-actions · 2024-09-11T17:03:01Z

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

rmarrowstone · 2024-09-11T17:19:26Z

I'm going to close this for now. I believe there is value in a general and more complete schema pruning approach. I will revisit when there is either more general interest, or I have a more pressing need and/or data.

github-actions bot added the hive Hive connector label Aug 19, 2024

rmarrowstone force-pushed the prune-nested-projections branch from cdb4cd9 to ddb2f0a Compare August 20, 2024 18:50

rmarrowstone added 2 commits August 20, 2024 11:54

WIP Avro and Test Changes

b96dc9e

Make work for SEQUENCE file

467ae6d

Revert "WIP Avro and Test Changes"

364dc92

This reverts commit b96dc9e.

rmarrowstone marked this pull request as ready for review August 20, 2024 23:16

github-actions bot added the stale label Sep 11, 2024

rmarrowstone closed this Sep 11, 2024
