Prune Non-referenced Fields from Nested RowTypes #23074

Closed

Commits on Aug 20, 2024

  1. Prune Non-referenced Fields from Nested RowTypes

    This set of changes prunes nested RowTypes to only the fields that are
    actually dereferenced in the users' projections.
    
    The Parquet implementation already solves this, but it works on
    its own abstractions, so it is not fit for use in the other Hive
    formats. I believe this approach could be adopted by the Parquet
    PageSource as well, thereby simplifying it, but I don't want to bite
    that off now.
    
    I believe the approach will work for Avro as well, but the PageSource
    isn't plumbing the inferred reader schema down to the type resolver:
    it just passes the selected columns from the writer schema as both
    the reader and the writer schema.
    
    I added a test that shows it works for OpenXJson, because it is
    simple to mock data for that format and it supports position-based
    deserialization: a JSON Array into a Row.
    rmarrowstone committed Aug 20, 2024
    ddb2f0a
  2. b96dc9e
  3. 467ae6d
  4. Revert "WIP Avro and Test Changes"

    This reverts commit b96dc9e.
    rmarrowstone committed Aug 20, 2024
    364dc92
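
For readers who want a concrete picture of the pruning described in the first commit message, here is a minimal sketch. It is not the code from this PR: `RowPruning`, `Type`, and `prune` are hypothetical names, and the type model is a simplified stand-in for the engine's real `RowType`. It assumes each dereference is given as a path of field names and that an empty path means the whole value is referenced.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RowPruning {
    /** Hypothetical stand-in for an engine type: a leaf, or a row with ordered named fields. */
    static final class Type {
        final String name;
        final Map<String, Type> fields; // empty for leaf types

        private Type(String name, Map<String, Type> fields) {
            this.name = name;
            this.fields = fields;
        }

        static Type leaf(String name) {
            return new Type(name, Collections.<String, Type>emptyMap());
        }

        static Type row(Map<String, Type> fields) {
            return new Type("row", new LinkedHashMap<>(fields));
        }

        @Override
        public String toString() {
            if (fields.isEmpty()) {
                return name;
            }
            StringBuilder sb = new StringBuilder("row(");
            String sep = "";
            for (Map.Entry<String, Type> e : fields.entrySet()) {
                sb.append(sep).append(e.getKey()).append(' ').append(e.getValue());
                sep = ", ";
            }
            return sb.append(')').toString();
        }
    }

    /** Keeps only the fields reached by at least one dereference path, recursing into nested rows. */
    static Type prune(Type type, List<List<String>> paths) {
        if (type.fields.isEmpty()) {
            return type; // leaf: nothing to prune
        }
        for (List<String> path : paths) {
            if (path.isEmpty()) {
                return type; // the whole row is referenced, so keep every field
            }
        }
        // Group the remaining path suffixes by the first field they dereference.
        Map<String, List<List<String>>> byField = new LinkedHashMap<>();
        for (List<String> path : paths) {
            byField.computeIfAbsent(path.get(0), k -> new ArrayList<>())
                    .add(path.subList(1, path.size()));
        }
        // Walk the original fields so the surviving ones keep their relative order.
        Map<String, Type> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Type> e : type.fields.entrySet()) {
            List<List<String>> suffixes = byField.get(e.getKey());
            if (suffixes != null) {
                kept.put(e.getKey(), prune(e.getValue(), suffixes));
            }
        }
        return Type.row(kept);
    }
}
```

With a row of fields `a`, `b` (itself a row of `x` and `y`), and `c`, pruning against the paths `b.y` and `a` keeps only `a` and `b.y`. How the pruned type is then matched to the data (by name or by position) depends on the format's deserializer and is outside this sketch.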