-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Restore predicate pushdown of metadata field in CheckpointEntryIterator #19157
Conversation
HiveColumnHandle metadata = columns.stream() | ||
.filter(column -> column.getBaseColumnName().equals("metadata")) | ||
.collect(onlyElement()); | ||
tupleDomain = buildTupleDomainColumnHandle(METADATA, metadata); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it the case that whenever "protocol" is non-null, then "metadata" is also guaranteed to be non-null ?
I'm wondering if this is safe given that this filter can exclude rows where "metadata" is null but "protocol" is non-null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"metadata" is a required field and it's not nullable as far as I read Delta protocol.
https://github.com/delta-io/delta/blob/master/PROTOCOL.md#checkpoints-1
Each row in the checkpoint corresponds to a single action. The checkpoint must contain all information regarding the following actions:
- The protocol version
- The metadata of the table
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds okay to me, would be great if someone more familiar with delta spec than me would also review this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ebyhr the above quote doesn't actually specify that the protocol and metadata actions are in the same Parquet file row group.
AFAIU, #17408 is about retrieving only the row groups which correspond to non-null filters.
For finding protocol or metadata there are different Parquet filters used:
Lines 244 to 251 in e1dc146
case METADATA -> { | |
field = "id"; | |
type = VARCHAR; | |
} | |
case PROTOCOL -> { | |
field = "minReaderVersion"; | |
type = BIGINT; | |
} |
In the absence of composed Parquet predicate pushdown, we're probably left with reading two times the checkpoint file (although this means reading actually two row groups - which are likely the same most of the time).
Description
#18423 caused a regression about #17408
This is a short-term solution until #19156
Release notes
(x) This is not user-visible or is docs only, and no release notes are required.