Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restore predicate pushdown of metadata field in CheckpointEntryIterator #19157

Closed
wants to merge 1 commit into from

Conversation

ebyhr
Copy link
Member

@ebyhr ebyhr commented Sep 25, 2023

Description

#18423 caused a regression about #17408
This is a short-term solution until #19156

Release notes

(x) This is not user-visible or is docs only, and no release notes are required.

@cla-bot cla-bot bot added the cla-signed label Sep 25, 2023
@github-actions github-actions bot added the delta-lake Delta Lake connector label Sep 25, 2023
HiveColumnHandle metadata = columns.stream()
.filter(column -> column.getBaseColumnName().equals("metadata"))
.collect(onlyElement());
tupleDomain = buildTupleDomainColumnHandle(METADATA, metadata);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it the case that whenever "protocol" is non-null, then "metadata" is also guaranteed to be non-null ?
I'm wondering if this is safe given that this filter can exclude rows where "metadata" is null but "protocol" is non-null.

Copy link
Member Author

@ebyhr ebyhr Sep 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"metadata" is a required field and it's not nullable as far as I read Delta protocol.

https://github.com/delta-io/delta/blob/master/PROTOCOL.md#checkpoints-1

Each row in the checkpoint corresponds to a single action. The checkpoint must contain all information regarding the following actions:

  • The protocol version
  • The metadata of the table

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds okay to me, would be great if someone more familiar with delta spec than me would also review this

Copy link
Contributor

@findinpath findinpath Sep 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ebyhr the above quote doesn't actually specify that the protocol and metadata actions are in the same Parquet file row group.
AFAIU, #17408 is about retrieving only the row groups which correspond to non-null filters.

For finding protocol or metadata there are different Parquet filters used:

case METADATA -> {
field = "id";
type = VARCHAR;
}
case PROTOCOL -> {
field = "minReaderVersion";
type = BIGINT;
}

In the absence of composed Parquet predicate pushdown, we're probably left with reading two times the checkpoint file (although this means reading actually two row groups - which are likely the same most of the time).

@ebyhr ebyhr closed this Oct 2, 2023
@ebyhr ebyhr deleted the ebi/delta-checkpoint-entry branch October 2, 2023 04:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed delta-lake Delta Lake connector
Development

Successfully merging this pull request may close these issues.

3 participants