Reduce Iceberg metadata scans #7367

Merged: 3 commits, Apr 30, 2021


phd3 (Member) commented Mar 20, 2021

#7336

This PR

  • reduces scans of the metadata file on the read path by reusing Iceberg's TableMetadata
  • reduces scans of the snapshot file on the read path by lazily initializing the file iterator in IcebergMetadata#getTableProperties
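
A minimal sketch of the two ideas above, not the code in this PR. The names `LazyScanSketch`, `memoize`, and `lazyIterable` are hypothetical, and plain JDK types stand in for Iceberg's TableMetadata and the data file scan. The first helper loads a value once and reuses it; the second defers opening a scan until something actually iterates it.

```java
import java.util.function.Supplier;

// Hypothetical helper names for illustration; not part of Trino or Iceberg.
final class LazyScanSketch
{
    private LazyScanSketch() {}

    // Wraps a loader so the expensive load runs at most once and the result is reused.
    static <T> Supplier<T> memoize(Supplier<T> loader)
    {
        return new Supplier<T>()
        {
            private T value;
            private boolean loaded;

            @Override
            public synchronized T get()
            {
                if (!loaded) {
                    value = loader.get();
                    loaded = true;
                }
                return value;
            }
        };
    }

    // Defers opening the underlying scan until iterator() is called.
    // Nothing is read if the caller never iterates.
    static <T> Iterable<T> lazyIterable(Supplier<Iterable<T>> scan)
    {
        return () -> scan.get().iterator();
    }
}
```

With this shape, a caller that only needs the cheap parts of the table properties never triggers the file listing, and every consumer of the memoized supplier sees the same loaded object.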

Some other related things to discuss:

  • Iceberg's TableMetadata does not provide fully immutable objects (e.g. there is some lazy loading in BaseSnapshot, Schema, etc.). On a cursory look, the Iceberg code seems to use synchronization (or to rely on reference assignments being atomic and marking them as volatile), but I'm not sure it's ideal to trust that, as the spec and API docs don't say anything. Adding a wrapper around Iceberg's TableMetadata isn't very useful, since we need the full TableMetadata object in IcebergSplitManager anyway. (It would require a bigger refactoring if we wanted to create replicas of Iceberg objects in Trino.)

  • getTableStatistics also invokes planFiles, which causes repeated iterations over the metadata, snapshot, and manifest files. However, invocations of getTableStatistics during planning may also have different predicates. This PR doesn't cache or optimize it.

  • Streams.stream(combinedScanIterable) seems to cause manifest files to be read twice, but it shouldn't need to. IMO this needs to be fixed in Iceberg.
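
One way to check a suspicion like the last point is to count how often a consumer asks an Iterable for a fresh iterator, since each such request on a scan-backed iterable may re-read its source. A rough sketch, assuming a hypothetical wrapper `CountingIterable` (not an Iceberg, Guava, or Trino class) around any Iterable:

```java
import java.util.Iterator;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical diagnostic wrapper; counts how many times iterator() is requested.
final class CountingIterable<T> implements Iterable<T>
{
    private final Iterable<T> delegate;
    private final AtomicInteger iteratorCalls = new AtomicInteger();

    CountingIterable(Iterable<T> delegate)
    {
        this.delegate = delegate;
    }

    @Override
    public Iterator<T> iterator()
    {
        // Each call here would correspond to one pass over the underlying source.
        iteratorCalls.incrementAndGet();
        return delegate.iterator();
    }

    int iteratorCalls()
    {
        return iteratorCalls.get();
    }
}
```

Because the wrapper does not override `spliterator()`, the JDK default builds the spliterator from `iterator()`, so stream-based consumers are counted as well as for-each loops.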

@cla-bot cla-bot bot added the cla-signed label Mar 20, 2021
@phd3 phd3 added the WIP label Mar 20, 2021
sopel39 (Member) commented Mar 22, 2021

Also, io.trino.plugin.iceberg.IcebergMetadata#getMaterializedView calls metastore.getTable twice (once in the main body and again in isMaterializedView).
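
A possible shape for avoiding the second lookup, sketched with hypothetical stand-in types (`Table`, `CountingMetastore`, and both methods are illustrative only, not Trino's metastore API): fetch the table once and run the materialized-view check on the object already in hand.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;

// All names below are hypothetical stand-ins, not Trino classes.
final class SingleLookupSketch
{
    private SingleLookupSketch() {}

    static final class Table
    {
        final String name;
        final boolean materializedView;

        Table(String name, boolean materializedView)
        {
            this.name = name;
            this.materializedView = materializedView;
        }
    }

    // Stand-in metastore that counts lookups so the saving is observable.
    static final class CountingMetastore
    {
        private final Map<String, Table> tables;
        private final AtomicInteger lookups = new AtomicInteger();

        CountingMetastore(Map<String, Table> tables)
        {
            this.tables = tables;
        }

        Optional<Table> getTable(String name)
        {
            lookups.incrementAndGet();
            return Optional.ofNullable(tables.get(name));
        }

        int lookups()
        {
            return lookups.get();
        }
    }

    // Fetches the table once, then checks the fetched object instead of looking it up again.
    static Optional<String> getMaterializedView(CountingMetastore metastore, String name)
    {
        return metastore.getTable(name)
                .filter(SingleLookupSketch::isMaterializedView)
                .map(table -> table.name);
    }

    // Takes the already-fetched table rather than a name, so no second metastore call is needed.
    private static boolean isMaterializedView(Table table)
    {
        return table.materializedView;
    }
}
```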

@phd3 phd3 force-pushed the fix-iceberg-metascans branch 4 times, most recently from 59413d1 to 809e773 Compare March 29, 2021 02:40
@phd3 phd3 removed the WIP label Mar 29, 2021
@phd3 phd3 changed the title [WIP] Reduce metadata scans Reduce metadata scans Mar 29, 2021
@phd3 phd3 requested a review from electrum March 29, 2021 05:13
phd3 (Member, Author) commented Mar 29, 2021

@sopel39 thanks, will fix it in a separate PR

@findepi findepi changed the title Reduce metadata scans Reduce Iceberg metadata scans Mar 29, 2021
@phd3 phd3 force-pushed the fix-iceberg-metascans branch 2 times, most recently from b75337f to 9355941 Compare April 5, 2021 05:35
@phd3 phd3 requested a review from electrum April 5, 2021 15:48
Parth-Brahmbhatt (Member) commented

I haven't looked at the code, but I recommend looking at https://iceberg.apache.org/javadoc/0.11.0/org/apache/iceberg/CachingCatalog.html, moving the connector to the catalog APIs, and just using this.

electrum (Member) left a comment

Looks great!

return TupleDomain.fromFixedValues(partitionValues);
});
Iterable<TupleDomain<ColumnHandle>> discreteTupleDomain = Iterables.transform(
// Avoid invoking tableScan.planFiles() eagerly which fetches metadata file and manifest lists. It
Member

Is the comment below for new ConnectorTableProperties still accurate?

Member, Author

Yes, I think so. That comment talks about narrowing down enforcedPredicate by using the discretePredicates, which we're still not doing, for the same reason as before.

Cache Iceberg's TableMetadata object so that contents
of metadata file can be reused.
Lazily initialize iterator that lists data files in
getTableProperties.
phd3 (Member, Author) commented Apr 29, 2021

@electrum applied comments.

@phd3 phd3 requested review from electrum and removed request for electrum April 29, 2021 01:43
@phd3 phd3 merged commit b8bcec2 into trinodb:master Apr 30, 2021
phd3 (Member, Author) commented Apr 30, 2021

Merged #7367.

@phd3 phd3 mentioned this pull request Apr 30, 2021
9 tasks
@phd3 phd3 added this to the 356 milestone Apr 30, 2021

4 participants