Core: Support aggregated basic stats in partition summary #11669

Closed
wants to merge 1 commit into from


deniskuzZ
Member

@deniskuzZ deniskuzZ commented Nov 27, 2024

@github-actions github-actions bot added the core label Nov 27, 2024
@pvary
Contributor

pvary commented Nov 27, 2024

@deniskuzZ: Could you please provide a short description of what data is stored in the summary and in what format?

I think it is important to understand the cost of keeping this stat up to date. How costly is it to calculate, and how much does this change increase the data size?

@findepi: Could this be useful for Trino? Does Trino have an optimization like this?

@pvary
Contributor

pvary commented Nov 27, 2024

This discussion could be relevant here too: https://lists.apache.org/thread/0q1csnkfg8jc11zo1dlssjkr4v8s8zz0

@deniskuzZ
Member Author

@pvary, unfortunately, that won't work. I was looking for an easy way to get basic partition stats, but I missed that Iceberg only keeps the changed partitions in a SnapshotSummary. Aggregating with just the previous snapshot's values is not enough; it requires looping through all the snapshots.

table.newFastAppend().appendFile(FILE_A).commit();
partitions.data_bucket=0 -> added-data-files=1,added-records=1,added-files-size=10,total-records=3,total-files-size=30,total-data-files=3,total-delete-files=0,total-position-deletes=0,total-equality-deletes=0

table.newFastAppend().appendFile(FILE_B).commit();
partitions.data_bucket=1 -> added-data-files=1,added-records=1,added-files-size=10,total-records=2,total-files-size=20,total-data-files=2,total-delete-files=0,total-position-deletes=0,total-equality-deletes=0

table.newFastAppend().appendFile(FILE_A).commit();
partitions.data_bucket=0 -> added-data-files=1,added-records=1,added-files-size=10,total-records=3,total-files-size=30,total-data-files=3,total-delete-files=0,total-position-deletes=0,total-equality-deletes=0

Do you think it's worth doing this in SnapshotSummary, or is there some simpler/better way, such as creating or updating the partition stats Puffin file right after the commit?
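
To make the limitation concrete, here is a minimal, self-contained sketch of the lookup it implies. It does not call any Iceberg API: the `history` list is a hypothetical stand-in for the per-snapshot partition summaries (oldest first), each holding entries only for the partitions changed in that commit, and the class and method names (`PartitionSummaryLookup`, `latestEntry`, `nextTotal`) as well as the counter values are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class PartitionSummaryLookup {

  /** Parses a "key=value,key=value" string into a map of long counters. */
  static Map<String, Long> parse(String summary) {
    Map<String, Long> counters = new LinkedHashMap<>();
    for (String pair : summary.split(",")) {
      String[] kv = pair.split("=", 2);
      counters.put(kv[0].trim(), Long.parseLong(kv[1].trim()));
    }
    return counters;
  }

  /**
   * Walks the history from newest to oldest and returns the counters of the most
   * recent snapshot that recorded the given partition, or empty if none did.
   */
  static Optional<Map<String, Long>> latestEntry(
      List<Map<String, String>> history, String partition) {
    for (int i = history.size() - 1; i >= 0; i--) {
      String summary = history.get(i).get(partition);
      if (summary != null) {
        return Optional.of(parse(summary));
      }
    }
    return Optional.empty();
  }

  /** Next running total = last recorded total for the partition (0 if none) + added delta. */
  static long nextTotal(
      List<Map<String, String>> history, String partition, String totalKey, long added) {
    long previous =
        latestEntry(history, partition).map(c -> c.getOrDefault(totalKey, 0L)).orElse(0L);
    return previous + added;
  }

  public static void main(String[] args) {
    // Hypothetical history: each snapshot lists only the partition it changed.
    List<Map<String, String>> history =
        List.of(
            Map.of("data_bucket=0", "added-records=1,total-records=1"),
            Map.of("data_bucket=1", "added-records=1,total-records=1"));

    // The direct parent (last snapshot) has no entry for data_bucket=0 ...
    Map<String, String> parent = history.get(history.size() - 1);
    System.out.println("parent has data_bucket=0: " + parent.containsKey("data_bucket=0"));

    // ... so the previous total has to be found by walking further back.
    System.out.println(
        "next total-records for data_bucket=0: "
            + nextTotal(history, "data_bucket=0", "total-records", 1));
  }
}
```

In the worst case the walk visits every snapshot for a partition that has not changed in a long time, which is the cost concern raised above.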

@deniskuzZ deniskuzZ closed this Nov 28, 2024
@deniskuzZ
Member Author

Found partition stats tracker issue #8450 with the following design doc: https://docs.google.com/document/d/1vaufuD47kMijz97LxM67X8OX-W2Wq7nmlz3jRo8J5Qk
But it doesn't seem to be completed yet: #11216

@pvary
Contributor

pvary commented Nov 28, 2024

And here is the relevant mailing list thread: https://lists.apache.org/thread/knl1ol7s1o2p7rglgl2mm8c5mc2pk0sx

@ajantha-bhat: Are you still working on the proposal?


2 participants