Add all_manifests metadata table with tests #1241
base: main
Thanks for the PR! I've added some comments!
pyiceberg/table/inspect.py
Outdated
import pyarrow as pa

all_manifests_schema = get_manifests_schema()
all_manifests_schema = all_manifests_schema.append(pa.field("reference_snapshot_id", pa.int64(), nullable=False))
Interestingly, this column isn't in the documentation (https://iceberg.apache.org/docs/latest/spark-queries/#all-manifests),
only in the code: https://github.com/apache/iceberg/blame/2b55fef7cc2a249d864ac26d85a4923313d96a59/core/src/main/java/org/apache/iceberg/AllManifestsTable.java#L53-L54
Yes, it's not present in iceberg docs.
)

def manifests(self) -> "pa.Table":
wdyt about adding an optional snapshot_id here? That would let users look at the manifests for a specific snapshot, with the added benefit that all_manifests can iterate over all snapshot ids.
Yeah, I am aligned with this.
But there are two parameters that I'm passing to the _generate_manifests_table method - snapshot_id and a boolean flag indicating whether the output is for the all_manifests table, which adds the additional column to the all_manifests table.
So I'll need to add this second parameter to the manifests method as well. Thoughts?
yea i think that's fine since _generate_manifests_table is internal
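The shape being discussed in this thread can be sketched in plain Python. This is one possible reading of the conversation, not the code from the PR: `InspectSketch`, `Snapshot`, `_generate_manifests_rows`, and `is_all_manifests` are hypothetical names, rows are plain dicts instead of a `pa.Table`, and the flag is kept on the internal helper only.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Snapshot:
    """Hypothetical snapshot: an id plus the manifest paths it references."""

    snapshot_id: int
    manifest_paths: List[str] = field(default_factory=list)


class InspectSketch:
    """Hypothetical stand-in for the inspect class; rows are plain dicts."""

    def __init__(self, snapshots: List[Snapshot], current_snapshot_id: Optional[int]) -> None:
        self._snapshots = {s.snapshot_id: s for s in snapshots}
        self._current_snapshot_id = current_snapshot_id

    def _generate_manifests_rows(
        self, snapshot: Optional[Snapshot], is_all_manifests: bool = False
    ) -> List[Dict[str, Any]]:
        # Internal helper shared by both public methods.
        rows: List[Dict[str, Any]] = []
        if snapshot is None:
            return rows
        for path in snapshot.manifest_paths:
            row: Dict[str, Any] = {"path": path}
            if is_all_manifests:
                # Extra column that only the all_manifests output carries.
                row["reference_snapshot_id"] = snapshot.snapshot_id
            rows.append(row)
        return rows

    def manifests(self, snapshot_id: Optional[int] = None) -> List[Dict[str, Any]]:
        # Fall back to the current snapshot when no id is given.
        target = snapshot_id if snapshot_id is not None else self._current_snapshot_id
        snapshot = self._snapshots.get(target) if target is not None else None
        return self._generate_manifests_rows(snapshot)

    def all_manifests(self) -> List[Dict[str, Any]]:
        # Iterate over every snapshot and tag each row with its snapshot id.
        rows: List[Dict[str, Any]] = []
        for snapshot in self._snapshots.values():
            rows.extend(self._generate_manifests_rows(snapshot, is_all_manifests=True))
        return rows


inspect = InspectSketch(
    [Snapshot(1, ["m1.avro"]), Snapshot(2, ["m1.avro", "m2.avro"])],
    current_snapshot_id=2,
)
print(inspect.manifests(snapshot_id=1))
print(inspect.all_manifests())
```

Keeping the boolean on the private helper means the public `manifests` signature only gains the optional `snapshot_id`, which matches the point that the helper is internal.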
for column in df.column_names:
    for left, right in zip(lhs[column].to_list(), rhs[column].to_list()):
        assert left == right, f"Difference in column {column}: {left} != {right}"
nit: is it possible to use assert_frame_equal here?
Yes, making the change.
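As a rough illustration of the suggestion, the sketch below uses `pandas.testing.assert_frame_equal` in place of a column-by-column loop. It assumes both sides of the comparison are available as pandas DataFrames; the `path` and `length` columns and their values are made up for the example.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Made-up frames standing in for the two results being compared in the test.
lhs = pd.DataFrame({"path": ["m1.avro", "m2.avro"], "length": [10, 20]})
rhs = pd.DataFrame({"path": ["m1.avro", "m2.avro"], "length": [10, 20]})

# Passes silently when values, dtypes, and column order all match.
assert_frame_equal(lhs, rhs)

# A mismatch raises AssertionError describing where the frames differ.
different = rhs.assign(length=[10, 99])
try:
    assert_frame_equal(lhs, different)
    mismatch_detected = False
except AssertionError:
    mismatch_detected = True

print(mismatch_detected)
```

Compared with the hand-written loop, a single call also checks dtypes and the index, so it can fail in cases where an element-wise value comparison would pass.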
LGTM! Thanks for adding this metadata table!
@soumya-ghosh I see this one is still pending, are you still interested in getting this in?
Hey @Fokko, nothing major is pending on my side; I'm awaiting your approval. I will resolve the conflicts shortly.
LGTM! thank you for following up
Implements all_manifests metadata table - #1053
Have refactored the code to re-use the logic of the manifests metadata table.
The schema of all_manifests contains an additional column compared to the manifests table - reference_snapshot_id, which indicates the snapshot id those manifests are contained in.
Ref - Iceberg implementation - here and here