Add all_manifests metadata table with tests #1241

Open: 6 commits to be merged into base branch main

Conversation

soumya-ghosh
Contributor

@soumya-ghosh soumya-ghosh commented Oct 20, 2024

Implements all_manifests metadata table - #1053

I have refactored the code to reuse the logic of the manifests metadata table.

The schema of all_manifests has one additional column compared to the manifests table: reference_snapshot_id, which indicates the id of the snapshot that contains each manifest.
Ref - Iceberg implementation - here and here

Contributor

@kevinjqliu kevinjqliu left a comment

Thanks for the PR! I've added some comments!

pyiceberg/table/inspect.py (outdated)
import pyarrow as pa

all_manifests_schema = get_manifests_schema()
all_manifests_schema = all_manifests_schema.append(pa.field("reference_snapshot_id", pa.int64(), nullable=False))
Contributor

Contributor Author

Yes, it's not present in iceberg docs.

pyiceberg/table/inspect.py
)

def manifests(self) -> "pa.Table":
Contributor

wdyt about adding an optional snapshot_id parameter here? That would let users look at the manifests for a specific snapshot, with the added benefit that all_manifests could simply iterate over all snapshot ids.

Contributor Author

Yeah, I am aligned with this.
However, I'm passing two parameters to the _generate_manifests_table method: snapshot_id, and a boolean flag indicating whether the output is for the all_manifests table, in which case the additional column is added.
So I'll need to add this second parameter to the manifests method as well. Thoughts?

Contributor

yea i think thats fine since _generate_manifests_table is internal
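
One possible reading of the design discussed in this thread, sketched in plain Python so it runs without pyiceberg. Everything here is an assumption for illustration: Snapshot, InspectSketch, _generate_manifests_rows and is_all_manifests_table are hypothetical names, rows are plain dicts rather than a pa.Table, and the flag is kept on the internal helper only, which may differ from what the PR actually does.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical stand-in for a table snapshot and its manifest list."""
    snapshot_id: int
    manifest_paths: List[str]


class InspectSketch:
    def __init__(self, snapshots: List[Snapshot], current_snapshot_id: Optional[int]) -> None:
        self._snapshots = {s.snapshot_id: s for s in snapshots}
        self._current_snapshot_id = current_snapshot_id

    def _generate_manifests_rows(
        self, snapshot_id: Optional[int], is_all_manifests_table: bool = False
    ) -> List[Dict[str, object]]:
        # Internal helper shared by both public methods.
        if snapshot_id is None:
            return []
        snapshot = self._snapshots[snapshot_id]
        rows: List[Dict[str, object]] = []
        for path in snapshot.manifest_paths:
            row: Dict[str, object] = {"path": path}
            if is_all_manifests_table:
                # Extra column that only the all_manifests output carries.
                row["reference_snapshot_id"] = snapshot.snapshot_id
            rows.append(row)
        return rows

    def manifests(self, snapshot_id: Optional[int] = None) -> List[Dict[str, object]]:
        # Fall back to the current snapshot when no id is given.
        target = snapshot_id if snapshot_id is not None else self._current_snapshot_id
        return self._generate_manifests_rows(target)

    def all_manifests(self) -> List[Dict[str, object]]:
        # Iterate over every snapshot and tag each row with its snapshot id.
        rows: List[Dict[str, object]] = []
        for sid in self._snapshots:
            rows.extend(self._generate_manifests_rows(sid, is_all_manifests_table=True))
        return rows


table_inspect = InspectSketch(
    [Snapshot(1, ["m1.avro"]), Snapshot(2, ["m1.avro", "m2.avro"])],
    current_snapshot_id=2,
)
print(table_inspect.manifests(snapshot_id=1))
print(table_inspect.all_manifests())
```

Keeping the boolean flag on the private helper means the public manifests signature only gains the optional snapshot_id, which seems to be the point of the reply above about the helper being internal.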

Comment on lines 938 to 940
for column in df.column_names:
    for left, right in zip(lhs[column].to_list(), rhs[column].to_list()):
        assert left == right, f"Difference in column {column}: {left} != {right}"
Contributor

nit: is it possible to use assert_frame_equal here?

Contributor Author

Yes, making the change.
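
A minimal sketch of what the suggested change could look like, assuming lhs and rhs are pandas DataFrames. The column names and values are made up for illustration and are not taken from the PR's test data.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Hypothetical frames standing in for the two results compared in the test.
lhs = pd.DataFrame({"path": ["m1.avro", "m2.avro"], "added_snapshot_id": [1, 2]})
rhs = pd.DataFrame({"path": ["m1.avro", "m2.avro"], "added_snapshot_id": [1, 2]})

# Replaces the nested per-column, per-value loop with a single call;
# raises AssertionError with a description of the first difference found.
assert_frame_equal(lhs, rhs)
```

Unlike the hand-written loop, assert_frame_equal also checks column order, index, and dtypes by default, so it can be stricter than a value-by-value comparison.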

Contributor

@kevinjqliu kevinjqliu left a comment

LGTM! Thanks for adding this metadata table!

@Fokko
Contributor

Fokko commented Nov 19, 2024

@soumya-ghosh I see this one is still pending, are you still interested in getting this in?

@soumya-ghosh
Contributor Author

Hey @Fokko nothing major is pending on my side, awaiting your approval. I will resolve the conflicts shortly.

Contributor

@kevinjqliu kevinjqliu left a comment

LGTM! thank you for following up
