
fix: schema check of iceberg logical types #856

Closed
wants to merge 2 commits into from

Conversation

raphaelauv

close: #855

@raphaelauv
Author

This is a trivial implementation, probably not enough. Thanks all.

@raphaelauv
Author

@kevinjqliu wdyt?

@kevinjqliu
Contributor

I'm +1 on this change in theory. I feel like _check_schema_compatible should be as non-blocking as possible, i.e. if pyarrow can write the dataset, _check_schema_compatible should allow it.

I wonder if there's a more generalized solution for this instead of hardcoding UUID to FixedType(16) conversion.
@syun64 @HonahX @Fokko wdyt
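The "hardcoded" approach under discussion could be sketched roughly as follows. This is an illustrative stand-in only: the type names and the `LOGICAL_TO_PHYSICAL` table are simplified assumptions, not PyIceberg's actual classes or API.

```python
# Illustrative sketch of the hardcoded conversion kevinjqliu refers to:
# a fixed table mapping an Iceberg logical type to the physical type a
# writer actually produces. Only UUID is special-cased, since it is
# stored as a 16-byte fixed-length binary value.
LOGICAL_TO_PHYSICAL = {
    "uuid": "fixed[16]",
}

def physical_type(iceberg_type: str) -> str:
    """Return the physical type a writer would produce for a logical type."""
    return LOGICAL_TO_PHYSICAL.get(iceberg_type, iceberg_type)

def types_compatible(table_type: str, task_type: str) -> bool:
    """A schema check that tolerates logical/physical mismatches:
    the types match either directly or after logical-to-physical mapping."""
    return table_type == task_type or physical_type(table_type) == task_type
```

The generalized alternative kevinjqliu suggests would replace the single `"uuid"` entry with a complete logical-to-physical mapping, so no individual type needs to be special-cased at the call site.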

Contributor

@Fokko Fokko left a comment


Thanks @raphaelauv for jumping on this right away.

Could we make this part of the pyarrow_to_schema method? I think we could look up fields there to double-check whether we need to apply any logical types.

This way we can also fix issues like #830.

pyiceberg/table/__init__.py (outdated, resolved)
@raphaelauv
Author

Thanks for the review @Fokko. To make this part of pyarrow_to_schema we would have to change a lot of things in order to propagate the table_schema (which is an Iceberg schema). That is what I tried first; then I reverted it and made a separate function, _apply_logical_conversion.
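The separate-function approach could look roughly like this sketch. It uses plain dicts as stand-ins for Iceberg Schema objects (the real _apply_logical_conversion in this PR operates on pyiceberg schemas; every name and structure below is a simplified assumption).

```python
# Simplified sketch in the spirit of _apply_logical_conversion: walk the
# table schema and the task (file) schema in parallel, and wherever the
# table declares a logical type such as UUID while the file carries the
# matching physical type, rewrite the table side to the physical type so
# a later compatibility check passes. Schemas are modeled as dicts
# mapping field name -> type, with a nested dict for structs.
LOGICAL_TO_PHYSICAL = {"uuid": "fixed[16]"}

def apply_logical_conversion(table_schema: dict, task_schema: dict) -> dict:
    converted = {}
    for name, table_type in table_schema.items():
        task_type = task_schema.get(name)
        if isinstance(table_type, dict) and isinstance(task_type, dict):
            # Recurse into structs so nested UUID fields are handled too.
            converted[name] = apply_logical_conversion(table_type, task_type)
        elif (
            table_type in LOGICAL_TO_PHYSICAL
            and task_type == LOGICAL_TO_PHYSICAL[table_type]
        ):
            converted[name] = LOGICAL_TO_PHYSICAL[table_type]
        else:
            converted[name] = table_type
    return converted
```

This mirrors the two test cases Fokko asks for below: one top-level UUID field and one nested inside a struct.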

Contributor

@Fokko Fokko left a comment


Hey @raphaelauv this looks like a great start! 🙌

Could you add two tests as well? One with a UUID as a top level field, and one where it is nested inside of a struct? Thanks!

@@ -153,7 +155,21 @@
ALWAYS_TRUE = AlwaysTrue()
TABLE_ROOT_ID = -1

_JAVA_LONG_MAX = 9223372036854775807

def _apply_logical_conversion(table_schema: Schema, task_schema: Schema) -> Schema:
Contributor


We can defer this to a later PR, but just want to put it out here.

In PyIceberg we have a bit of an obsession with the visitor pattern. Using the SchemaWithPartnerVisitor you can traverse two schemas at once. An example is the ArrowProjectionVisitor, which can be found in pyarrow.py. The ArrowAccessor does the lookups by ID, but I think here name makes more sense since we don't have the IDs (yet).
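The partner-visitor idea can be illustrated with a toy example: traverse one schema while an accessor resolves the matching node in a second schema by field name. This is a deliberately minimal sketch; PyIceberg's actual SchemaWithPartnerVisitor and ArrowAccessor have a richer interface and resolve partners by field ID, as the comment notes.

```python
# Toy illustration of the two-schema traversal: the accessor finds the
# "partner" of each field in the other schema (here by name), and the
# visitor callback sees both sides of every leaf field at once.
class NamePartnerAccessor:
    """Looks up the partner field in the other schema by field name."""

    def field_partner(self, partner_struct, name):
        if partner_struct is None:
            return None
        return partner_struct.get(name)

def visit_with_partner(schema, partner, accessor, visitor):
    """Walk `schema` (dict of name -> type) alongside its partner schema."""
    for name, field_type in schema.items():
        field_partner = accessor.field_partner(partner, name)
        if isinstance(field_type, dict):
            # Structs recurse with the matching partner struct.
            visit_with_partner(field_type, field_partner, accessor, visitor)
        else:
            visitor(name, field_type, field_partner)

def collect_mismatches(mismatches):
    """Example visitor: record leaf fields whose partner type differs."""
    def visit(name, field_type, partner_type):
        if partner_type is not None and partner_type != field_type:
            mismatches.append((name, field_type, partner_type))
    return visit
```

With a visitor like this, the logical-type fixup becomes one visitor implementation rather than a hand-rolled recursion, which is presumably the appeal of deferring to the pattern in a later PR.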

@raphaelauv
Copy link
Author

Fixed by #921

@raphaelauv raphaelauv closed this Jul 16, 2024
@raphaelauv raphaelauv deleted the fix/check_logical_types_arrow_iceberg branch July 16, 2024 13:32
Successfully merging this pull request may close these issues.

write UUID fail on _check_schema_compatible
3 participants