Python: Infer Iceberg schema from the Parquet file #6997
Why combine the lookup into a single method when doc is ignored most of the time?
I separated it into `_get_field_id` and `_get_field_doc`.
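A minimal sketch of what such a split can look like, assuming the field ID and doc are stored in the PyArrow field metadata under the usual Parquet keys (the key constants and helper bodies here are illustrative, not necessarily the PR's final code):

```python
from typing import Optional

import pyarrow as pa

# Illustrative metadata keys; the PR's constants may differ.
FIELD_ID_KEY = b"PARQUET:field_id"
FIELD_DOC_KEY = b"doc"


def _get_field_id(field: pa.Field) -> Optional[int]:
    """Return the Iceberg field ID stored in the PyArrow field metadata, if any."""
    if field.metadata and FIELD_ID_KEY in field.metadata:
        return int(field.metadata[FIELD_ID_KEY])
    return None


def _get_field_doc(field: pa.Field) -> Optional[str]:
    """Return the optional doc string stored in the PyArrow field metadata, if any."""
    if field.metadata and FIELD_DOC_KEY in field.metadata:
        return field.metadata[FIELD_DOC_KEY].decode()
    return None
```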
Do we need to check time or timestamp precision?
@Fokko, I think this could be a problem. We should make sure there are good tests for these types.
The read code operates using PyIceberg schemas, so it is going to convert PyArrow to PyIceberg, prune columns, and then convert back to request a projection from PyArrow. If there is a lossy conversion, that will cause a problem. For example, if a time type is in millis but is converted without that information, then the requested type will be a timestamp in micros, and that may cause a problem.
When looking at this, I found another problem: rather than requesting a specific schema, the projected Iceberg schema is used to request top-level columns.
I assume that's going to read full nested structures, because the only information it has is the top-level column name. If that happens, I'm concerned that we're reading leaf columns that we don't actually need. I think projection will correctly ignore them, but it would be nice to avoid reading them.
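To make the millis example concrete, here is a small illustration (not code from the PR, just plain PyArrow) of the cast the read path would implicitly depend on if the unit were lost during conversion:

```python
import pyarrow as pa

# A time column as it might be stored on disk: millisecond precision.
millis = pa.array([1_000, 2_000], type=pa.time32("ms"))

# If the Iceberg schema loses the unit, the projection will request
# microseconds, so reading effectively depends on this cast:
micros = millis.cast(pa.time64("us"))
print(micros.type)  # time64[us]
```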
Good point, I think we should have round-trip tests here, where we go from Iceberg to PyArrow, and back to Iceberg.
I agree; I would have to look into the details of whether we prune the nested fields properly. In the future I'd like to replace this with the custom evolution strategy: apache/arrow#33972 (comment)
Thank you for your explanation. I refactored the code to make it consistent with the primitive conversion in the Iceberg-to-PyArrow direction. Currently, it only allows `UTC` as the timezone and `us` as the unit.
I also added some round-trip tests for these types.
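A round-trip test along these lines might look as follows. `schema_to_pyarrow` exists in `pyiceberg.io.pyarrow`, while `pyarrow_to_schema` is the visitor this PR introduces, so the exact import path and test shape are assumptions:

```python
from pyiceberg.io.pyarrow import pyarrow_to_schema, schema_to_pyarrow
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, TimestamptzType, TimeType


def test_time_types_round_trip() -> None:
    original = Schema(
        NestedField(field_id=1, name="ts", field_type=TimestamptzType(), required=False),
        NestedField(field_id=2, name="t", field_type=TimeType(), required=False),
    )
    # Iceberg -> PyArrow -> Iceberg should be lossless for us/UTC types.
    # Comparing structs sidesteps schema-id differences.
    round_tripped = pyarrow_to_schema(schema_to_pyarrow(original))
    assert round_tripped.as_struct() == original.as_struct()
```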
This still seems suspicious to me. While it's now correct, how does PyArrow read Parquet files that have other time or timestamp representations?
Thank you for your review and for bringing up your concerns. I'd like to understand better what you find suspicious about PyArrow's ability to read Parquet files with different time or timestamp representations.
From what I understand, Iceberg's `TimeType`, `TimestampType`, and `TimestamptzType` require `us` and `UTC`, and my current conversion ensures that no data is lost. In this case, PyArrow can support reading non-UTC timezones and `s`, `ms`, and `us` precision, but it does not support nanosecond precision, since the final requested type during projection will be `us` and `UTC` (see iceberg/python/pyiceberg/io/pyarrow.py, lines 397 to 404 in 283107d).
I chose to restrict the precision to `us` and the timezone to `UTC` because the Iceberg specification requires all stored time/timestamp values to use this precision and timezone. Since the `pyarrow_to_schema` visitor is used to read an Iceberg table's data files, I believe we should only support `us` and `UTC`. However, I am not entirely sure about this: supporting other precisions and timezones here would need more discussion and changes, so how about creating another PR if needed to address these concerns?
Thank you again for your feedback, and please let me know if you have any further questions or concerns.
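The restriction described above amounts to something like the following sketch (a simplified stand-in, not the PR's actual visitor code; see `pyiceberg/io/pyarrow.py` for the real implementation):

```python
import pyarrow as pa

from pyiceberg.types import IcebergType, TimestampType, TimestamptzType, TimeType


def _convert_temporal(primitive: pa.DataType) -> IcebergType:
    """Map a PyArrow temporal type to Iceberg, accepting only us precision and UTC."""
    if pa.types.is_time64(primitive) and primitive.unit == "us":
        return TimeType()
    if pa.types.is_timestamp(primitive) and primitive.unit == "us":
        if primitive.tz is None:
            return TimestampType()
        if primitive.tz == "UTC":
            return TimestamptzType()
    raise TypeError(f"Unsupported temporal type: {primitive}")
```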
If there is an Iceberg schema in the file, I think we can assume that it is written according to the spec. With the current check it is correct, and we could even simplify it. The tests are in place, so I think we can resolve this issue.
Should we also accept a zero offset, like `+00:00`?
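If zero offsets were accepted, the timezone check could treat them as aliases of UTC, along these lines (a sketch of the idea, not the PR's final behavior):

```python
# Timezone strings that denote UTC; "+00:00" is the zero offset in question.
UTC_ALIASES = {"UTC", "+00:00"}


def _is_utc(tz: str) -> bool:
    """Return True when the timezone string denotes UTC."""
    return tz in UTC_ALIASES
```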
Should we create an issue for this, and still raise an exception?
I created the issue: #7451.
But I am not sure what the proper way to raise an exception is in this case. Based on my understanding, name mapping is also needed if a portion of the Parquet fields are missing field IDs. However, in this case, `pyarrow_to_schema` can still generate a valid Iceberg schema for the rest of the Parquet fields, so it seems we should not raise an exception here. Should we only raise an exception when no field IDs exist in the whole data file? I think we could also log a warning when a PyArrow field is missing a field ID. What do you think?
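The policy being proposed could look roughly like this (hypothetical code, reusing the same illustrative metadata key as the `_get_field_id` sketch earlier in the thread):

```python
import logging
from typing import Optional

import pyarrow as pa

logger = logging.getLogger(__name__)


def _field_id(field: pa.Field) -> Optional[int]:
    # Same illustrative metadata key as in the _get_field_id sketch above.
    if field.metadata and b"PARQUET:field_id" in field.metadata:
        return int(field.metadata[b"PARQUET:field_id"])
    return None


def check_field_ids(schema: pa.Schema) -> None:
    """Raise only when the file carries no field IDs at all; warn per missing field."""
    ids = {field.name: _field_id(field) for field in schema}
    if all(field_id is None for field_id in ids.values()):
        raise ValueError(
            "No Iceberg field IDs found in the Parquet schema; a name mapping is required"
        )
    for name, field_id in ids.items():
        if field_id is None:
            logger.warning("Field %r has no field ID and will be skipped during conversion", name)
```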