Python: Infer Iceberg schema from the Parquet file #6997
Conversation
Thanks @JonasJ-ap for creating this PR, looks good! I've added some suggestions, let me know what you think!
@Fokko Thank you for reviewing this. I added my concerns about some details below.
Thanks @JonasJ-ap for changing this to `@singledispatch`. I think we're almost there; I've added some more suggestions. Let me know what you think!
python/pyiceberg/io/pyarrow.py
"Iceberg schema is not embedded into the Parquet file, see https://github.com/apache/iceberg/issues/6505"
)
file_schema = Schema.parse_raw(schema_raw)
# TODO: if field_ids are not present, another fallback level should be implemented to look them up in the table schema
This is Iceberg's "name mapping" feature.
The basics:
- Name mapping is applied ONLY if there are no field IDs in the Parquet file
- If a field is present in the name mapping, the ID from the mapping is used. Multiple names can be mapped to the same ID
- If a field is not present in the name mapping, it is omitted.
For PyIceberg, the name mapping would probably be implemented by pre-processing the PyArrow schema to add the `PYTHON:field_id` properties. That way this code doesn't need to change.
As a result, if there is no field ID for any field, it should be omitted from the Iceberg schema.
python/pyiceberg/io/pyarrow.py
    return TimeType()
elif pa.types.is_timestamp(primitive):
    primitive = cast(pa.TimestampType, primitive)
    if primitive.tz is not None:
@Fokko, can we handle non-UTC time zones, or should we fail if this is not UTC or +00:00?
We can handle non-UTC timezones.
python/pyiceberg/io/pyarrow.py
    return StringType()
elif pa.types.is_date(primitive):
    return DateType()
elif pa.types.is_time(primitive):
Do we need to check time or timestamp precision?
@Fokko, I think this could be a problem. We should make sure there are good tests for these types.
The read code operates using PyIceberg schemas. That means it converts PyArrow to PyIceberg, prunes columns, and then converts back to request a projection from PyArrow. If there is a lossy conversion, that will cause a problem. For example, if a time type is in millis but is converted without that information, then the requested type will be a timestamp in micros, and that may cause a problem.
When looking at this, I found another problem. Rather than requesting a specific schema, the projected Iceberg schema is used to request top-level columns:
arrow_table = pq.read_table(
source=fout,
schema=parquet_schema,
pre_buffer=True,
buffer_size=8 * ONE_MEGABYTE,
filters=pyarrow_filter,
columns=[col.name for col in file_project_schema.columns],
)
I assume that's going to read full nested structures because the only information it has is the top-level column name. If that happens, I'm concerned that we're reading leaf columns that we don't actually need to. I think projection will correctly ignore them, but it would be nice to avoid reading them.
> The read code operates using PyIceberg schemas. So that is going to convert PyArrow to PyIceberg, prune columns, and the convert back to request a projection from PyArrow. If there is a lossy conversion then that will cause a problem. For example if a time type is in millis but is converted without that information, then the requested type will be a timestamp in micros and that may cause a problem.
Good point, I think we should have round-trip tests here, where we go from Iceberg to PyArrow, and back to Iceberg.
> I assume that's going to read full nested structures because the only information it has is the top-level column name. If that happens, I'm concerned that we're reading leaf columns that we don't actually need to. I think projection will correctly ignore them, but it would be nice to avoid reading them.
I agree. I would have to look into the details of whether we prune the nested fields properly. In the future I'd like to replace this with the custom evolution strategy: apache/arrow#33972 (comment)
Thank you for your explanation. I refactored the code to make it consistent with the primitive handling in the Iceberg-to-PyArrow conversion. Currently, it only allows `UTC` as the timezone and `us` as the unit.
I also added some round-trip tests for these types.
This still seems suspicious to me. While it's now correct, how does PyArrow read Parquet files that have other time or timestamp representations?
Thank you for your review and for bringing up your concerns. I'd like to understand better what you find suspicious about PyArrow's ability to read Parquet files with different time or timestamp representations.
From what I understand, Iceberg's TimeType, TimestampType, and TimestamptzType require `us` and `UTC`, and my current conversion ensures that no data is lost. In this case, PyArrow can support reading non-UTC timezones and `s`, `ms`, and `us` precision, but it does not support nanosecond precision, since the final requested type during projection will be `us` and `UTC`:
iceberg/python/pyiceberg/io/pyarrow.py
Lines 397 to 404 in 283107d
def visit_time(self, _: TimeType) -> pa.DataType:
    return pa.time64("us")

def visit_timestamp(self, _: TimestampType) -> pa.DataType:
    return pa.timestamp(unit="us")

def visit_timestampz(self, _: TimestamptzType) -> pa.DataType:
    return pa.timestamp(unit="us", tz="UTC")
I chose to restrict the precision to `us` and the timezone to `UTC` because the Iceberg specification requires all stored time/timestamp values to use this precision and timezone. Since the `pyarrow_to_schema` visitor is used to read an Iceberg table's data files, I believe we should only support `us` and `UTC`.
However, I am not very sure about this. Regarding support for other precisions and timezones here, I think more discussion and modifications may be needed if we want to add them. How about creating another PR if needed to address these concerns?
Thank you again for your feedback, and please let me know if you have any further questions or concerns.
If there is an Iceberg schema in the file, I think we can assume that it is written according to the spec:
With the current check, it is correct:
elif pa.types.is_time(primitive):
if isinstance(primitive, pa.Time64Type) and primitive.unit == "us":
return TimeType()
We could even simplify it:
elif isinstance(primitive, pa.Time64Type) and primitive.unit == "us":
return TimeType()
The tests are in place:
def test_pyarrow_time32_to_iceberg() -> None:
pyarrow_type = pa.time32("ms")
with pytest.raises(TypeError, match=re.escape("Unsupported type: time32[ms]")):
visit_pyarrow(pyarrow_type, _ConvertToIceberg())
pyarrow_type = pa.time32("s")
with pytest.raises(TypeError, match=re.escape("Unsupported type: time32[s]")):
visit_pyarrow(pyarrow_type, _ConvertToIceberg())
def test_pyarrow_time64_us_to_iceberg() -> None:
pyarrow_type = pa.time64("us")
converted_iceberg_type = visit_pyarrow(pyarrow_type, _ConvertToIceberg())
assert converted_iceberg_type == TimeType()
assert visit(converted_iceberg_type, _ConvertToArrowSchema()) == pyarrow_type
def test_pyarrow_time64_ns_to_iceberg() -> None:
pyarrow_type = pa.time64("ns")
with pytest.raises(TypeError, match=re.escape("Unsupported type: time64[ns]")):
visit_pyarrow(pyarrow_type, _ConvertToIceberg())
I think we can resolve this issue.
Some small improvements, but looks good in general 👍🏻 Thanks @JonasJ-ap for working on this!
@JonasJ-ap I think we're good. Thanks so much for this PR and the refactoring of the tests. Let's give @rdblue some time to hear if he has any final thoughts before merging.
python/pyiceberg/io/pyarrow.py
key_field = map_type.key_field
key_id, _ = _get_field_id_and_doc(key_field)
value_field = map_type.item_field
value_id, _ = _get_field_id_and_doc(value_field)
Why combine the lookup into a single method when doc is ignored most of the time?
I separated it into `_get_field_id` and `_get_field_doc`.
python/pyiceberg/io/pyarrow.py
    return TimeType()
elif pa.types.is_timestamp(primitive):
    primitive = cast(pa.TimestampType, primitive)
    if primitive.unit == "us" and primitive.tz == "UTC":
Should we also accept a zero offset, like `+00:00`?
@rdblue Thanks for your review and suggestions. I will try to update the PR by this Friday.
A few nits, but looks good!
python/pyiceberg/io/pyarrow.py
"Iceberg schema is not embedded into the Parquet file, see https://github.com/apache/iceberg/issues/6505"
)
file_schema = Schema.parse_raw(schema_raw)
# TODO: if field_ids are not present, Name Mapping should be implemented to look them up in the table schema
Should we create an issue for this, and still raise an exception?
I created the issue: #7451.
But I am not sure what the proper way to raise an exception in this case would be. Based on my understanding, name mapping is also needed when only a portion of the Parquet fields are missing field IDs. However, in that case, `pyarrow_to_schema` can still generate a valid Iceberg schema for the rest of the Parquet fields, so it seems we should not raise an exception.
Should we only raise an exception when no field ID exists in the whole data file? I think we can also log warning messages when a PyArrow field does not contain a field ID. What do you think?
Thanks, @JonasJ-ap! Great to have this improvement in.
* Python: Add support for ORC. Creates fragments based on the FileFormat. Blocked by: apache/iceberg#6997
* Revert
* TableScan add limit
* pyarrow limit number of rows fetched from files if limit is set
* add tests for scan limit
* python ci rebuild container if changes on python/dev/
* remove support for ORC
* remove unused imports
* increase sleep before running tests
* update python docs to include limit in table query
* docs fix format

Co-authored-by: Fokko Driesprong <[email protected]>
Co-authored-by: Daniel Rückert García <[email protected]>
Problem Addressed:
This PR fixes #6505, #6647, and #7457.
It adds support for inferring the Iceberg schema from the Parquet schema when the Parquet file does not contain metadata holding the encoded Iceberg schema.
Tests:
Working on unit tests.
Sample on AWS Athena (reproducing and fixing the bug by following the procedure in #6505):
OPTIMIZE type_test_ref_unpartitioned REWRITE DATA USING BIN_PACK;
On current master branch:
On this PR:
This indicates that the table can now be read normally.
`.pyiceberg.yaml`: