write UUID fail on _check_schema_compatible #855

raphaelauv · 2024-06-25T11:49:33Z

Apache Iceberg version

main (development)

Please describe the bug 🐞

I can't write a UUID in an iceberg table

from pyiceberg.catalog.rest import RestCatalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, UUIDType
import polars as pl
import uuid

catalog = RestCatalog(
    "default",
    **{
        "uri": "http://localhost:8181",
        "warehouse": "s3://test-bucket/",
        "s3.endpoint": "http://localhost:9020",
    },
)

catalog.create_namespace("default")
id_to_write = uuid.uuid4()

iceberg_schema = Schema(
    NestedField(1, "id", UUIDType(), required=True),
)
catalog.create_table(
    "default.aaa",
    schema=iceberg_schema,
)
df = pl.DataFrame({}).with_columns([pl.lit(id_to_write.bytes).alias("id")])

df = df.to_arrow()

df = df.cast(target_schema=iceberg_schema.as_arrow())

table = catalog.load_table("default.aaa")
table.append(df)

But if I comment out the call to _check_schema_compatible, the write to the table succeeds:

iceberg-python/pyiceberg/table/__init__.py

Line 485 in a6cd0cf

_check_schema_compatible(self._table.schema(), other_schema=df.schema)

and I can then read the data with Trino.

kevinjqliu · 2024-06-25T19:12:22Z

thanks for reporting this issue!

The _check_schema_compatible check is currently stricter than it should be. #829 relaxes it.
Would #829 fix your issue above?

raphaelauv · 2024-06-26T07:13:50Z

hey @kevinjqliu
I tried your PR; it does not fix the UUID insert.

kevinjqliu · 2024-06-26T16:01:14Z

I see, I also verified that _check_schema_compatible errors.

Here's an example to repro:

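import uuid

from pyiceberg.schema import Schema
# Assumed import location at the time of this issue (pyiceberg/table/__init__.py)
from pyiceberg.table import _check_schema_compatible
from pyiceberg.types import NestedField, UUIDType


def test_schema_uuid() -> None:
    import polars as pl

    iceberg_schema = Schema(
        NestedField(1, "id", UUIDType(), required=True),
    )

    # Build a one-row Arrow table whose "id" column holds the raw 16 UUID bytes
    id_to_write = uuid.uuid4()
    df = pl.DataFrame({}).with_columns([pl.lit(id_to_write.bytes).alias("id")])
    df = df.to_arrow()
    df = df.cast(target_schema=iceberg_schema.as_arrow())

    # Raises, even though the Arrow schema was derived from the Iceberg schema
    _check_schema_compatible(iceberg_schema, df.schema)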

Looks like @Fokko opened an issue regarding UUID for arrow
apache/arrow#15058

@Fokko can you chime in here on writing UUID data type?

Fokko · 2024-06-27T07:57:31Z

Thanks for pinging me here. So there is some progress on the Arrow side. There has been a vote to adopt the UUID type, and it has been added to the format.

Thanks for the example code @kevinjqliu:

And I would say that they are equivalent. So if we know that the field in the Iceberg table is a UUID, just writing a Fixed[16] is okay and should pass the compatibility check.
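
To make the equivalence concrete, here is a minimal sketch using only the Python standard library (it is not PyIceberg or Arrow code). A UUID is 128 bits, so its bytes form is exactly the 16-byte value a Fixed[16] column would hold, and the conversion round-trips without loss. The helper names uuid_to_fixed16 and fixed16_to_uuid are hypothetical, chosen only for illustration.

```python
import uuid

FIXED_UUID_WIDTH = 16  # a UUID is 128 bits, i.e. 16 bytes


def uuid_to_fixed16(value: uuid.UUID) -> bytes:
    """Return the 16-byte big-endian representation of a UUID (hypothetical helper)."""
    return value.bytes


def fixed16_to_uuid(raw: bytes) -> uuid.UUID:
    """Rebuild a UUID from a 16-byte value; reject any other width (hypothetical helper)."""
    if len(raw) != FIXED_UUID_WIDTH:
        raise ValueError(f"expected {FIXED_UUID_WIDTH} bytes, got {len(raw)}")
    return uuid.UUID(bytes=raw)


original = uuid.uuid4()
raw = uuid_to_fixed16(original)   # what a Fixed[16] column would store
restored = fixed16_to_uuid(raw)   # what a UUID-aware reader would see
```

Because the mapping is lossless in both directions, a writer that knows the target column is a UUID can accept a 16-byte fixed-width value without ambiguity.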

sungwy · 2024-07-16T02:09:16Z

Hi @raphaelauv thank you for raising this issue. #921 should fix this issue. Would you like to give it a try?

I've added a test including UUIDType to demonstrate that the fix will work

kevinjqliu · 2024-07-17T17:02:41Z

Fixed in #921

Specifically, tested here
https://github.com/apache/iceberg-python/pull/921/files#diff-e52e4ddd58b7ef887ab03c04116e676f6280b824ab7469d5d3080e5cba4f2128R2535

https://github.com/apache/iceberg-python/pull/921/files#diff-7f3dd1244d08ce27c003cd091da10aa049f7bb0c7d5397acb4ec69767036accdR1044

This was referenced Jun 25, 2024

Python: Fix UUID representation apache/iceberg#8248

Closed

fix: schema check of iceberg logical types #856

Closed

HonahX added this to the PyIceberg 0.7.0 release milestone Jul 10, 2024

HonahX mentioned this issue Jul 12, 2024

Allow writing pa.Table that are either a subset of table schema or in arbitrary order, and support type promotion on write #921

Merged

kevinjqliu closed this as completed Jul 17, 2024
