write UUID fail on _check_schema_compatible #855

Closed
raphaelauv opened this issue Jun 25, 2024 · 6 comments

@raphaelauv

Apache Iceberg version

main (development)

Please describe the bug 🐞

I can't write a UUID to an Iceberg table:

from pyiceberg.catalog.rest import RestCatalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, UUIDType
import polars as pl
import uuid

catalog = RestCatalog(
    "default",
    **{
        "uri": "http://localhost:8181",
        "warehouse": "s3://test-bucket/",
        "s3.endpoint": "http://localhost:9020",
    },
)

catalog.create_namespace("default")
id_to_write = uuid.uuid4()

iceberg_schema = Schema(
    NestedField(1, "id", UUIDType(), required=True),
)
catalog.create_table(
    "default.aaa",
    schema=iceberg_schema,
)
df = pl.DataFrame({}).with_columns([pl.lit(id_to_write.bytes).alias("id")])

df = df.to_arrow()

df = df.cast(target_schema=iceberg_schema.as_arrow())

table = catalog.load_table("default.aaa")
table.append(df)

But if I comment out the call to _check_schema_compatible, the write to the table succeeds:

_check_schema_compatible(self._table.schema(), other_schema=df.schema)

and I can read the data with Trino.

@kevinjqliu
Contributor

Thanks for reporting this issue!

_check_schema_compatible is currently stricter than it should be; #829 relaxes the check.
Would #829 fix your issue above?

@raphaelauv
Author

Hey @kevinjqliu,
I tried your PR; it does not fix the UUID insert.

@kevinjqliu
Contributor

kevinjqliu commented Jun 26, 2024

I see. I also verified that _check_schema_compatible raises an error.

Here's an example to reproduce it:

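import uuid

from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, UUIDType

# Private helper; assumed to live in pyiceberg.table as of this issue.
# The import path may differ in other versions.
from pyiceberg.table import _check_schema_compatible


def test_schema_uuid() -> None:
    import polars as pl

    iceberg_schema = Schema(
        NestedField(1, "id", UUIDType(), required=True),
    )

    # Build a one-row Arrow table holding the UUID as 16 raw bytes.
    id_to_write = uuid.uuid4()
    df = pl.DataFrame({}).with_columns([pl.lit(id_to_write.bytes).alias("id")])
    df = df.to_arrow()
    df = df.cast(target_schema=iceberg_schema.as_arrow())

    # Raises, even though the data was cast to the table's own Arrow schema.
    _check_schema_compatible(iceberg_schema, df.schema)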
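
For anyone following along, the repro passes id_to_write.bytes because a UUID is 128 bits, so its raw form is exactly 16 bytes. A small stdlib-only sketch of that round trip (the helper names uuid_to_fixed16 and fixed16_to_uuid are made up for illustration and are not part of PyIceberg or Arrow):

```python
import uuid


def uuid_to_fixed16(value: uuid.UUID) -> bytes:
    """Return the 16-byte big-endian representation of a UUID."""
    return value.bytes


def fixed16_to_uuid(raw: bytes) -> uuid.UUID:
    """Rebuild a UUID from 16 raw bytes; uuid.UUID rejects any other length."""
    return uuid.UUID(bytes=raw)


id_to_write = uuid.uuid4()
raw = uuid_to_fixed16(id_to_write)

# The raw form always has a fixed width of 16 bytes and loses no information.
assert len(raw) == 16
assert fixed16_to_uuid(raw) == id_to_write
```

This only illustrates the byte layout; it says nothing about how the schema check itself compares types.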

It looks like @Fokko opened an issue about UUID support in Arrow:
apache/arrow#15058

@Fokko, can you chime in here on writing the UUID data type?

@Fokko
Contributor

Fokko commented Jun 27, 2024

Thanks for pinging me here. So there is some progress on the Arrow side. There has been a vote to adopt the UUID type, and it has been added to the format.

Thanks for the example code, @kevinjqliu.

I would say that the two types are equivalent: if we know that the field in the Iceberg table is a UUID, writing a Fixed[16] is okay and should pass the compatibility check.
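
To make that concrete, here is a rough sketch of a compatibility check that accepts Fixed[16] data for a UUID field. It works on plain type-name strings, and every name in it (EQUIVALENT_TYPES, types_compatible, check_schema_compatible) is hypothetical; it is not how PyIceberg implements the check, only an illustration of the rule.

```python
# Hypothetical (table type, data type) pairs that share a physical layout.
EQUIVALENT_TYPES = {("uuid", "fixed[16]")}


def types_compatible(table_type: str, data_type: str) -> bool:
    """Return True if data of data_type may be written to a table_type field."""
    if table_type == data_type:
        return True
    return (table_type, data_type) in EQUIVALENT_TYPES


def check_schema_compatible(table_schema: dict, data_schema: dict) -> None:
    """Raise ValueError listing every field whose type cannot be written."""
    mismatches = []
    for name, table_type in table_schema.items():
        data_type = data_schema.get(name)
        if data_type is None:
            mismatches.append(f"{name}: missing from data")
        elif not types_compatible(table_type, data_type):
            mismatches.append(f"{name}: table has {table_type}, data has {data_type}")
    if mismatches:
        raise ValueError("Mismatch in fields:\n" + "\n".join(mismatches))


# A Fixed[16] column is accepted for a UUID field; no exception is raised.
check_schema_compatible({"id": "uuid"}, {"id": "fixed[16]"})
```

The equivalence is deliberately one-directional in this sketch: a table field declared as fixed[16] would not accept data typed as uuid unless that pair were added too.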

@sungwy
Collaborator

sungwy commented Jul 16, 2024

Hi @raphaelauv, thank you for raising this issue. #921 should fix it. Would you like to give it a try?

I've added a test that includes UUIDType to demonstrate that the fix works.
