Python: Infer Iceberg schema from the Parquet file #6505
Comments
I'm interested in solving this issue. Would you mind assigning it to me? Thank you so much! |
@JonasJ-ap Anything I can help with? If you don't have time, maybe @amogh-jahagirdar is interested in picking this up. I'd love to get this in 0.4.0 |
Sorry, I haven't had enough time to work this out. @amogh-jahagirdar please feel free to pick this up if you are interested. |
Hello, I wanted to report that I've also observed this issue. Adding some details about how I got into this state in case it's helpful. I've created an Iceberg table via AWS Glue:

partition_column = 'id'
partition_bucket_size = 4
udf_name = 'iceberg_bucket_long_' + str(partition_bucket_size)
spark.sparkContext._jvm.org.apache.iceberg.spark.IcebergSpark.registerBucketUDF(
    spark._jsparkSession, udf_name, spark.sparkContext._jvm.org.apache.spark.sql.types.DataTypes.LongType, partition_bucket_size)
df = df.sortWithinPartitions(F.expr(f"{udf_name}({partition_column})"))
df = df.writeTo('my_iceberg_table') \
    .partitionedBy(F.bucket(partition_bucket_size, partition_column)) \
    .createOrReplace()

At this point I could read the table fine via Athena and PyIceberg. I then ran the following in Athena:

OPTIMIZE my_iceberg_table REWRITE DATA USING BIN_PACK

After this had completed successfully, I was still able to query the table from Athena, but no longer from PyIceberg, which fails with:

ValueError: Iceberg schema is not embedded into the Parquet file, see https://github.com/apache/iceberg/issues/6505

Let me know if there are any more details I can provide |
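To check whether a given data file still carries the embedded schema that PyIceberg looks for, the Parquet footer metadata can be inspected directly. Below is a minimal sketch (not from this thread) using pyarrow; the file name is hypothetical, and it assumes the Iceberg writer stores the schema under the "iceberg.schema" metadata key, which is what the error above reports as missing:

import pyarrow.parquet as pq

# Hypothetical local copy of one of the data files rewritten by OPTIMIZE ... BIN_PACK.
schema = pq.read_schema("part-00000.parquet")
metadata = schema.metadata or {}

# Iceberg writers embed the table schema under this key; files produced by other
# writers (or by rewrites that drop it) will not have it, hence the ValueError above.
print(b"iceberg.schema" in metadata)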
Just for further information, I'll add here a code snippet that leads to the same error message:

from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import EqualTo

# pyiceberg.yaml
# catalog:
#   default:
#     type: glue
#     py-io-impl: pyiceberg.io.pyarrow.PyArrowFileIO

catalog = load_catalog(
    "default",
    warehouse="...",
)
table = catalog.load_table(("...", "..."))
df = (
    table.scan()
    .filter(EqualTo("uuid", "..."))
    .select("rt", "cs1", "in")
    .to_arrow()
)
print(df) |
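For what it's worth, the same catalog can also be configured without a pyiceberg.yaml by passing the properties directly to load_catalog. A small sketch under that assumption, reusing the placeholder warehouse value from the snippet above:

from pyiceberg.catalog import load_catalog

# Equivalent to the pyiceberg.yaml shown above, expressed as inline properties.
catalog = load_catalog(
    "default",
    **{
        "type": "glue",
        "py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO",
        "warehouse": "...",
    },
)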
I created a draft PR #6997 containing a raw visitor to support inferring the Iceberg schema, and verified that the new feature solves the problem described above and in #6647. @amogh-jahagirdar Please let me know if you are working on this or still interested in picking this up. I am happy to take this issue back if you do not have enough time. |
Ciao @Fokko, maybe I'm facing a similar issue, but I'm a bit confused. The table in question derives from the open dataset of NY taxis.

import os
from pyiceberg.catalog import load_glue

catalog = load_glue(name='biglake', conf={})
table = catalog.load_table('biglake.taxi_dremio_by_month')
print(table.identifier)
print(table.metadata)
print(table.metadata_location)
con = table.scan().to_duckdb(table_name='taxi')
print(con.execute('SELECT COUNT(*) FROM taxi').fetchall())

This is the output:

And then it crashes:

I'm confused because the query is a simple COUNT(*). I've also tested PR #6997, but the Python operator crashed:
|
@bigluck Thanks for giving it a try.
Unfortunately, with the current DuckDB implementation, it pulls in all the (relevant) data. Since there is no filter on the scan, this means the entire table. How big is the table? Could it be that it runs out of memory? Running |
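One way to keep the DuckDB path from materializing the whole table is to push a filter into the scan itself, so that only matching rows are fetched into Arrow before the table is registered with DuckDB. A minimal sketch building on the snippet above; the column name vendor_id and the literal are assumptions, not taken from the actual table:

from pyiceberg.catalog import load_glue
from pyiceberg.expressions import EqualTo

catalog = load_glue(name='biglake', conf={})
table = catalog.load_table('biglake.taxi_dremio_by_month')

# Only rows matching the row filter are pulled into memory and handed to DuckDB.
con = table.scan(row_filter=EqualTo("vendor_id", 1)).to_duckdb(table_name='taxi')
print(con.execute('SELECT COUNT(*) FROM taxi').fetchall())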
Oh, I've got it, thanks @Fokko. The exit code is 137, OOM :) |
@Fokko @JonasJ-ap what's the status of dealing with this issue? How can I help to have a fix for this included in version 0.4.0? |
@sheinbergon The PR has been merged and will be part of the 0.4.0 release |
Feature Request / Improvement
In PyIceberg we rely on fetching the schema from the Parquet metadata. If this is not available (because the Parquet file was written by something other than an Iceberg writer), we want to go over the actual file schema and construct the Iceberg schema from it (see the sketch below).
Query engine
None
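As a rough illustration of the idea only (not the visitor that was eventually merged in PR #6997), such a fallback can walk the Arrow schema read from the Parquet file and map each field to an Iceberg type. The sketch below handles just a few flat primitive types and assigns field IDs sequentially, which is an assumption made for illustration; real field IDs would have to come from the table metadata or a name mapping:

import pyarrow as pa

from pyiceberg.schema import Schema
from pyiceberg.types import BooleanType, DoubleType, LongType, NestedField, StringType

# Minimal mapping of Arrow primitives to Iceberg types; nested and temporal types are omitted.
_PRIMITIVES = {
    pa.bool_(): BooleanType(),
    pa.int64(): LongType(),
    pa.float64(): DoubleType(),
    pa.string(): StringType(),
}

def infer_iceberg_schema(arrow_schema: pa.Schema) -> Schema:
    fields = []
    for field_id, field in enumerate(arrow_schema, start=1):
        fields.append(
            NestedField(
                field_id=field_id,  # sequential IDs: an assumption for this sketch only
                name=field.name,
                field_type=_PRIMITIVES[field.type],
                required=not field.nullable,
            )
        )
    return Schema(*fields)

print(infer_iceberg_schema(pa.schema([("id", pa.int64()), ("name", pa.string())])))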