
PyIceberg doesn't support tables compacted with AWS Athena #7457

Closed
mikulskibartosz opened this issue Apr 28, 2023 · 2 comments

@mikulskibartosz

Apache Iceberg version

1.1.0

Query engine

Athena

Please describe the bug 🐞

It's not possible to read an Iceberg table with PyIceberg if the data was written using PySpark and compacted with AWS Athena.

Steps to reproduce

  1. Create an Iceberg table:
CREATE TABLE IF NOT EXISTS table_name
        (columns ...)
        USING ICEBERG
        PARTITIONED BY (date)
  2. Write to the table using PySpark:
spark_df = self.spark_session.createDataFrame(df)
spark_df.sort(date_column).writeTo(table_name).append()
  3. Read the table using PyIceberg:
catalog = load_glue("default", {})
table = catalog.load_table('...')

scan = table.scan(
    row_filter=EqualTo("date", date_as_string),
)
result = scan.to_arrow()

The result variable contains the correct data.

  4. Compact the table files using the OPTIMIZE instruction in AWS Athena (https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg-data-optimization.html):
OPTIMIZE table_name REWRITE DATA USING BIN_PACK WHERE date = 'date_as_string'
  5. Optionally, VACUUM the table; this does not change the behavior in any way.

  6. Query the table using the same PyIceberg code as in step 3.

  7. to_arrow raises an exception: ValueError: Iceberg schema is not embedded into the Parquet file, see https://github.com/apache/iceberg/issues/6505

  8. The table can still be accessed correctly in AWS Athena.

Expected behavior

In step 7, the code should work correctly and return the same results as the code in step 3.

Dependency versions

Writing data (step 2)

  • pyarrow: 11.0.0
  • pyspark: 3.3.1
  • iceberg-spark-runtime-3.3_2.12-1.1.0.jar

Reading data (steps 3 and 7):

pyiceberg.__version__
'0.3.0'

pyarrow.__version__
'10.0.1'
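The versions above were collected manually; a small helper like the following (hypothetical, not part of PyIceberg, using only the standard library) prints them in one go:

```python
# Hypothetical helper for collecting the dependency versions above;
# works for any installed distribution name.
from importlib import metadata


def pkg_version(name: str) -> str:
    """Installed version of a distribution, or 'not installed'."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


for pkg in ("pyiceberg", "pyarrow", "pyspark"):
    print(pkg, pkg_version(pkg))
```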

Fokko commented Apr 28, 2023

Thanks @mikulskibartosz for reporting this. Kudos for the comprehensive issue. This is a known issue that we're working on; it will be fixed in the next release: #6647


rdblue commented May 2, 2023

Just merged #6505, which should address this.

rdblue closed this as completed May 2, 2023