
PyIceberg doesn't support tables compacted with AWS Athena #7457

Closed
mikulskibartosz opened this issue Apr 28, 2023 · 2 comments

@mikulskibartosz

Apache Iceberg version

1.1.0

Query engine

Athena

Please describe the bug 🐞

It's not possible to read an Iceberg table with PyIceberg if the data was written using PySpark and compacted with AWS Athena.

Steps to reproduce

  1. Create an Iceberg table:
CREATE TABLE IF NOT EXISTS table_name
        (columns ...)
        USING ICEBERG
        PARTITIONED BY (date)
  2. Write to the table using PySpark:
spark_df = self.spark_session.createDataFrame(df)
spark_df.sort(date_column).writeTo(table_name).append()
  3. Read the table using PyIceberg:
catalog = load_glue("default", {})
table = catalog.load_table('...')

scan = table.scan(
    row_filter=EqualTo("date", date_as_string),
)
result = scan.to_arrow()

The result variable contains the correct data.

  4. Compact the table files using the OPTIMIZE instruction in AWS Athena (https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg-data-optimization.html):
OPTIMIZE table_name REWRITE DATA USING BIN_PACK WHERE date = 'date_as_string'
  5. Optionally, VACUUM the table; this does not change the behavior in any way.

  6. Query the table using the same PyIceberg code as in step 3.

  7. to_arrow raises an exception: ValueError: Iceberg schema is not embedded into the Parquet file, see https://github.com/apache/iceberg/issues/6505

  8. The table can still be accessed correctly in AWS Athena.

Expected behavior

In step 7, the code should work correctly and return the same results as the code in step 3.

Dependency versions

Writing data (step 2)

  • pyarrow: 11.0.0
  • pyspark: 3.3.1
  • iceberg-spark-runtime-3.3_2.12-1.1.0.jar

Reading data (steps 3 and 7):

pyiceberg.__version__
'0.3.0'

pyarrow.__version__
'10.0.1'
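The versions above were collected manually; a small helper like the following (hypothetical, not part of PyIceberg, using only the standard library) prints them in one go:

```python
# Hypothetical helper for collecting the dependency versions above;
# works for any installed distribution name.
from importlib import metadata


def pkg_version(name: str) -> str:
    """Installed version of a distribution, or 'not installed'."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


for pkg in ("pyiceberg", "pyarrow", "pyspark"):
    print(pkg, pkg_version(pkg))
```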

Fokko commented Apr 28, 2023

Thanks @mikulskibartosz for reporting this. Kudos for the comprehensive issue. This is a known issue that we're working on; it will be fixed in the next release: #6647


rdblue commented May 2, 2023

Just merged #6505, which should address this.

rdblue closed this as completed May 2, 2023