
Issue reading iceberg tables written by Athena with pyiceberg #6647

Closed

nicor88 opened this issue Jan 23, 2023 · 7 comments

Comments

@nicor88

nicor88 commented Jan 23, 2023

Apache Iceberg version

None

Query engine

Athena v3

Please describe the bug 🐞

I'm trying to read an Iceberg table written by Athena (engine v3); I'm not sure which Iceberg version it uses.

When running this code:

from pyiceberg import catalog
from pyiceberg.expressions import GreaterThanOrEqual

glue_catalog = catalog.load_glue(name='glue', conf={})

glue_catalog.list_namespaces()
glue_catalog.list_tables('data_engineering')

table = glue_catalog.load_table("data_engineering.iceberg_example_1")
scan = table.scan()

# Listing the data files works fine.
files = [task.file.file_path for task in scan.plan_files()]
print(files)

# This is the call that fails.
df_iceberg = scan.to_pandas()
print(len(df_iceberg))

It fails on df_iceberg = scan.to_pandas() (I also tried scan.to_arrow()).
I'm able to list all the files belonging to the table, so files = [task.file.file_path for task in scan.plan_files()] works.

The error is the following:

Traceback (most recent call last):
  File "/Users/nicor88/deng-swiss-knife/icerberg/get_data.py", line 31, in <module>
    df_iceberg = scan.to_arrow()
  File "/Users/nicor88/deng-swiss-knife/venv/lib/python3.9/site-packages/pyiceberg/table/__init__.py", line 341, in to_arrow
    return project_table(
  File "/Users/nicor88/deng-swiss-knife/venv/lib/python3.9/site-packages/pyiceberg/io/pyarrow.py", line 508, in project_table
    schema_raw = parquet_schema.metadata.get(ICEBERG_SCHEMA)
AttributeError: 'NoneType' object has no attribute 'get'
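
For anyone reproducing this, a quick way to confirm the missing metadata is to read the footer schema of one of the planned files directly with pyarrow. A minimal diagnostic sketch; the path is a placeholder, and S3FileSystem is assumed to pick up the default AWS credential chain:

import pyarrow.parquet as pq
from pyarrow.fs import S3FileSystem

# Read only the footer schema of a single data file returned by plan_files().
fs = S3FileSystem()  # uses the default AWS credential chain
with fs.open_input_file("my-bucket/iceberg_1/data/some-file.parquet") as f:  # placeholder path
    parquet_schema = pq.read_schema(f)

# Files written by Athena v3 can come back with no key/value metadata at all,
# which is why metadata.get(...) fails with AttributeError on NoneType.
print(parquet_schema.metadata)  # None for the failing files
if parquet_schema.metadata:
    print(b"iceberg.schema" in parquet_schema.metadata)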

An example table can be created like this:

create table
    data_engineering.iceberg_example_1
  with (
    table_type='iceberg',
    is_external=false,
    location='s3://xxxx/iceberg_1',
    partitioning=ARRAY['creation_date', 'bucket(user_id, 5)'],
    format='parquet',
    vacuum_max_snapshot_age_seconds=86400,
    optimize_rewrite_delete_file_threshold=2
  )
  as
with data as (
    select
        1 as user_id,
        'pi' as user_name,
        'active' as status,
        17.89 as cost,
        1 as quantity,
        100000000 as quantity_big,
        cast(cast(from_unixtime(to_unixtime(now())) as timestamp(6)) as date) as creation_date,
        cast(from_unixtime(to_unixtime(now())) as timestamp(6)) as created_at,
        cast(from_unixtime(to_unixtime(now())) as timestamp(6)) as updated_at
    union all
    select
        2 as user_id,
        'beta' as user_name,
        'inactive' as status,
        3 as cost,
        5 as quantity,
        100000000 as quantity_big,
        cast(cast(from_unixtime(to_unixtime(now())) as timestamp(6)) as date) as creation_date,
        cast(from_unixtime(to_unixtime(now())) as timestamp(6)) as created_at,
        cast(from_unixtime(to_unixtime(now())) as timestamp(6)) as updated_at
)

select
    user_id,
    user_name,
    status,
    cost,
    quantity,
    quantity_big,
    creation_date,
    created_at,
    cast(from_unixtime(to_unixtime(now())) as timestamp(6)) as inserted_at
from data
@nastra nastra added the python label Jan 23, 2023
@amogh-jahagirdar
Contributor

I need to investigate more deeply, but based on the stack trace it seems like the metadata field on the parquet_schema is not even defined. Here's where pyarrow returns the schema: https://github.com/apache/arrow/blob/master/python/pyarrow/parquet/core.py#L3656

I'm not entirely sure what this metadata field is in parquet_schema, or whether every Parquet writer is expected to write it out.

I'll double-check the Parquet spec, but if I had to hazard a guess, I would say it's probably not required to be written out, and we need to perform the same check done here https://github.com/apache/iceberg/blob/master/python/pyiceberg/io/pyarrow.py#L509
and raise an error linking to #6505, the known issue tracking that PyIceberg should derive the Iceberg schema from the actual Parquet schema.
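
Something along these lines, sketched from the stack trace above (a hypothetical guard; the names mirror the project_table code path in pyiceberg/io/pyarrow.py, but this is not the actual patch):

ICEBERG_SCHEMA = b"iceberg.schema"

# Hypothetical guard: the Parquet spec allows the footer key/value metadata
# to be absent entirely, so fail with an actionable message instead of
# AttributeError: 'NoneType' object has no attribute 'get'.
metadata = parquet_schema.metadata
if metadata is None or ICEBERG_SCHEMA not in metadata:
    raise ValueError(
        "Parquet file is missing the 'iceberg.schema' footer metadata; "
        "deriving the Iceberg schema from the Parquet schema is tracked in #6505"
    )
schema_raw = metadata[ICEBERG_SCHEMA]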

@amogh-jahagirdar
Contributor

amogh-jahagirdar commented Jan 24, 2023

https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1022

Yeah, the file key/value metadata (which, after stepping through the code, is the metadata that is None here) is not required to be written, per the Parquet spec.

Ultimately, the proper solution still seems to be: if the k/v metadata is not defined, or the iceberg schema property in the k/v metadata is not defined, we derive the schema (#6505). @Fokko @JonasJ-ap

That change may be complex, so in the interim we may still want to raise a PR that surfaces the existing failure when the metadata is not defined. I'll raise that, and we can close it if we decide to just wait for the proper solution in the next release. I think it'll be helpful for users to at least see that it's a known issue until it's fixed.

@rdblue
Contributor

rdblue commented Feb 17, 2023

@Fokko, the solution here is to convert the Parquet schema to Iceberg and preserve the IDs, right?

@Fokko
Contributor

Fokko commented Feb 19, 2023

@rdblue that's correct.

I took a shortcut and got the schema from the metadata. That should still be the preferred path because it is probably faster (Pydantic just going through the JSON document, without any GIL contention).
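
For reference, the shortcut looks roughly like this (a sketch; the file path is a placeholder, and Schema.parse_raw assumes the Pydantic v1 API pyiceberg used at the time):

import pyarrow.parquet as pq
from pyiceberg.schema import Schema

ICEBERG_SCHEMA = b"iceberg.schema"

# When the writer embeds the Iceberg schema as JSON under "iceberg.schema",
# decoding it is a single Pydantic call, with no Parquet-to-Iceberg mapping.
parquet_schema = pq.read_schema("some-data-file.parquet")  # placeholder path
schema_raw = parquet_schema.metadata.get(ICEBERG_SCHEMA)
iceberg_schema = Schema.parse_raw(schema_raw)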

However, the fields in Parquet also have metadata, and those contain the field_id:

$ python3
Python 3.10.10 (main, Feb  8 2023, 05:34:50) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyarrow.fs import LocalFileSystem
>>> import pyarrow.parquet as pq
>>> 
>>> fs = LocalFileSystem()
>>> 
>>> path = "/Users/fokkodriesprong/Desktop/00003-4-5e18fe8b-532b-42ec-bdb9-0c7ba7078155-00001.parquet"
>>> 
>>> with fs.open_input_file(path) as fout:
...     parquet_schema = pq.read_schema(fout)
... 
>>> 
>>> parquet_schema
VendorID: int64
  -- field metadata --
  PARQUET:field_id: '1'
tpep_pickup_datetime: timestamp[us, tz=UTC]
  -- field metadata --
  PARQUET:field_id: '2'
tpep_dropoff_datetime: timestamp[us, tz=UTC]
  -- field metadata --
  PARQUET:field_id: '3'
passenger_count: double
  -- field metadata --
  PARQUET:field_id: '4'
trip_distance: double
  -- field metadata --
  PARQUET:field_id: '5'
RatecodeID: double
  -- field metadata --
  PARQUET:field_id: '6'
store_and_fwd_flag: string
  -- field metadata --
  PARQUET:field_id: '7'
PULocationID: int64
  -- field metadata --
  PARQUET:field_id: '8'
DOLocationID: int64
  -- field metadata --
  PARQUET:field_id: '9'
payment_type: int64
  -- field metadata --
  PARQUET:field_id: '10'
fare_amount: double
  -- field metadata --
  PARQUET:field_id: '11'
extra: double
  -- field metadata --
  PARQUET:field_id: '12'
mta_tax: double
  -- field metadata --
  PARQUET:field_id: '13'
tip_amount: double
  -- field metadata --
  PARQUET:field_id: '14'
tolls_amount: double
  -- field metadata --
  PARQUET:field_id: '15'
improvement_surcharge: double
  -- field metadata --
  PARQUET:field_id: '16'
total_amount: double
  -- field metadata --
  PARQUET:field_id: '17'
congestion_surcharge: double
  -- field metadata --
  PARQUET:field_id: '18'
airport_fee: double
  -- field metadata --
  PARQUET:field_id: '19'
-- schema metadata --
iceberg.schema: '{"type":"struct","schema-id":0,"fields":[{"id":1,"name":' + 1236

@amogh-jahagirdar The idea is to implement a visitor to convert this Arrow schema to an Iceberg one with the correct IDs.
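
A rough sketch of that idea, restricted to flat schemas and a handful of primitive types (the real change in #6505 is a proper visitor that also handles nested structs, lists, and maps; arrow_to_iceberg and the type table here are illustrative, not pyiceberg's implementation):

import pyarrow as pa
from pyiceberg.schema import Schema
from pyiceberg.types import (
    DoubleType,
    LongType,
    NestedField,
    StringType,
    TimestamptzType,
)

# Minimal Arrow-to-Iceberg primitive mapping, just enough for this sketch.
_PRIMITIVES = {
    pa.int64(): LongType(),
    pa.float64(): DoubleType(),
    pa.string(): StringType(),
    pa.timestamp("us", tz="UTC"): TimestamptzType(),
}

def arrow_to_iceberg(arrow_schema: pa.Schema) -> Schema:
    fields = []
    for field in arrow_schema:
        # Writers that assign Iceberg field IDs store them in the
        # per-field metadata under the "PARQUET:field_id" key.
        field_id = int(field.metadata[b"PARQUET:field_id"])
        fields.append(
            NestedField(
                field_id=field_id,
                name=field.name,
                field_type=_PRIMITIVES[field.type],
                required=not field.nullable,
            )
        )
    return Schema(*fields)

On the example file above, arrow_to_iceberg(parquet_schema) would rebuild fields 1-19 with the same IDs that the iceberg.schema footer reports.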

@sheinbergon

sheinbergon commented Apr 13, 2023

@rdblue @Fokko just wondering: when is this feature/fix due? I see work in #6505 has already taken place, but I'm not sure where things stand. I care greatly about the ability to read externally generated Iceberg tables natively via Python. Is there any way I can contribute or help make 0.4.0 happen, with the fix for this issue included?

@Fokko
Contributor

Fokko commented Apr 28, 2023

@sheinbergon We're wrapping up the release and hope to start the release process soon. I would expect it within 1-2 weeks.

@rdblue
Contributor

rdblue commented May 2, 2023

@sheinbergon, we just merged #6505, which should solve this.

I'm closing this.

@rdblue rdblue closed this as completed May 2, 2023