
Issue reading iceberg tables written by Athena with pyiceberg #6647

Closed

nicor88 opened this issue Jan 23, 2023 · 7 comments

Comments

@nicor88

nicor88 commented Jan 23, 2023

Apache Iceberg version

None

Query engine

Athena v3

Please describe the bug 🐞

I'm trying to read an Iceberg table written by Athena (engine v3); I'm not sure which Iceberg version it uses.

When running this code:

from pyiceberg import catalog
from pyiceberg.expressions import GreaterThanOrEqual

glue_catalog = catalog.load_glue(name='glue', conf={})

glue_catalog.list_namespaces()
glue_catalog.list_tables('data_engineering')

table = glue_catalog.load_table("data_engineering.iceberg_example_1")
scan = table.scan()

# Listing the data files works fine.
files = [task.file.file_path for task in scan.plan_files()]
print(files)

# This is the call that fails.
df_iceberg = scan.to_pandas()
print(len(df_iceberg))

It fails on df_iceberg = scan.to_pandas() (I also tried scan.to_arrow()).
I'm able to list all the files belonging to the table, so files = [task.file.file_path for task in scan.plan_files()] works.

The error is the following:

Traceback (most recent call last):
  File "/Users/nicor88/deng-swiss-knife/icerberg/get_data.py", line 31, in <module>
    df_iceberg = scan.to_arrow()
  File "/Users/nicor88/deng-swiss-knife/venv/lib/python3.9/site-packages/pyiceberg/table/__init__.py", line 341, in to_arrow
    return project_table(
  File "/Users/nicor88/deng-swiss-knife/venv/lib/python3.9/site-packages/pyiceberg/io/pyarrow.py", line 508, in project_table
    schema_raw = parquet_schema.metadata.get(ICEBERG_SCHEMA)
AttributeError: 'NoneType' object has no attribute 'get'
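
For anyone reproducing this, a quick way to confirm the missing metadata is to read the footer schema of one of the planned files directly with pyarrow. A minimal diagnostic sketch; the path is a placeholder, and S3FileSystem is assumed to pick up the default AWS credential chain:

import pyarrow.parquet as pq
from pyarrow.fs import S3FileSystem

# Read only the footer schema of a single data file returned by plan_files().
fs = S3FileSystem()  # uses the default AWS credential chain
with fs.open_input_file("my-bucket/iceberg_1/data/some-file.parquet") as f:  # placeholder path
    parquet_schema = pq.read_schema(f)

# Files written by Athena v3 can come back with no key/value metadata at all,
# which is why metadata.get(...) fails with AttributeError on NoneType.
print(parquet_schema.metadata)  # None for the failing files
if parquet_schema.metadata:
    print(b"iceberg.schema" in parquet_schema.metadata)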

An example table can be created like this:

create table
    data_engineering.iceberg_example_1
  with (
    table_type='iceberg',
    is_external=false,
    location='s3://xxxx/iceberg_1',
    partitioning=ARRAY['creation_date', 'bucket(user_id, 5)'],
    format='parquet',
    vacuum_max_snapshot_age_seconds=86400,
    optimize_rewrite_delete_file_threshold=2
  )
  as
with data as (
    select
        1 as user_id,
        'pi' as user_name,
        'active' as status,
        17.89 as cost,
        1 as quantity,
        100000000 as quantity_big,
        cast(cast(from_unixtime(to_unixtime(now())) as timestamp(6)) as date) as creation_date,
        cast(from_unixtime(to_unixtime(now())) as timestamp(6)) as created_at,
        cast(from_unixtime(to_unixtime(now())) as timestamp(6)) as updated_at
    union all
    select
        2 as user_id,
        'beta' as user_name,
        'inactive' as status,
        3 as cost,
        5 as quantity,
        100000000 as quantity_big,
        cast(cast(from_unixtime(to_unixtime(now())) as timestamp(6)) as date) as creation_date,
        cast(from_unixtime(to_unixtime(now())) as timestamp(6)) as created_at,
        cast(from_unixtime(to_unixtime(now())) as timestamp(6)) as updated_at
)

select
    user_id,
    user_name,
    status,
    cost,
    quantity,
    quantity_big,
    creation_date,
    created_at,
    cast(from_unixtime(to_unixtime(now())) as timestamp(6)) as inserted_at
from data
@nastra nastra added the python label Jan 23, 2023
@amogh-jahagirdar
Contributor

I need to investigate more deeply, but based on the stack trace it seems like the metadata field on the parquet_schema is not even defined. Here's where pyarrow returns the schema: https://github.com/apache/arrow/blob/master/python/pyarrow/parquet/core.py#L3656

I'm not entirely sure what this metadata field is in parquet_schema, or whether every Parquet writer is expected to write it out.

I'll double-check the Parquet spec, but if I had to hazard a guess, I would say it's probably not required to be written out, and we need to perform the same check done here https://github.com/apache/iceberg/blob/master/python/pyiceberg/io/pyarrow.py#L509
and raise an error linking to #6505, the known issue tracking that PyIceberg should derive the Iceberg schema from the actual Parquet schema.
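
Something along these lines, sketched from the stack trace above (a hypothetical guard; the names mirror the project_table code path in pyiceberg/io/pyarrow.py, but this is not the actual patch):

ICEBERG_SCHEMA = b"iceberg.schema"

# Hypothetical guard: the Parquet spec allows the footer key/value metadata
# to be absent entirely, so fail with an actionable message instead of
# AttributeError: 'NoneType' object has no attribute 'get'.
metadata = parquet_schema.metadata
if metadata is None or ICEBERG_SCHEMA not in metadata:
    raise ValueError(
        "Parquet file is missing the 'iceberg.schema' footer metadata; "
        "deriving the Iceberg schema from the Parquet schema is tracked in #6505"
    )
schema_raw = metadata[ICEBERG_SCHEMA]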

@amogh-jahagirdar
Contributor

amogh-jahagirdar commented Jan 24, 2023

https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1022

Yeah, the file key/value metadata (which, after stepping through the code, is the metadata that is None here) is not required to be written, per the Parquet spec.

Ultimately, the proper solution still seems to be: if the k/v metadata is not defined, or the iceberg schema property in the k/v metadata is not defined, we derive the schema (#6505). @Fokko @JonasJ-ap

That change may be complex, so in the interim we may still want to raise a PR that surfaces the existing failure when the metadata is not defined. I'll raise that, and we can close it if we decide to just wait for the proper solution in the next release. I think it'll be helpful for users to at least see that it's a known issue until it's fixed.

@rdblue
Contributor

rdblue commented Feb 17, 2023

@Fokko, the solution here is to convert the Parquet schema to Iceberg and preserve the IDs, right?

@Fokko
Contributor

Fokko commented Feb 19, 2023

@rdblue that's correct.

I took a shortcut and got the schema from the metadata. That should still be the preferred path because it is probably faster (Pydantic just going through the JSON document, without any GIL contention).
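
For reference, the shortcut looks roughly like this (a sketch; the file path is a placeholder, and Schema.parse_raw assumes the Pydantic v1 API pyiceberg used at the time):

import pyarrow.parquet as pq
from pyiceberg.schema import Schema

ICEBERG_SCHEMA = b"iceberg.schema"

# When the writer embeds the Iceberg schema as JSON under "iceberg.schema",
# decoding it is a single Pydantic call, with no Parquet-to-Iceberg mapping.
parquet_schema = pq.read_schema("some-data-file.parquet")  # placeholder path
schema_raw = parquet_schema.metadata.get(ICEBERG_SCHEMA)
iceberg_schema = Schema.parse_raw(schema_raw)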

However, the fields in Parquet also have metadata, and those contain the field_id:

$ python3
Python 3.10.10 (main, Feb  8 2023, 05:34:50) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyarrow.fs import LocalFileSystem
>>> import pyarrow.parquet as pq
>>> 
>>> fs = LocalFileSystem()
>>> 
>>> path = "/Users/fokkodriesprong/Desktop/00003-4-5e18fe8b-532b-42ec-bdb9-0c7ba7078155-00001.parquet"
>>> 
>>> with fs.open_input_file(path) as fout:
...     parquet_schema = pq.read_schema(fout)
... 
>>> 
>>> parquet_schema
VendorID: int64
  -- field metadata --
  PARQUET:field_id: '1'
tpep_pickup_datetime: timestamp[us, tz=UTC]
  -- field metadata --
  PARQUET:field_id: '2'
tpep_dropoff_datetime: timestamp[us, tz=UTC]
  -- field metadata --
  PARQUET:field_id: '3'
passenger_count: double
  -- field metadata --
  PARQUET:field_id: '4'
trip_distance: double
  -- field metadata --
  PARQUET:field_id: '5'
RatecodeID: double
  -- field metadata --
  PARQUET:field_id: '6'
store_and_fwd_flag: string
  -- field metadata --
  PARQUET:field_id: '7'
PULocationID: int64
  -- field metadata --
  PARQUET:field_id: '8'
DOLocationID: int64
  -- field metadata --
  PARQUET:field_id: '9'
payment_type: int64
  -- field metadata --
  PARQUET:field_id: '10'
fare_amount: double
  -- field metadata --
  PARQUET:field_id: '11'
extra: double
  -- field metadata --
  PARQUET:field_id: '12'
mta_tax: double
  -- field metadata --
  PARQUET:field_id: '13'
tip_amount: double
  -- field metadata --
  PARQUET:field_id: '14'
tolls_amount: double
  -- field metadata --
  PARQUET:field_id: '15'
improvement_surcharge: double
  -- field metadata --
  PARQUET:field_id: '16'
total_amount: double
  -- field metadata --
  PARQUET:field_id: '17'
congestion_surcharge: double
  -- field metadata --
  PARQUET:field_id: '18'
airport_fee: double
  -- field metadata --
  PARQUET:field_id: '19'
-- schema metadata --
iceberg.schema: '{"type":"struct","schema-id":0,"fields":[{"id":1,"name":' + 1236

@amogh-jahagirdar The idea is to implement a visitor to convert this Arrow schema to an Iceberg one with the correct IDs.
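
A rough sketch of that idea, restricted to flat schemas and a handful of primitive types (the real change in #6505 is a proper visitor that also handles nested structs, lists, and maps; arrow_to_iceberg and the type table here are illustrative, not pyiceberg's implementation):

import pyarrow as pa
from pyiceberg.schema import Schema
from pyiceberg.types import (
    DoubleType,
    LongType,
    NestedField,
    StringType,
    TimestamptzType,
)

# Minimal Arrow-to-Iceberg primitive mapping, just enough for this sketch.
_PRIMITIVES = {
    pa.int64(): LongType(),
    pa.float64(): DoubleType(),
    pa.string(): StringType(),
    pa.timestamp("us", tz="UTC"): TimestamptzType(),
}

def arrow_to_iceberg(arrow_schema: pa.Schema) -> Schema:
    fields = []
    for field in arrow_schema:
        # Writers that assign Iceberg field IDs store them in the
        # per-field metadata under the "PARQUET:field_id" key.
        field_id = int(field.metadata[b"PARQUET:field_id"])
        fields.append(
            NestedField(
                field_id=field_id,
                name=field.name,
                field_type=_PRIMITIVES[field.type],
                required=not field.nullable,
            )
        )
    return Schema(*fields)

On the example file above, arrow_to_iceberg(parquet_schema) would rebuild fields 1-19 with the same IDs that the iceberg.schema footer reports.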

@sheinbergon

sheinbergon commented Apr 13, 2023

@rdblue @Fokko just wondering: when is this feature/fix due? I see work in #6505 has already taken place, but I'm not sure where things stand. I care greatly about the ability to read externally generated Iceberg tables natively via Python. Is there any way I can contribute or help make 0.4.0 happen, with the fix for this issue included?

@Fokko
Contributor

Fokko commented Apr 28, 2023

@sheinbergon We're wrapping up the release and hope to start the release process soon. I would expect it within 1-2 weeks.

@rdblue
Contributor

rdblue commented May 2, 2023

@sheinbergon, we just merged #6505, which should solve this.

I'm closing this.

@rdblue rdblue closed this as completed May 2, 2023