Issue reading iceberg tables written by Athena with pyiceberg #6647
Need to investigate more deeply, but based on the stack trace it seems like the metadata field on the parquet_schema is not even defined. Here's where pyarrow returns the schema: https://github.com/apache/arrow/blob/master/python/pyarrow/parquet/core.py#L3656 I'm not entirely sure what this metadata field in parquet_schema is, or whether every Parquet writer is expected to write it out. I'll double-check the Parquet spec, but if I had to hazard a guess, I would say it's probably not required, and we need to perform the same check done here: https://github.com/apache/iceberg/blob/master/python/pyiceberg/io/pyarrow.py#L509
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1022 Yeah, the file key/value metadata (which, after stepping through the code, is the metadata that is None here) is not required to be written according to the Parquet spec. Ultimately, the proper solution still seems to be: if the k/v metadata is not defined, or the iceberg schema property in the k/v metadata is not defined, we derive the schema (#6505). @Fokko @JonasJ-ap That change may be complex, so in the interim maybe we still want to raise a PR which surfaces the existing failure when the metadata is not defined? I'll raise that, and we can close it if we decide to just wait for the proper solution in the next release. I think it'll be helpful for users to at least see it's a known issue until it's fixed.
@Fokko, the solution here is to convert the Parquet schema to Iceberg and preserve the IDs, right?
@rdblue that's correct. I took a shortcut and read the schema from the key/value metadata. That should still be the preferred way because it is probably faster (Pydantic without any GIL, just going through the JSON document). However, the fields in Parquet also have metadata, and those contain the field IDs:

```
➜ Desktop python3
Python 3.10.10 (main, Feb 8 2023, 05:34:50) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyarrow.fs import LocalFileSystem
>>> import pyarrow.parquet as pq
>>>
>>> fs = LocalFileSystem()
>>>
>>> path = "/Users/fokkodriesprong/Desktop/00003-4-5e18fe8b-532b-42ec-bdb9-0c7ba7078155-00001.parquet"
>>>
>>> with fs.open_input_file(path) as fout:
...     parquet_schema = pq.read_schema(fout)
...
>>>
>>> parquet_schema
VendorID: int64
  -- field metadata --
  PARQUET:field_id: '1'
tpep_pickup_datetime: timestamp[us, tz=UTC]
  -- field metadata --
  PARQUET:field_id: '2'
tpep_dropoff_datetime: timestamp[us, tz=UTC]
  -- field metadata --
  PARQUET:field_id: '3'
passenger_count: double
  -- field metadata --
  PARQUET:field_id: '4'
trip_distance: double
  -- field metadata --
  PARQUET:field_id: '5'
RatecodeID: double
  -- field metadata --
  PARQUET:field_id: '6'
store_and_fwd_flag: string
  -- field metadata --
  PARQUET:field_id: '7'
PULocationID: int64
  -- field metadata --
  PARQUET:field_id: '8'
DOLocationID: int64
  -- field metadata --
  PARQUET:field_id: '9'
payment_type: int64
  -- field metadata --
  PARQUET:field_id: '10'
fare_amount: double
  -- field metadata --
  PARQUET:field_id: '11'
extra: double
  -- field metadata --
  PARQUET:field_id: '12'
mta_tax: double
  -- field metadata --
  PARQUET:field_id: '13'
tip_amount: double
  -- field metadata --
  PARQUET:field_id: '14'
tolls_amount: double
  -- field metadata --
  PARQUET:field_id: '15'
improvement_surcharge: double
  -- field metadata --
  PARQUET:field_id: '16'
total_amount: double
  -- field metadata --
  PARQUET:field_id: '17'
congestion_surcharge: double
  -- field metadata --
  PARQUET:field_id: '18'
airport_fee: double
  -- field metadata --
  PARQUET:field_id: '19'
-- schema metadata --
iceberg.schema: '{"type":"struct","schema-id":0,"fields":[{"id":1,"name":' + 1236
```

@amogh-jahagirdar The idea is to implement a visitor to convert this Arrow schema to an Iceberg one with the correct IDs.
@rdblue @Fokko just wondering - when is this feature/fix due? I see work in #6505 has already taken place, but I'm not sure where things stand. I care greatly about the ability to read externally generated Iceberg tables natively via Python. Is there any way I can contribute or help make 0.4.0 happen with the fix for this issue included?
@sheinbergon We're wrapping up the release, and hope to start the release process soon. I would expect within 1-2 weeks. |
@sheinbergon, we just merged #6505 which should solve this. I'm closing this. |
Apache Iceberg version
None
Query engine
Athena v3
Please describe the bug 🐞
I'm trying to read an Iceberg table written by Athena (engine v3); I'm not sure which Iceberg version it uses.
When running this code, it fails on the
df_iceberg = scan.to_pandas()
call (I also tried scan.to_arrow()).
I'm able to list all the files belonging to the table, so this
files = [task.file.file_path for task in scan.plan_files()]
works. The error is the following:
An example table can be created like that: