Primitive REPEATED fields not contained in LIST annotated groups aren't read as lists by record reader #6648
Comments
repeated_no_list.parquet.gz The fix yields the proper result: |
TIL, thanks! One question, however. The spec actually says: "A repeated field that is neither contained by a LIST- or MAP-annotated group nor annotated by LIST or MAP should be interpreted as a required list of required elements where the element type is the type of the field." So is it allowable for the max def level to be 1 here? If the elements are required, shouldn't max_def be 0? TBH I'm not sure why the spec is worded that way... nulls are clearly detectable, so I'd think required vs. optional could be deduced from the max def level. In fact, arrow-cpp/pyarrow seems to read the given file just fine (although the arrow schema does seem to indicate that all elements are not null).
Then again, lists confuse me. So even if required, is it that a REPEATED field has 0 to many entries? |
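To make the level question above concrete, here is a minimal sketch (not parquet-rs code) of how maximum definition and repetition levels follow from the repetition of each field on a column's path, using the usual Parquet rule that every non-required field adds one definition level and every repeated field adds one repetition level. The `Field` class, the `max_levels` helper, and the field name `values` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Field:
    # Hypothetical stand-in for a schema node; not a parquet-rs type.
    name: str
    repetition: str  # "required", "optional", or "repeated"


def max_levels(path):
    """Return (max_def_level, max_rep_level) for the column reached via `path`."""
    # Every optional or repeated field on the path adds one definition level.
    max_def = sum(1 for f in path if f.repetition != "required")
    # Only repeated fields add a repetition level.
    max_rep = sum(1 for f in path if f.repetition == "repeated")
    return max_def, max_rep


# A bare repeated primitive at the top level (the case in this issue).
bare = [Field("values", "repeated")]

# The standard three-level LIST encoding with an optional list and optional elements.
three_level = [
    Field("values", "optional"),
    Field("list", "repeated"),
    Field("element", "optional"),
]

print(max_levels(bare))         # (1, 1)
print(max_levels(three_level))  # (3, 1)
```

Under this reading, a bare repeated field has a max definition level of 1 because the REPEATED field itself contributes a level: definition level 0 marks a record with zero entries, and definition level 1 marks a present element. There is no level left over to express a null element, which would be consistent with the spec's "required list of required elements" wording.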
FYI, we recently ran into a similar problem and started a discussion here [1]; also refer to [2]. [1] https://lists.apache.org/thread/s6b25j3x26009v054yqjov0f1z49ctqj |
This is not a normal LIST type as specified by the Parquet spec, where a LIST annotation on a group type is required. Should we disable writing lists like this in parquet-rs? We should follow the three-level list encoding. |
But IIUC this form is permitted by the spec. It just can't be mixed with the LIST annotation. |
I believe this is specific to the record reader, i.e. the non-arrow reader, and so I have updated the title to reflect this. Let me know if I am mistaken. #2394 is also related. |
Correct, this is only for the record reader. TBH I totally forgot that I had opened the same issue you mentioned two years ago :) Every time we encounter such a file with one of our customers I look into it again; this time I finally decided to just fix it. |
This is allowed by the format; other writers sometimes generate such files (e.g. the customer file that tripped our code was created with parquet-mr). |
Thanks @zeevm! I think the provided file is worth adding to https://github.com/apache/parquet-testing as a good example for interoperability test. |
It seems that parquet-cli (backed by parquet-mr) cannot read it:
|
Strange. An old parquet-tools 1.11.1 jar I have can read it, however.
|
Describe the bug
Primitive REPEATED fields not contained in LIST annotated groups should be read as lists according to the format but aren't.
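As an illustration of the expected behavior, the following sketch assembles a top-level repeated primitive column into one list per record from its repetition levels, definition levels, and values. It assumes max definition level 1 and max repetition level 1 (a bare `repeated` primitive at the schema root); the function name `assemble_repeated` and the sample data are hypothetical, and this is not how the parquet-rs record reader is implemented.

```python
def assemble_repeated(rep_levels, def_levels, values):
    """Group a bare repeated primitive column into one list per record.

    Assumes max_def == 1 and max_rep == 1, i.e. a top-level `repeated` primitive.
    """
    records = []
    remaining = iter(values)
    for rep, definition in zip(rep_levels, def_levels):
        if rep == 0:
            # Repetition level 0 starts a new record.
            records.append([])
        if definition == 1:
            # Definition level 1 means an element is present; 0 means an empty list.
            records[-1].append(next(remaining))
    return records


# Three records: [1, 2], an empty list, and [3, 4, 5].
rep_levels = [0, 1, 0, 0, 1, 1]
def_levels = [1, 1, 0, 1, 1, 1]
values = [1, 2, 3, 4, 5]

print(assemble_repeated(rep_levels, def_levels, values))
# [[1, 2], [], [3, 4, 5]]
```

Every record yields a list, possibly empty, and never a null, which matches the "required list of required elements" interpretation quoted from the spec in the comments.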
To Reproduce
Expected behavior
Additional context