Incorrect Parquet Projection For Nested Types #2453
Comments
apache/arrow-rs#1716 contains a new API that should fix this.
Now that arrow-rs 15 has been released, will DataFusion upgrade to it so this bug can be closed?
I can confirm #2631 closes this issue, although it should be noted it now runs into a different issue that will need to be triaged and fixed
However, if you disable stats collection it works correctly 🎉
I will file a follow-on ticket.
For some reason, when I use SELECT * FROM patient, it does not return data for all the columns. Only a few columns have data; the other non-struct columns come back empty. Is this because collect_stat is set to false?
Is this using the arrow 15 branch (#2631)? I would expect this behaviour from master, where projection, including no projection, is broken. However, it appears to be working correctly on the arrow 15 branch?
Describe the bug
Where part-00000-f6337bce-7fcd-4021-9f9d-040413ea83f8-c000.snappy.parquet is the parquet file provided by @kesavkolla in #2439
This fails with
The problem arises because ParquetExec passes projection indices for the arrow schema to get_record_reader_by_columns, which instead expects parquet column indices. In the presence of nested types, these are not the same thing: a nested arrow field can span several parquet columns, so the two numberings drift apart after the first nested field.
This is further complicated by apache/arrow-rs#1652 and apache/arrow-rs#1651
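To illustrate the index mismatch described above, here is a minimal sketch using a toy schema model. The `Field` enum and the `leaf_count` and `leaf_indices` helpers are hypothetical names made up for this example; they are not part of arrow-rs, parquet, or DataFusion. The sketch assumes parquet columns are numbered by leaf in depth-first order.

```rust
// Toy schema model for illustration only; NOT the arrow-rs or parquet API.
#[allow(dead_code)]
enum Field {
    Leaf(&'static str),
    Struct(&'static str, Vec<Field>),
}

// Number of leaf (primitive) columns underneath a field.
fn leaf_count(field: &Field) -> usize {
    match field {
        Field::Leaf(_) => 1,
        Field::Struct(_, children) => children.iter().map(leaf_count).sum(),
    }
}

// Translate top-level field indices into the leaf column indices they cover,
// assuming leaves are numbered depth-first across the whole schema.
fn leaf_indices(schema: &[Field], projection: &[usize]) -> Vec<usize> {
    // First leaf index of every top-level field.
    let mut offsets = Vec::with_capacity(schema.len());
    let mut next = 0;
    for field in schema {
        offsets.push(next);
        next += leaf_count(field);
    }

    let mut out = Vec::new();
    for &i in projection {
        let start = offsets[i];
        out.extend(start..start + leaf_count(&schema[i]));
    }
    out
}

fn main() {
    // Top-level fields: 0 = id, 1 = name (struct), 2 = age
    // Leaf columns:     0 = id, 1 = name.first, 2 = name.last, 3 = age
    let schema = vec![
        Field::Leaf("id"),
        Field::Struct("name", vec![Field::Leaf("first"), Field::Leaf("last")]),
        Field::Leaf("age"),
    ];

    // Projecting top-level field 2 ("age") must select leaf column 3.
    // Passing 2 straight through would select "name.last" instead.
    println!("top-level [2] -> leaves {:?}", leaf_indices(&schema, &[2]));
    println!("top-level [1] -> leaves {:?}", leaf_indices(&schema, &[1]));
}
```

In this toy schema, index 2 means "age" when counting top-level fields but "name.last" when counting leaf columns, which is the kind of confusion the issue describes.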
To Reproduce
Run the code above
Expected behavior
The code should not error