Allowing setting sort order of parquet files without specifying the schema #7317
Comments
I think this is a good first issue as all the code exists, it is just a matter of hooking it up and writing some tests (I think) |
Hello, I would like to work on this ticket to prepare for #7354, but I find it a little hard to hook all the code up. In here it raises the plan error; in here it infers the schema. I would like to move |
Hi! I also had plans to try it, but you were faster :) My idea was to completely remove the schema field checks in |
Have a try 😄. I am just confused about how to decouple this code. |
yeah, it was too naive :) Now I see that |
@akoshchiy Would you have any ideas? 😊😊 |
It almost seems like the |
@alamb What do you think of splitting |
I think this would be very challenging as the datasource module has physical plans in it as well
@judahrand -- If you are referring to https://github.com/apache/arrow-datafusion/tree/main/datafusion/core/src/datasource/file_format, it may be challenging given that it depends on ExecutionPlan (which is in datafusion-core at the moment). I think the key dependency is that FileScanConfig has embedded What I would recommend is updating the FileScanConfig if possible (or making an equivalent in LogicalPlan) so that it represents sort order in terms of |
FYI @helenosheaa I think implementing this DataFusion feature would help us reproduce issues more easily downstream in IOx (as we could use datafusion-cli to scan files in a way closer to how the IOx querier and compactor do) |
Appears no one is working on this currently and the error is still occurring as of today:
I'll go ahead and take this |
take |
Thank you @devanbenz -- this will actually be super valuable to InfluxData as well (as we heavily rely on sorted parquet, but creating reproducers can't be done with pure-sql) |
…pecifying the schema (apache#12466)
* fix(planner): Allowing setting sort order of parquet files without specifying the schema
  This PR allows for the following SQL query to be passed without a schema
  create external table cpu stored as parquet location 'cpu.parquet' with order (time);
  closes apache#7317
* chore: fmt'ing
* fix: fmt
* fix: remove test that checks for error with schema
* Add some more tests
* fix: use !asc
  Co-authored-by: Andrew Lamb
* feat: clean up some testing and modify statement when building order by expr
---------
Co-authored-by: Andrew Lamb
Is your feature request related to a problem or challenge?
This is a follow on to #7036
As @bmmeijers says in #7036, datafusion can make much better plans if you tell it about the sort order of files.
It is now possible to specify the sort order of a parquet file.
However, it is not possible to specify the sort order without also specifying the full schema. This is redundant, given that the schema is stored in the parquet files and DataFusion can infer it automatically.
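For illustration, here is a sketch of the form with the schema spelled out. The table and file names follow the statement quoted in the PR linked to this issue; the column types (TIMESTAMP for time, BIGINT for v) are assumptions made for this example and should be checked against the actual file.

```sql
-- Schema given explicitly, so WITH ORDER can refer to the time column.
-- Column types below are assumed for illustration only.
CREATE EXTERNAL TABLE cpu (
    time TIMESTAMP,
    v BIGINT
)
STORED AS PARQUET
LOCATION 'cpu.parquet'
WITH ORDER (time);
```

The column list repeats information that the parquet file already carries, which is the redundancy this request is about.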
Describe the solution you'd like
I would like to be able to specify the sort order for parquet files without also specifying the schema
Given this parquet file: cpu.zip
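The statement being asked for, as quoted in the PR description linked to this issue, is sketched below. It assumes that cpu.parquet is the file extracted from cpu.zip and that it sits in the working directory.

```sql
-- No column list: the schema has to be inferred from the parquet file,
-- while WITH ORDER still declares that the data is sorted by time.
CREATE EXTERNAL TABLE cpu
STORED AS PARQUET
LOCATION 'cpu.parquet'
WITH ORDER (time);
```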
I would like this to work and produce a table with both columns, v and time, ordered by time.
Describe alternatives you've considered
No response
Additional context
No response