-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(planner): Allowing setting sort order of parquet files without specifying the schema #12466
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @devanbenz -- I am sorry I thought i had left a review of thsi PR before but apparently I had not hit submit
datafusion/sql/src/statement.rs
Outdated
@@ -1028,8 +1030,26 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> { | |||
.into_iter() | |||
.collect(); | |||
|
|||
let schema = self.build_schema(columns)?; | |||
let df_schema = schema.to_dfschema_ref()?; | |||
let df_schema = match file_type.as_str() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am sorry for the delayed feeback @devanbenz -- I swear I typed this feedback but i must not have clicked "submit"
Basically my concerns about this approach are twofold:
- This code assumes the parquet file is on the local filesystem (when for many systems it may be on remote object storage)
- It also adds a dependency in sql parsing to the parquet format. Since
parquet
has quite a few dependencies, this new dependency is likely non ideal for systems that are using DataFusion for sql parsing (like dask-sql for example)
Perhaps you could delay the creation of the ORDER BY until the table provider is resolved?
The table provider: https://github.com/apache/datafusion/blob/2521043ddcb3895a2010b8e328f3fa10f77fc094/datafusion/expr/src/planner.rs#L35-L34
Once the table provider is resolved then the schema's table can be known
Another benefit of this approach is that it would work for all formats, not just parquet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am sorry for the delayed feeback @devanbenz -- I swear I typed this feedback but i must not have clicked "submit"
Thats alright -happens to me all the time 😅
Perhaps you could delay the creation of the ORDER BY until the table provider is resolved?
Sounds good, I like this idea. 👍
Converting to a draft until I have the final implementation done 👍 |
…ecifying the schema This PR allows for the following SQL query to be passed without a schema create external table cpu stored as parquet location 'cpu.parquet' with order (time); closes apache#7317
986dafe
to
2b39944
Compare
@alamb I have this working but I'm unsure if the original implementation is working as expected. Shouldn't the times be descending in this first selection?
Note: this is using the already existing code EDIT: Reading through the documentation I see:
So it sounds like this is working as expected then 🫡 |
datafusion/sql/src/statement.rs
Outdated
); | ||
let mut results = vec![]; | ||
for expr in order_exprs { | ||
for ordered_expr in expr { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've noticed that a lot of this codebase prefers maps
over for loops like this. I personally think for loops are easier to read but I can modify this to use a map and collect instead if that its preferred. Not sure if either is "more performant".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be nice to use map
/ collect
to follow the same conventions, but I don't think it is required
It also took me a while to get used to the map/collect pattern. At first I thought it was just functional language hipster stuff, but then I realized that it is often a key optimization (When possible, collect
can figure out how but the target container is and do a single allocation rather than having to grow)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good I think that I'll modify this to use a map/collect so I can be hip (and get a single allocation) 😎
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank yoU @devanbenz -- I had this checked out to review and I wanted to write a few more tests. Rather than explaining the tests in words and then having you have to push them I figured I would just push them directly
datafusion/sql/src/statement.rs
Outdated
); | ||
let mut results = vec![]; | ||
for expr in order_exprs { | ||
for ordered_expr in expr { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be nice to use map
/ collect
to follow the same conventions, but I don't think it is required
It also took me a while to get used to the map/collect pattern. At first I thought it was just functional language hipster stuff, but then I realized that it is often a key optimization (When possible, collect
can figure out how but the target container is and do a single allocation rather than having to grow)
// specifically for parquet file format. | ||
// See: https://github.com/apache/datafusion/issues/7317 | ||
None => { | ||
let schema = options.infer_schema(session_state, &table_path).await?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great. Thank you @devanbenz -- perfect
|
||
# query should fail with bad column | ||
statement error | ||
CREATE EXTERNAL TABLE t STORED AS parquet LOCATION '../../parquet-testing/data/alltypes_plain.parquet' WITH ORDER (foo); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another reason this will fail is that there is already a table named t
-- so it is probably good to check the actual error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
|
||
# query should succeed | ||
statement ok | ||
CREATE EXTERNAL TABLE t STORED AS parquet LOCATION '../../parquet-testing/data/alltypes_plain.parquet' WITH ORDER (id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also add a test that shows the table is actually ordered correctly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can do
statement ok | ||
CREATE EXTERNAL TABLE t STORED AS parquet LOCATION '../../parquet-testing/data/alltypes_plain.parquet' WITH ORDER (id DESC NULLS FIRST); | ||
|
||
## Verify that the table is created with a sort order. Explain should show output_ordering=[id@0 DESC NULLS FIRST] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this test shows one small bug (the output ordering is DESC
not DESC NULLS FIRST
)
I think this is due to the fact that this PR computes nulls first like this:
let nulls_first = ordered_expr.nulls_first.unwrap_or(true);
But the SQL planner computes it like this:
datafusion/datafusion/sql/src/expr/order_by.rs
Lines 105 to 107 in 94d178e
// when asc is true, by default nulls last to be consistent with postgres | |
// postgres rule: https://www.postgresql.org/docs/current/queries-order.html | |
nulls_first.unwrap_or(!asc), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again @devanbenz -- I think there is one very small bug to fix in this PR, but then it will be good to go.
I am happy to fix the bug too and push to your branch; Just let me know
Another test I thought of is testing ordering with more than one column |
Co-authored-by: Andrew Lamb <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love it -- thank you so much @devanbenz
…pecifying the schema (apache#12466) * fix(planner): Allowing setting sort order of parquet files without specifying the schema This PR allows for the following SQL query to be passed without a schema create external table cpu stored as parquet location 'cpu.parquet' with order (time); closes apache#7317 * chore: fmt'ing * fix: fmt * fix: remove test that checks for error with schema * Add some more tests * fix: use !asc Co-authored-by: Andrew Lamb <[email protected]> * feat: clean up some testing and modify statement when building order by expr --------- Co-authored-by: Andrew Lamb <[email protected]>
Which issue does this PR close?
Closes #7317
Rationale for this change
This allows for setting the order upon creation of tables using parquet files without having to specify the schema. Since parquet already has the schema readily available in the metadata this is a relatively quick fix that will enable downstream usage to be less cumbersome, specifically, when setting up reproduction of issues.
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?