feat(planner): Allowing setting sort order of parquet files without specifying the schema #12466

devanbenz · 2024-09-14T17:09:31Z

Which issue does this PR close?

Rationale for this change

This allows for setting the order upon creation of tables using parquet files without having to specify the schema. Since parquet already has the schema readily available in the metadata this is a relatively quick fix that will enable downstream usage to be less cumbersome, specifically, when setting up reproduction of issues.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

alamb

Thank you @devanbenz -- I am sorry, I thought I had left a review of this PR before, but apparently I had not hit submit

alamb · 2024-09-16T18:08:51Z

datafusion/sql/src/statement.rs

@@ -1028,8 +1030,26 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
            .into_iter()
            .collect();

-        let schema = self.build_schema(columns)?;
-        let df_schema = schema.to_dfschema_ref()?;
+        let df_schema = match file_type.as_str() {


I am sorry for the delayed feedback @devanbenz -- I swear I typed this feedback but I must not have clicked "submit"

Basically my concerns about this approach are twofold:

This code assumes the parquet file is on the local filesystem (when for many systems it may be on remote object storage)

It also adds a dependency on the parquet format to SQL parsing. Since parquet has quite a few dependencies, this new dependency is likely not ideal for systems that use DataFusion only for SQL parsing (like dask-sql, for example)

Perhaps you could delay the creation of the ORDER BY until the table provider is resolved?

The table provider: https://github.com/apache/datafusion/blob/2521043ddcb3895a2010b8e328f3fa10f77fc094/datafusion/expr/src/planner.rs#L35-L34

Once the table provider is resolved, the table's schema can be known

Another benefit of this approach is that it would work for all formats, not just parquet

I am sorry for the delayed feedback @devanbenz -- I swear I typed this feedback but I must not have clicked "submit"

That's alright, it happens to me all the time 😅

Perhaps you could delay the creation of the ORDER BY until the table provider is resolved?

Sounds good, I like this idea. 👍
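The suggestion above (resolve the ORDER BY only once the schema is known) can be sketched in plain Rust. This is a minimal illustration, not DataFusion code: `UnresolvedOrder`, `ResolvedOrder`, `UnknownColumn`, and `resolve_order` are hypothetical names, and the schema is modelled as a list of column names that is assumed to come from whatever the table provider inferred.

```rust
/// Hypothetical stand-in for a `WITH ORDER` item as the SQL layer might carry
/// it before any schema is available: just a column name and a direction.
#[derive(Debug, Clone, PartialEq)]
struct UnresolvedOrder {
    column: String,
    asc: bool,
}

/// Hypothetical stand-in for a resolved sort key: an index into the schema.
#[derive(Debug, Clone, PartialEq)]
struct ResolvedOrder {
    index: usize,
    asc: bool,
}

/// Error returned when an ordering column is not part of the schema.
#[derive(Debug, PartialEq)]
struct UnknownColumn(String);

/// Resolve the ordering against a schema that only becomes known later,
/// for example after it has been inferred from the file's metadata.
fn resolve_order(
    schema: &[&str],
    order: &[UnresolvedOrder],
) -> Result<Vec<ResolvedOrder>, UnknownColumn> {
    order
        .iter()
        .map(|item| {
            schema
                .iter()
                .position(|name| *name == item.column.as_str())
                .map(|index| ResolvedOrder { index, asc: item.asc })
                .ok_or_else(|| UnknownColumn(item.column.clone()))
        })
        .collect()
}

fn main() {
    // Pretend this schema was inferred by the table provider.
    let inferred_schema = ["id", "time", "value"];
    let order = vec![UnresolvedOrder { column: "time".to_string(), asc: true }];
    println!("{:?}", resolve_order(&inferred_schema, &order));
}
```

Because the resolution step only needs column names, the same shape would apply to any file format whose schema can be inferred, which is the benefit mentioned in the review.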

devanbenz · 2024-09-17T18:29:27Z

Converting to a draft until I have the final implementation done 👍

…ecifying the schema This PR allows for the following SQL query to be passed without a schema create external table cpu stored as parquet location 'cpu.parquet' with order (time); closes apache#7317

devanbenz · 2024-09-19T14:13:15Z

@alamb I have this working but I'm unsure if the original implementation is working as expected. Shouldn't the times be descending in this first selection?

> create external table cpu(time timestamp) stored as parquet location '/Users/devan/Downloads/cpu.parquet' with order (time desc);

0 row(s) fetched. 
Elapsed 0.013 seconds.

> select * from cpu
;
+---------------------+
| time                |
+---------------------+
| 2023-03-01T00:00:00 |
| 2023-03-02T00:00:00 |
+---------------------+
2 row(s) fetched. 
Elapsed 0.018 seconds.

> drop table cpu;
0 row(s) fetched. 
Elapsed 0.002 seconds.

> create external table cpu(time timestamp) stored as parquet location '/Users/devan/Downloads/cpu.parquet' with order (time asc);
0 row(s) fetched. 
Elapsed 0.004 seconds.

> select * from cpu;
+---------------------+
| time                |
+---------------------+
| 2023-03-01T00:00:00 |
| 2023-03-02T00:00:00 |
+---------------------+
2 row(s) fetched. 
Elapsed 0.008 seconds.

Note: this is using the already existing code

EDIT: Reading through the documentation I see:

It’s important to understand that using the WITH ORDER clause in the CREATE EXTERNAL TABLE statement only specifies the order in which the data should be read from the external file. If the data in the file is not already sorted according to the specified order, then the results may not be correct.

So it sounds like this is working as expected then 🫡
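To make the documented behaviour concrete: a declared order is a promise about the file, not an instruction to sort it. The sketch below is a hypothetical helper (`matches_declared_order` is not part of DataFusion) that checks whether a column's values really are in the declared direction, using two timestamps shaped like the rows in the session above.

```rust
/// Returns true when `values` really are in the declared direction.
/// `descending` mirrors a hypothetical `WITH ORDER (col DESC)` declaration.
fn matches_declared_order<T: PartialOrd>(values: &[T], descending: bool) -> bool {
    values.windows(2).all(|pair| {
        if descending {
            pair[0] >= pair[1]
        } else {
            pair[0] <= pair[1]
        }
    })
}

fn main() {
    // Stored oldest first; ISO-8601 strings compare correctly as text.
    let times = ["2023-03-01T00:00:00", "2023-03-02T00:00:00"];
    println!("asc declaration holds:  {}", matches_declared_order(&times, false));
    println!("desc declaration holds: {}", matches_declared_order(&times, true));
}
```

A file like this one satisfies an ascending declaration but not a descending one, so declaring `time desc` does not change the rows returned; it only tells the engine something that is not true of the data.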

devanbenz · 2024-09-19T14:42:05Z

datafusion/sql/src/statement.rs

-            );
+            let mut results = vec![];
+            for expr in order_exprs {
+                for ordered_expr in expr {


I've noticed that a lot of this codebase prefers maps over for loops like this. I personally think for loops are easier to read, but I can modify this to use a map and collect instead if that is preferred. Not sure if either is "more performant".

I think it would be nice to use map / collect to follow the same conventions, but I don't think it is required

It also took me a while to get used to the map/collect pattern. At first I thought it was just functional language hipster stuff, but then I realized that it is often a key optimization (when possible, collect can figure out how big the target container is and do a single allocation rather than having to grow)

Sounds good I think that I'll modify this to use a map/collect so I can be hip (and get a single allocation) 😎
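For readers comparing the two styles, here is a small self-contained sketch. `OrderItem` and both functions are hypothetical and much simpler than the code in the PR; the point is only that a `map` over a slice iterator reports an exact `size_hint`, which `collect` can use to size the vector up front.

```rust
/// Hypothetical stand-in for a parsed ORDER BY item.
#[derive(Debug, Clone, PartialEq)]
struct OrderItem {
    column: String,
    asc: bool,
}

fn render(item: &OrderItem) -> String {
    format!("{} {}", item.column, if item.asc { "ASC" } else { "DESC" })
}

/// Loop version: the vector starts empty and may reallocate as it grows.
fn render_with_loop(items: &[OrderItem]) -> Vec<String> {
    let mut results = vec![];
    for item in items {
        results.push(render(item));
    }
    results
}

/// Iterator version: `collect` can consult the iterator's `size_hint`.
fn render_with_collect(items: &[OrderItem]) -> Vec<String> {
    items.iter().map(render).collect()
}

fn main() {
    let items = vec![
        OrderItem { column: "a".to_string(), asc: true },
        OrderItem { column: "b".to_string(), asc: false },
    ];
    // The hint is exact here because the source is a slice iterator.
    println!("size_hint = {:?}", items.iter().map(render).size_hint());
    assert_eq!(render_with_loop(&items), render_with_collect(&items));
}
```

One caveat: when nested collections are flattened (as with the nested loop in this diff), the size hint may be less precise, so the single-allocation benefit is not guaranteed in every case.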

alamb

Thank you @devanbenz -- I had this checked out to review and I wanted to write a few more tests. Rather than explaining the tests in words and then having you push them, I figured I would just push them directly

alamb · 2024-09-20T14:56:37Z

datafusion/sql/src/statement.rs

-            );
+            let mut results = vec![];
+            for expr in order_exprs {
+                for ordered_expr in expr {


I think it would be nice to use map / collect to follow the same conventions, but I don't think it is required

It also took me a while to get used to the map/collect pattern. At first I thought it was just functional language hipster stuff, but then I realized that it is often a key optimization (when possible, collect can figure out how big the target container is and do a single allocation rather than having to grow)

alamb · 2024-09-20T14:57:51Z

datafusion/core/src/datasource/listing_table_factory.rs

+            // specifically for parquet file format.
+            // See: https://github.com/apache/datafusion/issues/7317
+            None => {
+                let schema = options.infer_schema(session_state, &table_path).await?;


This looks great. Thank you @devanbenz -- perfect

alamb · 2024-09-20T15:06:17Z

datafusion/sqllogictest/test_files/create_external_table.slt

+
+# query should fail with bad column
+statement error
+CREATE EXTERNAL TABLE t STORED AS parquet LOCATION '../../parquet-testing/data/alltypes_plain.parquet' WITH ORDER (foo);


Another reason this will fail is that there is already a table named t -- so it is probably good to check the actual error
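The point about checking the actual error can be illustrated with a toy model. Everything here (`Catalog`, `CreateError`, `create`) is hypothetical and unrelated to DataFusion's real API; it only shows why a bare "this should fail" assertion can pass for the wrong reason.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
enum CreateError {
    TableExists(String),
    UnknownColumn(String),
}

/// Toy catalog (hypothetical): remembers which table names are taken.
struct Catalog {
    tables: HashSet<String>,
}

impl Catalog {
    fn new() -> Self {
        Catalog { tables: HashSet::new() }
    }

    /// Create a table with a single ordering column, rejecting duplicate
    /// table names first and unknown ordering columns second.
    fn create(&mut self, name: &str, schema: &[&str], order_by: &str) -> Result<(), CreateError> {
        if self.tables.contains(name) {
            return Err(CreateError::TableExists(name.to_string()));
        }
        if !schema.contains(&order_by) {
            return Err(CreateError::UnknownColumn(order_by.to_string()));
        }
        self.tables.insert(name.to_string());
        Ok(())
    }
}

fn main() {
    let mut catalog = Catalog::new();
    catalog.create("t", &["id"], "id").unwrap();
    // Both calls fail, but only the second one exercises the bad-column path.
    println!("{:?}", catalog.create("t", &["id"], "foo"));
    println!("{:?}", catalog.create("t2", &["id"], "foo"));
}
```

A test that only asserts failure cannot tell these two cases apart, so matching on the specific error (or using a fresh table name) is what makes the test meaningful.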

alamb · 2024-09-20T15:06:19Z

datafusion/sqllogictest/test_files/create_external_table.slt

+
+# query should succeed
+statement ok
+CREATE EXTERNAL TABLE t STORED AS parquet LOCATION '../../parquet-testing/data/alltypes_plain.parquet' WITH ORDER (id);


Can you also add a test that shows the table is actually ordered correctly?

alamb · 2024-09-20T15:08:31Z

datafusion/sqllogictest/test_files/create_external_table.slt

+statement ok
+CREATE EXTERNAL TABLE t STORED AS parquet LOCATION '../../parquet-testing/data/alltypes_plain.parquet' WITH ORDER (id DESC NULLS FIRST);
+
+## Verify that the table is created with a sort order. Explain should show output_ordering=[id@0 DESC NULLS FIRST]


I think this test shows one small bug (the output ordering is DESC not DESC NULLS FIRST)

I think this is due to the fact that this PR computes nulls first like this:

let nulls_first = ordered_expr.nulls_first.unwrap_or(true);

But the SQL planner computes it like this:

datafusion/datafusion/sql/src/expr/order_by.rs

Lines 105 to 107 in 94d178e

// when asc is true, by default nulls last to be consistent with postgres

// postgres rule: https://www.postgresql.org/docs/current/queries-order.html

nulls_first.unwrap_or(!asc),
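The difference between the two defaults is easiest to see side by side. The helper names below are hypothetical; the two expressions are the ones quoted in this comment (`unwrap_or(true)` from the PR and `unwrap_or(!asc)` from the SQL planner).

```rust
/// Default used by the PR at this point: nulls first unless stated otherwise.
fn nulls_first_always_true(nulls_first: Option<bool>) -> bool {
    nulls_first.unwrap_or(true)
}

/// Default used by the SQL planner: nulls last when ascending and
/// nulls first when descending, following the Postgres rule.
fn nulls_first_postgres(asc: bool, nulls_first: Option<bool>) -> bool {
    nulls_first.unwrap_or(!asc)
}

fn main() {
    let cases = [
        ("id", true, None),
        ("id DESC", false, None),
        ("id ASC NULLS FIRST", true, Some(true)),
        ("id DESC NULLS LAST", false, Some(false)),
    ];
    for (label, asc, explicit) in cases {
        println!(
            "{}: always-true default = {}, postgres default = {}",
            label,
            nulls_first_always_true(explicit),
            nulls_first_postgres(asc, explicit)
        );
    }
}
```

The two functions agree whenever NULLS FIRST or NULLS LAST is written explicitly and for a bare DESC; they disagree only for an ascending column with no explicit nulls clause.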

alamb

Thanks again @devanbenz -- I think there is one very small bug to fix in this PR, but then it will be good to go.

I am happy to fix the bug too and push to your branch; Just let me know

datafusion/sql/src/statement.rs

alamb · 2024-09-20T15:26:52Z

Another test I thought of is testing ordering with more than one column WITH ORDER (a, b)

Co-authored-by: Andrew Lamb <[email protected]>

…by expr

alamb

Love it -- thank you so much @devanbenz

…pecifying the schema (apache#12466) * fix(planner): Allowing setting sort order of parquet files without specifying the schema This PR allows for the following SQL query to be passed without a schema create external table cpu stored as parquet location 'cpu.parquet' with order (time); closes apache#7317 * chore: fmt'ing * fix: fmt * fix: remove test that checks for error with schema * Add some more tests * fix: use !asc Co-authored-by: Andrew Lamb <[email protected]> * feat: clean up some testing and modify statement when building order by expr --------- Co-authored-by: Andrew Lamb <[email protected]>

github-actions bot added the sql SQL Planner label Sep 14, 2024

devanbenz marked this pull request as draft September 14, 2024 17:12

devanbenz marked this pull request as ready for review September 14, 2024 19:06

alamb reviewed Sep 16, 2024

alamb mentioned this pull request Sep 16, 2024

DataFusion weekly project plan (Andrew Lamb) - Sep 16, 2024 #12494

Closed

8 tasks

devanbenz marked this pull request as draft September 17, 2024 18:28

github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Sep 19, 2024

fix(planner): Allowing setting sort order of parquet files without sp…

2b39944

…ecifying the schema This PR allows for the following SQL query to be passed without a schema create external table cpu stored as parquet location 'cpu.parquet' with order (time); closes apache#7317

devanbenz force-pushed the fix/7317-allow-sort-no-schema branch from 986dafe to 2b39944 Compare September 19, 2024 13:59

devanbenz added 2 commits September 19, 2024 09:01

chore: fmt'ing

8a65625

fix: fmt

356a5b5

fix: remove test that checks for error with schema

a3042a1

devanbenz marked this pull request as ready for review September 19, 2024 14:21

devanbenz requested a review from alamb September 19, 2024 14:21

devanbenz commented Sep 19, 2024

Add some more tests

95e0341

alamb reviewed Sep 20, 2024

devanbenz and others added 2 commits September 20, 2024 10:27

fix: use !asc

6d432a3

Co-authored-by: Andrew Lamb <[email protected]>

feat: clean up some testing and modify statement when building order …

fc59587

…by expr

alamb approved these changes Sep 20, 2024

alamb merged commit 515a64e into apache:main Sep 21, 2024
24 checks passed
