Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allowing setting sort order of parquet files without specifying the schema #7317

Closed
alamb opened this issue Aug 17, 2023 · 13 comments · Fixed by #12466
Closed

Allowing setting sort order of parquet files without specifying the schema #7317

alamb opened this issue Aug 17, 2023 · 13 comments · Fixed by #12466
Assignees
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@alamb
Copy link
Contributor

alamb commented Aug 17, 2023

Is your feature request related to a problem or challenge?

This is a follow on to #7036

As @bmmeijers says in #7036, datafusion can make much better plans if you tell it about the sort order of files.

It is possible now to specify the order of a parquet file

$ datafusion-cli
DataFusion CLI v29.0.0
❯ create external table cpu(time timestamp) stored as parquet location 'cpu.parquet' with order (time desc);
0 rows in set. Query took 0.001 seconds.

❯ select * from cpu;
+---------------------+
| time                |
+---------------------+
| 2022-09-30T12:55:00 |
+---------------------+
1 row in set. Query took 0.003 seconds.

❯ explain select * from cpu order by time desc;
+---------------+-----------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                        |
+---------------+-----------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Sort: cpu.time DESC NULLS FIRST                                                                                             |
|               |   TableScan: cpu projection=[time]                                                                                          |
| physical_plan | ParquetExec: file_groups={1 group: [[Users/alamb/Downloads/cpu.parquet]]}, projection=[time], output_ordering=[time@0 DESC] |
|               |                                                                                                                             |
+---------------+-----------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.001 seconds.

However, it is not possible to specify the time without also specifying all of the schema, which is redundant given the schema is stored in the parquet files:

❯ create external table cpu stored as parquet location 'cpu.parquet' with order (time desc);
Error during planning: Provide a schema before specifying the order while creating a table.

Even though DataFusion can infer the schema automatically

❯ create external table cpu stored as parquet location 'cpu.parquet';
0 rows in set. Query took 0.002 seconds.

❯ select * from cpu;
+-----+---------------------+
| v   | time                |
+-----+---------------------+
| 1.0 | 2023-03-01T00:00:00 |
| 2.0 | 2023-03-02T00:00:00 |
+-----+---------------------+
2 rows in set. Query took 0.002 seconds.

Describe the solution you'd like

I would like to be able to specify the sort order for parquet files without also specifying the schema

Given this parquet file: cpu.zip

I would like this to work and produce a table both columns v and time ordered by time:

❯ create external table cpu stored as parquet location 'cpu.parquet' with order (time);
Error during planning: Provide a schema before specifying the order while creating a table.

Describe alternatives you've considered

No response

Additional context

No response

@alamb
Copy link
Contributor Author

alamb commented Aug 17, 2023

I think this is a good first issue as all the code exists, it is just a matter of hooking it up and writing some tests (I think)

@parkma99
Copy link
Contributor

parkma99 commented Aug 25, 2023

Hello , I would like work this ticket to prepare #7354 . But I find it's a little hard to hook all code up.

https://github.com/apache/arrow-datafusion/blob/ea9144e6597593c09a4ef0b71a4da1cdfaca8249/datafusion/sql/src/statement.rs#L660-L665

in here raise the plan error Provide a schema before specifying the order while creating a table.

https://github.com/apache/arrow-datafusion/blob/ea9144e6597593c09a4ef0b71a4da1cdfaca8249/datafusion/core/src/datasource/listing_table_factory.rs#L187-L190

in here infer the schema.

I would like to move build_order_by after infer the schema, but build_order_by is member function of SqlToRel

I would like to move infer the schema before build_order_by, but infer_schema need SessionState .

@akoshchiy
Copy link
Contributor

akoshchiy commented Aug 25, 2023

Hi! I also had the plans to try it, but you were faster :) My idea was to completely remove the schema field checks in build_order_by and move that checks after schema infer.

@parkma99
Copy link
Contributor

Have a try 😄, I just confusing how to decouple this code.

@akoshchiy
Copy link
Contributor

yeah, it was too naive :) Now I see that build_order_by requires a lot of stuff from the DFSchema.

@parkma99
Copy link
Contributor

parkma99 commented Sep 6, 2023

@akoshchiy Would you have some ideas ? 😊😊

@judahrand
Copy link

It almost seems like the datasource module should live in datafusion-common?

@judahrand
Copy link

@alamb What do you think of splitting file_format out into its own crate similar to what you're doing with physical_plan? The infer_schema method is really needed in datafusion-sql to make this issue solvable (I think).

@alamb
Copy link
Contributor Author

alamb commented Sep 11, 2023

It almost seems like the datasource module should live in datafusion-common?

I think this would be very challenging as the datasource module has physical plans in it as well

@alamb What do you think of splitting file_format out into its own crate similar to what you're doing with

@judahrand -- If you are referring to https://github.com/apache/arrow-datafusion/tree/main/datafusion/core/src/datasource/file_format it may be challenging given that it depends on ExecutionPlan (which is in datafusion-core at the moment)

I think the key dependency is that FileScanConfig has embedded PhysicalExpr which are not yet created during planning

https://github.com/apache/arrow-datafusion/blob/4abae3b4fadeadc8a368155e14186016117529c8/datafusion/core/src/datasource/physical_plan/file_scan_config.rs#L73C1-L108

What I would recommend is updating the FileScanConfig if possible (or making an equivalent in LogicalPlan) so that it represents sort order in terms of Expr (the logical expressions) and then converting Expr --> PhysicalExpr during physical planning

@alamb
Copy link
Contributor Author

alamb commented Jan 18, 2024

FYY @helenosheaa I think implementing this DataFusion feature would help us potentially reproduce issues more easily downstream in IOx (as we could use datafusion-cli to scan files more closely to the way the IOx querier and compactor do)

@devanbenz
Copy link
Contributor

Appears no one is working on this currently and the error is still occurring as of today:

DataFusion CLI v41.0.0
> create external table cpu stored as parquet location 'cpu.parquet' with order (time);
Error during planning: Provide a schema before specifying the order while creating a table.

I'll go ahead and take this

@devanbenz
Copy link
Contributor

take

@alamb
Copy link
Contributor Author

alamb commented Sep 15, 2024

Thank you @devanbenz -- this will actually be super valuable to InfluxData as well (as we heavily rely on sorted parquet, but creating reproducers can't be done with pure-sql)

devanbenz added a commit to devanbenz/datafusion that referenced this issue Sep 19, 2024
…ecifying the schema

This PR allows for the following SQL query to be passed without a schema

create external table cpu stored as parquet location 'cpu.parquet' with order (time);

closes apache#7317
@alamb alamb closed this as completed in 515a64e Sep 21, 2024
bgjackma pushed a commit to bgjackma/datafusion that referenced this issue Sep 25, 2024
…pecifying the schema (apache#12466)

* fix(planner): Allowing setting sort order of parquet files without specifying the schema
This PR allows for the following SQL query to be passed without a schema

create external table cpu stored as parquet location 'cpu.parquet' with order (time);

closes apache#7317

* chore: fmt'ing

* fix: fmt

* fix: remove test that checks for error with schema

* Add some more tests

* fix: use !asc

Co-authored-by: Andrew Lamb <[email protected]>

* feat: clean up some testing and modify statement when building order by expr

---------

Co-authored-by: Andrew Lamb <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers
Projects
None yet
5 participants