
Allowing to expose sort order of parquet file to datafusion does not work #7036

Closed
bmmeijers opened this issue Jul 20, 2023 · 16 comments
Labels
bug (Something isn't working), good first issue (Good for newcomers)

Comments

@bmmeijers

bmmeijers commented Jul 20, 2023

Describe the bug

I am trying to perform a range join on two Parquet files with datafusion-cli, but it produces an inefficient query plan.

select * from points, intervals where pt between first and last order by pt;

I tried to express the sort order of both Parquet files when registering them with datafusion-cli.
However, I am stuck: the WITH ORDER (columnname ordering) clause seems unusable together with Parquet files.

To Reproduce

The points table has one numeric column and the intervals table has two. With DuckDB I generated Parquet files for both:

copy (select generate_series as pt from (select * from generate_series(1, 100000000)) t) TO 'pts.parquet' (FORMAT PARQUET);
copy (select generate_series as first, first +2 as last from (select * from generate_series (2, 100, 5))) TO 'intervals.parquet' (FORMAT PARQUET);

(I then ran both files through parquet-rewrite to add page-level statistics.)

parquet-rewrite --input pts.parquet --output pts_stats.parquet --writer-version 2.0 --statistics-enabled page
parquet-rewrite --input intervals.parquet --output intervals_stats.parquet --writer-version 2.0 --statistics-enabled page

I would expect the sort order present in both files to help the query run efficiently: the planner could pick a sort-merge join and skip the sort phase entirely, since the data is already sorted, leaving just the merge phase that scans for the overlapping parts of the tables.
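As a side note for anyone experimenting along: DataFusion exposes a session option to prefer sort-merge joins over hash joins. This is a sketch and assumes a reasonably recent DataFusion version; note the preference applies to equijoin keys, so a pure range predicate like the one above may still plan as a different join operator:

```sql
-- Ask the planner to prefer SortMergeJoin over HashJoin where applicable.
-- (Planner preference only, not a guarantee; availability depends on version.)
SET datafusion.optimizer.prefer_hash_join = false;
```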

Hence, I tried to register the parquet files with datafusion-cli and express that the files are already sorted.

This works:

CREATE EXTERNAL TABLE points
STORED AS PARQUET
LOCATION 'pts_stats.parquet';

But this does not:

CREATE EXTERNAL TABLE points
STORED AS PARQUET
LOCATION 'pts_stats.parquet'
WITH ORDER (pt ASC)
;

It errors with:

Error during planning: Provide a schema before specifying the order while creating a table.

If I try to specify the schema manually:

CREATE EXTERNAL TABLE points
(pt bigint)
STORED AS PARQUET
LOCATION 'pts_stats.parquet'
WITH ORDER (pt ASC)
;

It errors with:

Error during planning: Column definitions can not be specified for PARQUET files.

Which is not surprising.

Expected behavior

I would expect that the sort order can be expressed and that the query planner subsequently can execute a sort-merge-join, where the sort step would be unnecessary (as the data of both sides of the join is already sorted).

Additional context

No response

@bmmeijers bmmeijers added the bug Something isn't working label Jul 20, 2023
@ozankabak
Contributor

This is currently supported for CSV files but not Parquet. I think this would be a good first issue for new contributors.
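For reference, the CSV registration that already works takes roughly this shape (a sketch with a hypothetical file name, assuming the usual datafusion-cli DDL; CSV accepts WITH ORDER because an explicit schema can be given):

```sql
-- WITH ORDER is accepted here because CSV allows explicit column definitions.
CREATE EXTERNAL TABLE points_csv (pt BIGINT)
STORED AS CSV
WITH ORDER (pt ASC)
LOCATION 'pts.csv';
```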

@ozankabak ozankabak added the good first issue Good for newcomers label Jul 20, 2023
@akevinge

@ozankabak I'm new to datafusion - would love to work on this.

@ozankabak
Contributor

Great! Happy to help with reviewing 🚀

@akevinge

This is currently supported for CSV files but not Parquet. I think this would be a good first issue for new contributors.

It is supported for CSV, but the logic is a bit circular. The issue is that manually specifying the schema for a Parquet file errors, and so does omitting the schema when an ordering is given: checkmate. The reason it works for CSVs is that we allow specifying schemas for them.

Is there a reason for disallowing schemas for Parquet files? The docs say "It is not necessary to provide schema information for Parquet files," which makes it sound optional, when the implementation actually disallows it.

@ozankabak

@ozankabak
Contributor

Interesting. @metesynnada can you take a quick look and offer a suggestion?

@metesynnada
Contributor

I'm uncertain why providing a schema when creating a table is not possible. I believe @tustvold could shed some light on this matter.

@tustvold
Contributor

I don't see a reason why this isn't possible; being able to provide the schema for ListingTable is perfectly reasonable.

@metesynnada
Contributor

For some reason, it is restricted and there is no .slt test with a schema provided. I will open an issue for this.

@bmmeijers
Author

bmmeijers commented Jul 24, 2023

I am not sure how this works internally, but to me (as an end user) it makes more sense to be able to specify only the sort order, without having to restate the already-known schema.

Having to specify the schema again as an end user can only lead to inconsistencies (e.g. specifying the wrong type for a column that is present), but I am not sure whether this fits how things are structured at the moment.

Would it not be sufficient to check whether the column(s) relevant for sorting are present in the file?

@akevinge

@metesynnada

Currently schemas are always resolved for Parquet files, so there shouldn't be a need to specify the schema. I'm inclined to agree with @bmmeijers here, but you might have some context that says otherwise.

During parsing, can we just accept the ordering even though the schema is only inferred later? I'm not familiar with the implementation of WITH ORDER. Still looking at it.

@metesynnada
Contributor

There are numerous scenarios where providing a schema becomes essential, such as specifying an integer type or a primary key. Given its extensive use, it's surprising that we don't currently offer a schema provision option. A significant number of end-users will find value in this feature, even if it's not universally needed. I believe it's crucial to support this highly utilized SQL pattern.

However, providing a schema might not be a mandatory requirement for using the WITH ORDER clause. Given that the user is aware of the columns, we can validate the order information after the schema has been automatically determined.

@alamb
Contributor

alamb commented Jul 30, 2023

Given its extensive use, it's surprising that we don't currently offer a schema provision option. A significant number of end-users will find value in this feature, even if it's not universally needed. I believe it's crucial to support this highly utilized SQL pattern.

I agree -- I don't think there is any reason not to support defining a schema for Parquet files. It was historically less useful than for CSV files, since the column type information is encoded directly in Parquet files; for basic uses it is more convenient if the user doesn't have to explicitly define the columns in SQL.

I agree with @bmmeijers on the ideal end user behavior is #7036 (comment)

However, in my opinion, if the initial implementation of WITH ORDER for Parquet required the schema to be explicitly defined, that would be better than the current state, where it is simply not possible. We can always improve the behavior in a follow-on.
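Concretely, the schema-plus-order form described above would look like this, using the file from the original report (a sketch; it assumes explicit column definitions become accepted for Parquet):

```sql
-- Workaround shape: give the schema explicitly so WITH ORDER can attach to it.
CREATE EXTERNAL TABLE points (pt BIGINT)
STORED AS PARQUET
WITH ORDER (pt ASC)
LOCATION 'pts_stats.parquet';
```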

@alamb
Contributor

alamb commented Aug 13, 2023

FWIW I think after the work from @devinjdangelo in #7244 this feature should now be a matter of hooking up the code (or maybe even removing an error) and writing a test. So marking it as a good first issue.

Here is a reproducer:

cpu.zip

❯ create external table cpu stored as parquet location '/tmp/foo/cpu.parquet';
0 rows in set. Query took 0.002 seconds.

❯ select * from cpu limit 10;
+------+-----------------------------------+---------------------+
| cpu  | host1                             | time                |
+------+-----------------------------------+---------------------+
| cpu2 | MacBook-Pro-8.hsd1.ma.comcast.net | 2022-09-30T12:55:00 |
+------+-----------------------------------+---------------------+
1 row in set. Query took 0.007 seconds.

The goal is to make this work:

❯ create external table cpu stored as parquet location '/tmp/foo/cpu.parquet' with order (time);
Error during planning: Provide a schema before specifying the order while creating a table.

Possibly:

❯ create external table cpu(cpu varchar, host varchar, time timestamp) stored as parquet location '/tmp/foo/cpu.parquet' with order (time);
Error during planning: Column definitions can not be specified for PARQUET files.

@edmondop
Contributor

FWIW I think after the work from @devinjdangelo in #7244 this feature should now be a matter of hooking up the code (or maybe even removing an error) and writing a test. [...]

@alamb should we use this issue to track the fix or do we want to open a subissue ? I can start looking into this soon

@alamb
Contributor

alamb commented Aug 17, 2023

I double-checked the current behavior on main. It is now possible to specify the sort order for the Parquet file (and you can see output_ordering is correctly reflected in the plan):

❯ create external table cpu(time timestamp) stored as parquet location 'cpu.parquet' with order (time desc);
0 rows in set. Query took 0.001 seconds.

❯ select * from cpu;
+---------------------+
| time                |
+---------------------+
| 2022-09-30T12:55:00 |
+---------------------+
1 row in set. Query took 0.003 seconds.
❯ explain select * from cpu order by time desc;
+---------------+-----------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                        |
+---------------+-----------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Sort: cpu.time DESC NULLS FIRST                                                                                             |
|               |   TableScan: cpu projection=[time]                                                                                          |
| physical_plan | ParquetExec: file_groups={1 group: [[Users/alamb/Downloads/cpu.parquet]]}, projection=[time], output_ordering=[time@0 DESC] |
|               |                                                                                                                             |
+---------------+-----------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.001 seconds.

However, it is not possible to specify just the order without the schema:

❯ create external table cpu stored as parquet location 'cpu.parquet' with order (time desc);
Error during planning: Provide a schema before specifying the order while creating a table.

Per @edmondop's suggestion, I think the clearest thing is to close this ticket as complete (the sort order can be specified), and I will open a new ticket to allow specifying the order without setting the schema.

@alamb
Contributor

alamb commented Aug 17, 2023

Filed #7317 to track
