Only Allow Declaring Partition Columns in `PARTITIONED BY` Clause #9465

devinjdangelo · 2024-03-05T13:04:22Z

Is your feature request related to a problem or challenge?

DataFusion implicitly reorders columns in table definitions so that PARTITIONED BY columns are stored at the end of the underlying parquet files. This leads to very confusing behavior when selecting directly out of the parquet file, as the parquet schema has a different column order than the CREATE TABLE statement.

DataFusion's SQL dialect declares partitioned tables like this:

CREATE EXTERNAL TABLE test(partition varchar, trace_id varchar)
STORED AS parquet
PARTITIONED BY (partition)
LOCATION '/tmp/test/';

Note that the partition column is declared with the other columns and again later in the PARTITIONED BY clause. Internally, DataFusion reorders table schemas so that partition columns come at the end, which is a common convention. This leads to confusing examples like #7892 and the following:

DataFusion CLI v36.0.0
❯ create external table test(partition varchar, trace_id varchar) stored as parquet partitioned by (partition) location '/tmp/test/';
0 rows in set. Query took 0.001 seconds.

❯ insert into test values ('a','x'),('b','y'),('c','z');
+-------+
| count |
+-------+
| 3     |
+-------+
1 row in set. Query took 0.016 seconds.

❯ select * from test;
+----------+-----------+
| trace_id | partition |
+----------+-----------+
| a        | x         |
| c        | z         |
| b        | y         |
+----------+-----------+
3 rows in set. Query took 0.002 seconds.

Since you declared the order as (partition varchar, trace_id varchar) you would expect this order to be respected when inserting data, but instead it is silently reordered so that the partition column comes at the end.
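
To make the mismatch concrete, here is a minimal Python sketch of the behavior described above. It is a toy model, not DataFusion code: the helper names `reorder_schema` and `insert_positional` are hypothetical, and the sketch assumes only what this issue reports, namely that partition columns are moved to the end of the schema and that `INSERT ... VALUES` binds values by position against the reordered schema.

```python
def reorder_schema(declared, partition_cols):
    """Toy model: move partition columns to the end of the schema."""
    regular = [c for c in declared if c not in partition_cols]
    parts = [c for c in declared if c in partition_cols]
    return regular + parts


def insert_positional(schema, rows):
    """Toy model: bind each row's values to the schema columns by position."""
    return [dict(zip(schema, row)) for row in rows]


# Column order as written in the CREATE EXTERNAL TABLE statement.
declared = ["partition", "trace_id"]

# Order the values are actually bound against (assumption: partition columns last).
schema = reorder_schema(declared, partition_cols={"partition"})

rows = insert_positional(schema, [("a", "x"), ("b", "y"), ("c", "z")])
print(schema)   # ['trace_id', 'partition']
print(rows[0])  # {'trace_id': 'a', 'partition': 'x'}
```

In this model the value `'a'`, which the user intended for `partition`, ends up in `trace_id`, which matches the `select *` output shown above.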

Describe the solution you'd like

Rework the CREATE EXTERNAL TABLE syntax so that partition columns can only be declared in the PARTITIONED BY clause. The above example then becomes:

CREATE EXTERNAL TABLE test(trace_id varchar)
STORED AS parquet
PARTITIONED BY (partition varchar)
LOCATION '/tmp/test/';

This leaves much less room for confusion about the ordering of the columns when inserting values. This also follows the syntax of HiveQL, see: https://cwiki.apache.org/confluence/display/hive/languagemanual+ddl#LanguageManualDDL-PartitionedTables

Describe alternatives you've considered

We could instead drop the convention of moving the partitioned by columns to the end of the schema and respect the ordering of columns that the user declares.

Additional context

No response


devinjdangelo · 2024-03-08T18:21:13Z

Thanks for the interest in picking this up @Lordworms!

Since this would be a breaking change and possibly overlap/conflict with #9369, we should make sure that we have consensus on the plan before getting too deep into it. cc @alamb and @metesynnada if you have any thoughts on this.

alamb · 2024-03-09T09:48:18Z

If we can keep the current behavior as well, I think this is a good idea (i.e. if this is backwards compatible).

So specifically, both of the following SQL statements would result in the same table schema (trace_id varchar, partition varchar):

CREATE EXTERNAL TABLE test(partition varchar, trace_id varchar)
STORED AS parquet
PARTITIONED BY (partition) -- no type specified here
LOCATION '/tmp/test/';

CREATE EXTERNAL TABLE test(trace_id varchar) 
STORED AS parquet
PARTITIONED BY (partition varchar)
LOCATION '/tmp/test/';
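
The equivalence between the two statements above can be sketched in Python as a small schema-resolution step. This is an illustrative model under stated assumptions, not the DataFusion parser: `resolve_schema` is a hypothetical helper, an untyped `PARTITIONED BY` entry is assumed to take its type from the column list, and rejecting a column that is typed in both places is a choice made for this sketch only.

```python
def resolve_schema(columns, partitioned_by):
    """Return the final [(name, type), ...] schema with partition columns last.

    columns:        [(name, type), ...] from the column list
    partitioned_by: each item is either "name" (type looked up in the column
                    list) or (name, type) (declared only in PARTITIONED BY)
    """
    declared = dict(columns)
    partition_fields = []
    for item in partitioned_by:
        if isinstance(item, str):
            # Existing syntax: PARTITIONED BY (partition)
            if item not in declared:
                raise ValueError(f"partition column {item!r} has no declared type")
            partition_fields.append((item, declared[item]))
        else:
            # Proposed syntax: PARTITIONED BY (partition varchar)
            name, dtype = item
            if name in declared:
                # Assumption for this sketch: typing a column twice is an error.
                raise ValueError(f"column {name!r} is declared twice")
            partition_fields.append((name, dtype))
    partition_names = {name for name, _ in partition_fields}
    regular = [(n, t) for n, t in columns if n not in partition_names]
    return regular + partition_fields


old_style = resolve_schema(
    [("partition", "varchar"), ("trace_id", "varchar")], ["partition"]
)
new_style = resolve_schema([("trace_id", "varchar")], [("partition", "varchar")])
print(old_style == new_style)  # True
```

Both calls produce `[("trace_id", "varchar"), ("partition", "varchar")]`, which is the backwards-compatible outcome described in this comment.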

alamb · 2024-03-09T09:49:19Z

If we need to break backwards compatibility, I think it should be discussed more widely

devinjdangelo · 2024-03-09T13:39:59Z

> If we need to break backwards compatibility, I think it should be discussed more widely

I think that we should either:

1. break backwards compatibility and throw an error when a column is declared as both a regular column and a partition column
2. rework the internal logic of partitioned ListingTables so they respect the original order the columns were declared in by the user, even if some are partition columns

I favor the first option and believe it would be much easier to implement. We could perhaps go for a phased approach where we deprecate the existing syntax with a warning of why it is not recommended, though maintaining both syntaxes would be more complex.

With all that said if we are against a breaking change, we could simply update documentation to increase visibility into the existing behavior and especially clarify that partition columns must be moved to the end when inserting data.
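
The difference between the phased approach and an immediate breaking change can be sketched as a validation step with a strictness switch. This is a hedged illustration only: `check_partition_declaration` and its `strict` flag are hypothetical names, not part of DataFusion, and the message text is invented for the example.

```python
import warnings


def check_partition_declaration(column_names, partition_names, strict=False):
    """Flag columns declared both in the column list and in PARTITIONED BY.

    strict=False models a deprecation phase (warn, statement still accepted);
    strict=True models the breaking change (statement rejected).
    Returns the list of doubly declared column names.
    """
    duplicated = [name for name in partition_names if name in column_names]
    if not duplicated:
        return duplicated
    message = (
        f"partition column(s) {duplicated} are also declared in the column list; "
        "declare them only in PARTITIONED BY"
    )
    if strict:
        raise ValueError(message)
    warnings.warn(message, DeprecationWarning, stacklevel=2)
    return duplicated


# Deprecation phase: the existing syntax still works but emits a warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_partition_declaration(["partition", "trace_id"], ["partition"])
print(len(caught))  # 1

# Proposed syntax: nothing is declared twice, so nothing is flagged.
print(check_partition_declaration(["trace_id"], ["partition"]))  # []
```

Flipping `strict` to `True` in a later release would turn the warning into an error, at the cost of carrying both code paths in the meantime.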

alamb · 2024-03-11T18:55:17Z

> break backwards compatibility and throw an error when a column is declared as both a regular column and a partition column

This seems reasonable to me

Here is what I suggest to get some sort of consensus about this potentially breaking change:

1. Send a note to the mailing list ("we are considering a breaking change to CREATE EXTERNAL TABLE ... please comment on the ticket if you have an opinion")
2. Maybe also cross post to discord / slack

The idea is to get all the feedback into GitHub while making sure as many people as possible have a chance to weigh in

MohamedAbdeen21 · 2024-03-13T11:48:02Z

> If we can keep both the current behavior as well I think this is a good idea (aka if this is backwards compatible)

I think this is very much possible with minimal changes to the parser

devinjdangelo · 2024-03-13T13:10:07Z

> I think this is very much possible with minimal changes to the parser

@MohamedAbdeen21 if you are interested in working on a PR to implement this, that would be much appreciated 🙏. I think that would be a great step to take prior to gathering feedback/consensus on making the change breaking.

MohamedAbdeen21 · 2024-03-13T13:28:21Z

I'll try to get a draft PR up before EoD

devinjdangelo added the enhancement label Mar 5, 2024

github-actions bot assigned Lordworms Mar 8, 2024

alamb mentioned this issue Mar 9, 2024

writing to partitioned table uses the wrong column as partition key #7892

Closed

MohamedAbdeen21 mentioned this issue Mar 13, 2024

Allow declaring partition columns in PARTITION BY clause, backwards compatible #9599

Merged

Lordworms removed their assignment Mar 15, 2024

alamb closed this as completed in #9599 Mar 30, 2024
