Only Allow Declaring Partition Columns in PARTITIONED BY Clause
#9465
Comments
Thanks for the interest in picking this up @Lordworms! Since this would be a breaking change and possibly overlap/conflict with #9369, we should make sure that we have consensus on the plan before getting too deep into it. cc @alamb and @metesynnada if you have any thoughts on this.
If we can keep the current behavior as well, I think this is a good idea (that is, if this is backwards compatible). Specifically, I mean the case where both of the following SQL statements result in the same table schema:
CREATE EXTERNAL TABLE test(partition varchar, trace_id varchar)
STORED AS parquet
PARTITIONED BY (partition) -- no type specified here
LOCATION '/tmp/test/';

CREATE EXTERNAL TABLE test(trace_id varchar)
STORED AS parquet
PARTITIONED BY (partition varchar)
LOCATION '/tmp/test/';
If we need to break backwards compatibility, I think it should be discussed more widely.
I think that we either should
I favor the first option and believe it would be much easier to implement. We could perhaps go for a phased approach where we deprecate the existing syntax with a warning explaining why it is not recommended, though maintaining both syntaxes would be more complex. With all that said, if we are against a breaking change, we could simply update the documentation to increase visibility into the existing behavior, and especially clarify that partition columns must be moved to the end when inserting data.
This seems reasonable to me. Here is what I suggest to get some sort of consensus about this potentially breaking change:
The idea is to get all the feedback into GitHub, while making sure as many people as possible have a chance to weigh in.
I think this is very much possible with minimal changes to the parser.
@MohamedAbdeen21 if you are interested in working on a PR to implement this, that would be much appreciated 🙏. I think that would be a great step to take prior to gathering feedback/consensus on making the change breaking.
I'll try to get a draft PR up before EoD.
Is your feature request related to a problem or challenge?
DataFusion implicitly reorders columns in table definitions so that PARTITIONED BY columns are stored at the end of the underlying parquet files. This leads to very confusing behavior when selecting directly out of the parquet file, as the parquet schema has a different column order than the order of the columns in the CREATE TABLE statement.
DataFusion's SQL dialect declares partitioned tables with the partition column listed alongside the other columns and again later in the PARTITIONED BY clause. Internally, DataFusion reorders table schemas so that partition columns come at the end, which is a common convention. This leads to confusing cases like #7892. If you declare the order as (partition varchar, trace_id varchar), you would expect this order to be respected when inserting data, but instead it is silently reordered so that the partition column comes at the end.
Describe the solution you'd like
Rework the CREATE EXTERNAL TABLE syntax to only allow partition columns to be declared in the PARTITIONED BY clause. In the above example, the partition column would then be declared once, together with its type, in PARTITIONED BY. This leaves much less room for confusion about the ordering of the columns when inserting values. This also follows the syntax of HiveQL, see: https://cwiki.apache.org/confluence/display/hive/languagemanual+ddl#LanguageManualDDL-PartitionedTables
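For illustration, the proposed form would presumably look like the second statement quoted in the discussion on this issue. This is a sketch of the requested syntax, not a statement about what any particular DataFusion release accepts.

```sql
-- Proposed syntax: the partition column is declared once, with its type,
-- in PARTITIONED BY and nowhere else.
CREATE EXTERNAL TABLE test(trace_id varchar)
STORED AS parquet
PARTITIONED BY (partition varchar)
LOCATION '/tmp/test/';
-- The written order (trace_id, then partition) already matches the
-- partition-columns-last convention, so no silent reordering is needed.
```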
Describe alternatives you've considered
We could instead drop the convention of moving the partitioned by columns to the end of the schema and respect the ordering of columns that the user declares.
Additional context
No response