Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow declaring partition columns in PARTITION BY clause, backwards compatible #9599

Merged
merged 11 commits into from
Mar 30, 2024

Conversation

MohamedAbdeen21
Copy link
Contributor

@MohamedAbdeen21 MohamedAbdeen21 commented Mar 13, 2024

Which issue does this PR close?

Closes #9465.

Rationale for this change

Allow HiveQL syntax when creating external tables with partition columns only defined inside the PARTITION BY clause, while maintaining original syntax (backwards compatible).

What changes are included in this PR?

SQL Parser now parses:

CREATE EXTERNAL TABLE test(trace_id varchar) 
STORED AS parquet
PARTITIONED BY (partition varchar)
LOCATION '/tmp/test/';

as if it was:

CREATE EXTERNAL TABLE test(trace_id varchar, partition varchar) 
STORED AS parquet
PARTITIONED BY (partition)
LOCATION '/tmp/test/';

This means that queries re-defining columns in PARTITIONED BY are rejected with Schema error

Are these changes tested?

Yes

Are there any user-facing changes?

User is now able to use HiveQL syntax when creating partitioned external tables.

@github-actions github-actions bot added the sql SQL Planner label Mar 13, 2024
@@ -679,7 +679,25 @@ impl<'a> DFParser<'a> {
Keyword::PARTITIONED => {
self.parser.expect_keyword(Keyword::BY)?;
ensure_not_set(&builder.table_partition_cols, "PARTITIONED BY")?;
builder.table_partition_cols = Some(self.parse_partitions()?);
let peeked = self.parser.peek_nth_token(2);
if peeked == Token::Comma || peeked == Token::RParen {
Copy link
Contributor Author

@MohamedAbdeen21 MohamedAbdeen21 Mar 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works fine but feels hacky. Considering replacing this if with a more robust function that tries to apply a parsing rule and falls back (undo consumed tokens) when rule fails.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it is not immediately clear why this logic works. I think at a minimum we should add a comment explaining the reasoning of this condition.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not super familiar with sqlparser crate, but I don't think it allows rewinding tokens, we will have to implement a parsing rule that only uses peeks, which sounds really unnecessary. Will add a comment for now and maybe can find a better way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@MohamedAbdeen21 MohamedAbdeen21 Mar 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks promising. I'll open a follow up PR if it works out. Thanks for pointing it out.

Edit: misread the docs, we need a version of consume_tokens that returns true without actually consuming the tokens, because we need to parse the tokens later.

Also, would be perfect if it can match a pattern to catch mixing syntax in the clause.

@devinjdangelo
Copy link
Contributor

Thank you @MohamedAbdeen21! This looks good. I haven't had a chance to dig into this deeply yet, but I plan to sometime over the next few days when I get some time.

I think one thing we will definitely want prior to merging this is test cases exploring possible edge cases: e.g. validating it isn't possible to mix the two syntaxes in any way to lead to undesired behavior.

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Mar 14, 2024
Copy link
Contributor

@devinjdangelo devinjdangelo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @MohamedAbdeen21 this is looking very close to ready (and certainly ready for wider review/discussion). I left a few comments. Would you mind marking this "ready for review" to increase visibility for other reviewers?

cc @alamb @metesynnada

@@ -679,7 +679,25 @@ impl<'a> DFParser<'a> {
Keyword::PARTITIONED => {
self.parser.expect_keyword(Keyword::BY)?;
ensure_not_set(&builder.table_partition_cols, "PARTITIONED BY")?;
builder.table_partition_cols = Some(self.parse_partitions()?);
let peeked = self.parser.peek_nth_token(2);
if peeked == Token::Comma || peeked == Token::RParen {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it is not immediately clear why this logic works. I think at a minimum we should add a comment explaining the reasoning of this condition.

@@ -175,7 +175,7 @@ pub(crate) type LexOrdering = Vec<OrderByExpr>;
/// [ WITH HEADER ROW ]
/// [ DELIMITER <char> ]
/// [ COMPRESSION TYPE <GZIP | BZIP2 | XZ | ZSTD> ]
/// [ PARTITIONED BY (<column list>) ]
/// [ PARTITIONED BY (<column_definition list> | <column list>) ]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@MohamedAbdeen21 MohamedAbdeen21 marked this pull request as ready for review March 23, 2024 13:30
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @MohamedAbdeen21 -- I am sorry for the delay in reviewing this PR. It looks really nice to me.

Thank you @devinjdangelo for the review.

Is there any chance you can merge up from main so we can make sure there are no conflicts? If not I can handle it too

@alamb alamb merged commit ab88220 into apache:main Mar 30, 2024
23 checks passed
@alamb
Copy link
Contributor

alamb commented Mar 30, 2024

Thanks again @MohamedAbdeen21

Lordworms pushed a commit to Lordworms/arrow-datafusion that referenced this pull request Apr 1, 2024
… compatible (apache#9599)

* Draft allow both syntaxes

* suppress unused code error

* prevent constraints in partition clauses

* fix clippy

* More tests

* comment and prevent constraints on partition columns

* trailing whitespaces

* End-to-End test of new Hive syntax

---------

Co-authored-by: Mohamed Abdeen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sql SQL Planner sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Only Allow Declaring Partition Columns in PARTITIONED BY Clause
3 participants