fix: unify behavior for PARTITION BY and GROUP BY #3982
Conversation
@agavra it seems that the reason for this change is more about implementation details than user experience! It's more intuitive to partition by one of the columns in the current schema rather than by a field that may not exist in the result schema! Consider the following example:

```sql
CREATE STREAM foo AS SELECT col1*col2 AS new_col1, col3*100 AS new_col2
FROM bar
PARTITION BY new_col1;
```

With this change, the same statement would have to be written as:

```sql
CREATE STREAM foo AS SELECT col1*col2 AS new_col1, col3*100 AS new_col2
FROM bar
PARTITION BY col1;
```

and the result would be partitioned on a column that no longer exists in the result schema!
@hjafarpour - originally, I thought the same thing, but this isn't motivated by implementation. Also, @big-andy-coates convinced me:
@agavra I'm still not convinced. With this change, the query can no longer be written as:

```sql
CREATE STREAM foo AS SELECT col1*col2 AS new_col1, col3*100 AS new_col2
FROM bar
PARTITION BY new_col1;
```

and it should instead be written as the following:

```sql
CREATE STREAM foo AS SELECT col1*col2 AS new_col1, col3*100 AS new_col2
FROM bar
PARTITION BY col1*col2;
```

which we don't support yet!
LGTM
@agavra worth calling ^^^ out in the breaking change description?
@hjafarpour, I think you're still thinking in terms of only the value columns being part of the schema. We're moving away from this with the work on primitive keys (almost there) and structured keys. With this work done, the key columns are as much a part of the schema as the value columns. If you take into account that the key columns are in the schema, then the example starts to look more like:

```sql
CREATE STREAM foo AS SELECT col1*col2 AS new_col1, col3*100 AS new_col2
FROM bar
PARTITION BY col1*col2 AS Id;
```

And the resulting schema has `Id` as a key column alongside `new_col1` and `new_col2` as value columns (made-up types). But actually, this is duplicating the key column in the value, which is a waste of space if the downstream system doesn't need it, so your initial query might be better rewritten as:

```sql
CREATE STREAM foo AS SELECT col3*100 AS new_col2
FROM bar
PARTITION BY col1*col2 AS new_col1;
```

with a result schema that has `new_col1` as the key column and `new_col2` as the only value column. However, I agree we'll need expression support in the `PARTITION BY` clause.
Thanks @agavra, this looks great.
One thing to check through: we should avoid the `RepartitionNode` if it's `PARTITION BY ROWKEY`.
And as discussed, we'd ideally need expression support in `PARTITION BY`, which has been requested anyway and brings things in line with `GROUP BY`. (Separate PR -> maybe create a GitHub issue to track and add to the project.)
```diff
@@ -188,6 +188,10 @@ protected AstNode visitQuery(final Query node, final C context) {
     final Optional<GroupBy> groupBy = node.getGroupBy()
         .map(exp -> ((GroupBy) rewriter.apply(exp, context)));

+    // don't rewrite the partitionBy because we expect it to be
+    // exactly as it was (a single, un-aliased, column reference)
+    final Optional<Expression> partitionBy = node.getPartitionBy();
```
As above, may need to change...
```diff
         + "| Logger: InsertQuery_1.S1"));
-    assertThat(lines[2], equalTo("\t\t\t\t > [ PROJECT ] | Schema: [ROWKEY STRING KEY, COL0 BIGINT, COL1 STRING"
+    assertThat(lines[2],
+        containsString("[ REKEY ] | Schema: [TEST1.ROWKEY STRING KEY, TEST1.ROWTIME BIGINT, TEST1.ROWKEY STRING, TEST1.COL0 BIGINT, TEST1.COL1 STRING, TEST1.COL2 DOUBLE] "
```
Why is the schema now prefixed?
It just uses the source schema, which is prefixed for some reason; you can look at the rest of this test class to see this. As long as the sink schema is the same, it should be fine, I think.
Changing this is out of scope, though we should probably do it.
ksql-functional-tests/src/test/resources/query-validation-tests/key-field.json
```diff
@@ -78,6 +77,7 @@ query
     (WINDOW windowExpression)?
     (WHERE where=booleanExpression)?
     (GROUP BY groupBy)?
+    (PARTITION BY partitionBy=identifier)?
```
much nicer!
@big-andy-coates yes, you are right in describing my thinking. The key in the current model is more of a meta field than an equivalent part of the schema. I know that with the structured key support this will change.
BREAKING CHANGE: this change makes it so that PARTITION BY statements use the _source_ schema, not the value/projection schema, when selecting the value to partition by. This is consistent with GROUP BY, and with standard SQL for GROUP BY. Any statement that previously used PARTITION BY may need to be reworked.

BREAKING CHANGE: when querying with EMIT CHANGES and PARTITION BY, the PARTITION BY clause should now come before EMIT CHANGES.
Thanks for the review @big-andy-coates!
I don't think we should do that - even with this implementation, we don't repartition, because of (your) optimization that's done inside the repartition node.
I think the logical plan should not have a repartition step if it's `PARTITION BY ROWKEY`. If anything, the logic should only exist in the logical plan, and the physical plan should throw if a repartition still gets through to it.
Hey @big-andy-coates, I am confused about this: why does the fact that the key columns will be part of the schema play a role here and change the example query? Also, I feel that we are adding more functionality to the `PARTITION BY` clause. If we agree on this, why would the example query @hjafarpour wrote need to be written differently? Finally, the rewriting of the query you suggest for avoiding the duplication of columns should be part of the optimizer.
Thanks for looking into this @vpapavas!
I don't think this is necessarily true; today, the result of the
Today there is an "implicit"
BREAKING CHANGE: this change makes it so that PARTITION BY statements
use the source schema, not the value/projection schema, when selecting
the value to partition by. This is consistent with GROUP BY, and
standard SQL for GROUP BY. Any statement that previously used PARTITION
BY may need to be reworked.
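As a sketch of the semantic change, using the illustrative stream and column names from the discussion above (these are examples only, not from the codebase):

```sql
-- Before this change: PARTITION BY resolved names against the
-- projection (value) schema, so the alias new_col1 was valid:
CREATE STREAM foo AS SELECT col1*col2 AS new_col1, col3*100 AS new_col2
FROM bar
PARTITION BY new_col1;

-- After this change: PARTITION BY resolves names against the source
-- schema of bar, consistent with GROUP BY, so a source column is named:
CREATE STREAM foo AS SELECT col1*col2 AS new_col1, col3*100 AS new_col2
FROM bar
PARTITION BY col1;
```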
Fixes #2701
Description
This change does two things: (1) it pulls `PARTITION BY` into the Query object so that it mimics the `GROUP BY` behavior and makes it clear that it is a function of the source data (not something post-projection), and (2) it introduces a `RepartitionNode` in the Logical Plan that handles partitions before anything else, instead of using the projection schema.

A nice side effect is that the syntax is now `... PARTITION BY FOO EMIT CHANGES` instead of `... EMIT CHANGES PARTITION BY FOO`.
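The clause-ordering change can be sketched as follows (the stream and column names here are hypothetical placeholders):

```sql
-- New ordering: PARTITION BY comes before EMIT CHANGES
CREATE STREAM s AS SELECT * FROM bar PARTITION BY FOO EMIT CHANGES;

-- Old ordering, no longer accepted after this change:
-- CREATE STREAM s AS SELECT * FROM bar EMIT CHANGES PARTITION BY FOO;
```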
Testing done
Updated existing test coverage to reflect the new behavior.
Reviewer checklist