-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: expression support for PARTITION BY #4032
Conversation
9c8bf3c
to
83aaef9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly LGTM. One issue raised inline.
.map(kf -> kf.ref().equals(proposedKey.ref())) | ||
.orElse(false); | ||
|
||
return !namesMatch && !isRowKey(columnRef); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think comparing with the ROWKEY is actually safe, e.g.:
CREATE STREAM LEFT (A STRING, B STRING) WITH (KEY='A');
CREATE STREAM RIGHT (C STRING, D STRING) WITH (KEY='C');
CREATE STREAM JOINED AS SELECT LEFT.*, RIGHT.* FROM LEFT JOIN RIGHT ON L.B=R.D PARTITION BY LEFT.ROWKEY;
In this case, when we hit the repartition, the stream will be partitioned on B/D, so we do want to repartition, even though the column ref is ROWKEY.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @big-andy-coates - I think @rodesai brings up a good point here, though I think it's an existing bug. I think we should fix it in the short-term by re-introducing the "unnecessary" repartition step in the case of isRowKey
but before I do that I wanted to run it past you since you implemented the original optimization (#2735).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest adding this as a QTT test to ensure we're handling this correctly. Do what you need to do to get this functionally working.
Raise a github issue to track the outstanding piece.
I think we can clean this up once we have cleaner code around the handling of schemas and duplicating ROWKEY and ROWTIME into the value schema.
Specifically, once we clean up / remove the use of source-aliases on schema fields and have arbitrary key column names, then I think this will be easy to solve because the ROWKEY
name won't be ambiguous.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
f8bfefa - I will fix this in a future PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
fixes #4018
Description
This change allows for arbitrary expression support for PARTITION BY expressions:
CodeGenRunner
to evaluate the partition by expression instead of the legacy way which had custom handling for column referencesNote that if a partition by expression is used, the rowkey will not correspond to any column in the value.
Testing done
Reviewer checklist