Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expression support for PARTITION BY clause #4018

Closed
agavra opened this issue Dec 2, 2019 · 2 comments · Fixed by #4032
Closed

Expression support for PARTITION BY clause #4018

agavra opened this issue Dec 2, 2019 · 2 comments · Fixed by #4032
Assignees

Comments

@agavra
Copy link
Contributor

agavra commented Dec 2, 2019

Originally posted by @big-andy-coates in #3982 (comment)


A GROUP BY is effectively a PARTITION BY, (selectKey in KS parlance?), followed by an aggregation. They two should behave in a consistent manner.

Consider the following example:

CREATE STREAM foo AS SELECT col1*col2 AS new_col1, col3*100 as new_col2 
FROM bar 
PARTITION BY new_col1;

We know that the result schema is new_col1, new_col2 and it is partitioned by the first column, new_col1.

@hjafarpour, I think you're still thinking in terms of only the value columns being part of the schema. We've moving away from this with the work on primitive keys, (almost there) and structured keys. With this work done the key columns are as much a part of the schema as the value columns.

If you take into account that the key columns are in a schema, then the example starts to look more like:

CREATE STREAM foo AS SELECT col1*col2 AS new_col1, col3*100 as new_col2 
FROM bar 
PARTITION BY col1*col2 AS Id;

And the resulting schema is something like the following, (make up types) ID INT KEY, NEW_COL1 INT, NEW_COL2 INT.

But actually, this is duplicating the key column in the value, which is a waste of space if the downstream system doesn't need it, so your initial query might be better re-written as:

CREATE STREAM foo AS SELECT col3*100 as new_col2 
FROM bar 
PARTITION BY col1*col2 AS new_col1;

with a result schema of NEW_COL1 INT KEY, NEW_COL2 INT.

However, I agree we'll need expression support in the PARTITION BY -> @agavra, do we get that for free with your change promoting this to a proper logic node? Or is this something we'll need to add? If it's the former, we should add some QTT tests covering this.

@rayzyar
Copy link

rayzyar commented Feb 26, 2020

@rayzyar
Copy link

rayzyar commented Feb 26, 2020

@agavra Is this fix also available in https://hub.docker.com/r/confluentinc/cp-ksql-server?
My work is based on https://github.com/confluentinc/cp-helm-charts setup

Just found this #1039 (comment)
When would it be released to confluentinc/cp-ksql-server? Any estimation?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants