-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add WindowStart and WindowEnd UDFs #1993
Add WindowStart and WindowEnd UDFs #1993
Conversation
I think the approach works. You can only get at the window information after the fact, so it makes sense to select it from the output. The window is still derived from the input, so i don't see any issues. |
I am ok with the outlined approach. AFAICT there's no real prior art from the world of traditional SQL because the concept of windowing doesn't exist there. FWIW, Calcite (another streaming SQL tool) follows a similar approach, cf. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for this (and +100 for rejecting the SMT approach 🙃 )
Also just to be clear:
This change broke our own clickstream demo and possibly other customer use-cases.
It broke things for many other people, all following the same pattern as our clickstream demo (and a very common pattern to take) - KSQL to aggregate data, and use the window start time as the basis for analytics on that aggregated data.
I don't think that |
The name for us will be |
Overall this approach makes sense to me, as do the names WindowStart and WindowEnd. |
@@ -0,0 +1,159 @@ | |||
{ | |||
"comments": [ | |||
"Test cases covering WindowStartTime UDAF" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add a case to validate that the result can be used in expressions (e.g. something like SELECT ... WindowStart() / 1000 AS START_SECONDS ...
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add a test case.
@big-andy-coates I like the idea of using |
I discussed this this @hjafarpour because I was initially confused why we need a aggregation UDF bu not an regular UDF. We discussed the overall design and it seems this approach won't work atm. It might be better to implement a regular UDF that is applied to Thoughts? |
I think the problem there is needing 2 queries. If you want to export the results of a windowed aggregation to an external sink, you'd need a whole other query to do so - which isn't ideal. Using a UDAF basically gives you a marker in the intermediate schema of the aggregate node that you can use to swap the bogus result of the UDAF with the window start (or end). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should work fine. The code will replace the value of the UDAF with the value extracted from the key and as long as the type return type for the UDAF is the same as the type for window start time schemas will be correct.
Left a minor comment.
public WindowSelectMapper( | ||
final Map<Integer, KsqlAggregateFunction> aggFunctionsByIndex) { | ||
this.windowSelects = aggFunctionsByIndex.entrySet().stream() | ||
.filter(e -> e.getValue().getFunctionName().equals(WindowStartTimeKudaf.getFunctionName())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be case insensitive comparison.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
Thanks to clarifying @rodesai and @hjafarpour. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
} | ||
|
||
@Test | ||
public void shouldNoUseSelectMapperForNonWindowed() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: shouldNotUse
CONFLICT (content): Merge conflict in ksql-engine/src/test/java/io/confluent/ksql/structured/SchemaKStreamTest.java CONFLICT (content): Merge conflict in ksql-engine/src/test/java/io/confluent/ksql/planner/plan/ProjectNodeTest.java CONFLICT (content): Merge conflict in ksql-engine/src/test/java/io/confluent/ksql/planner/plan/AggregateNodeTest.java CONFLICT (content): Merge conflict in ksql-engine/src/main/java/io/confluent/ksql/structured/SchemaKTable.java CONFLICT (content): Merge conflict in ksql-engine/src/main/java/io/confluent/ksql/structured/SchemaKStream.java CONFLICT (content): Merge conflict in ksql-engine/src/main/java/io/confluent/ksql/structured/SchemaKGroupedStream.java CONFLICT (content): Merge conflict in ksql-engine/src/main/java/io/confluent/ksql/structured/QueuedSchemaKStream.java CONFLICT (content): Merge conflict in ksql-engine/src/main/java/io/confluent/ksql/planner/plan/AggregateNode.java
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@big-andy-coates We need to get this in 5.1. This currently has mater as target. Can you also change the target to 5.1.x branch before merging. |
* add WindowStartTime UDAF (cherry picked from commit 82b603d)
🎉 awesome, thanks for doing this @big-andy-coates ! |
Description
Fixes #1674
Requires #2024 to be merged before it can be completed. Still to do (These will be done in a follow up PR):
This PR adds two new UDAFs to allow access to the bounds of a windowed key:
WindowStart()
andWindowEnd()
. These can be used to access the window bounds of the current window e.g.Background
Before v5.0 the message coming out from
WINDOW
ed queries would have a timestamp equal to the start of the window. This was a bug in streams, which has been fixed, meaning that for v.50 the messages from the same queries came out with a timestamp equal to the timestamp of the message that caused the record to be output, i.e. not equal to the start of the window. This change broke our own clickstream demo and possibly other customer use-cases.Our own clickstream demo was broken because we use a Connect SMT to extract the records timestamp and make it available, (as the window start time), in ElasticSearch.
Possible fixes
It was proposed that we could fix the regression by:
ROWWINDOW
implicit column to windowed queries, e.g.SELECT id, ROWWINDOW->start FROM test WINDOW TUMBLING (SIZE 5 SECONDS) GROUP BY id;
WindowStart
WindowEnd
methods, e.g.SELECT id, WindowStart() FROM test WINDOW TUMBLING (SIZE 5 SECONDS) GROUP BY id;
Of these, I've looked to implement the
WindowStart
approach. Why? I'll tell you:ROWWINDOW
doesn't make conceptual sense as it is not a feature of the source data. The other implicit columns related to the source stream. For example, consider a query likeSELECT id, ROWKEY, ROWWINDOW->start FROM test WINDOW ...
. In this caseROWKEY
is the key of row from thetest
source. ButROWWINDOW
would not be fromtest
, it would be the window details of the output, not the intput. So I think this would makeROWWINDOW
very confusing / counterintuitive.To put this another way, selecting
ROWTIME
andROWKEY
would be selecting fields on the input schema. Where asROWWINDOW
would be on the output schema. You can have sources that might or might not already be windowed. So if you have a source that is already windowed, a user might understandably expectROWWINDOW
to be the window information of the input, not the output.Adding a SMT is tricky. We can't easily add a KSQL specific SMT to Apache Connect. We can have one implemented in the KSQL repo that's packaged to a jar that would need to be dropped into Connect's class path to work - but this sucks in terms of UX.
That leaves
WindowStart()
... now this probably also falls into the same category asROWWINDOW
in that normally theselect
statement is selecting data from the source, where as this is selecting from the output. However, for some reason it seems to fit better - and its also easier to implement. Though I'm still open to other suggestions!Solution
I've implemented
WindowStart
andWindowEnd
UDAFs. However, because this is selecting from the output, it can't actually be implemented using just UDAF. So the method itself has no implementation. It's just a placeholder so thatWindowSelectMapper
can detect where in the output row the window bounds should be put. This is necessary because theWindowed
key is only available post theaggregate
step, i.e. in the secondSelectMapper
.Testing done
Manual + Json based test.
Reviewer checklist