fix: append additional fields to row values in KsqlMaterialization.java to fix pull query filtering #7336

Merged
merged 11 commits into from
Apr 5, 2021

Conversation

Contributor

@cprasad1 cprasad1 commented Apr 1, 2021

Description

Fixes #7312 and other similar situations by appending additional fields to row values in KsqlMaterialization.java
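
A minimal sketch of the idea, not the actual `KsqlMaterialization` code: when a filter predicate refers to a key column or to the row timestamp, the value columns alone are not enough to evaluate it, so the row handed to the filter is widened first. The names `IntermediateRowSketch` and `intermediateRow`, and the column layout (value columns, then row time, then key columns), are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration; names and column layout are assumptions.
final class IntermediateRowSketch {

  // Returns a new row: value columns, then the row timestamp, then the key columns.
  static List<Object> intermediateRow(List<Object> key, List<Object> value, long rowTime) {
    final List<Object> row = new ArrayList<>(value);
    row.add(rowTime);
    row.addAll(key);
    return row;
  }

  public static void main(String[] args) {
    final List<Object> key = List.of("k1");
    final List<Object> value = List.of(42);

    // A key filter in the spirit of WHERE ID = 'k1'; the key is assumed to sit at index 2.
    final Predicate<List<Object>> keyFilter = r -> r.size() > 2 && "k1".equals(r.get(2));

    // Value columns only: the key is not visible, so the filter rejects the row.
    System.out.println(keyFilter.test(value));
    // Widened row: the key is visible, so the filter accepts the row.
    System.out.println(keyFilter.test(intermediateRow(key, value, 100L)));
  }
}
```

In this sketch the filter sees nothing to match against until the key is appended, which is one way a pull query could end up returning zero rows.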

Testing done

  • RQTT
  • Unit Testing

Reviewer checklist

  • Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

@cprasad1 cprasad1 requested a review from a team as a code owner April 1, 2021 03:38
@cprasad1 cprasad1 requested a review from AlanConfluent April 1, 2021 08:37
@@ -174,7 +174,7 @@ public MaterializedWindowedTable windowed() {
   final Builder<WindowedRow> builder = ImmutableList.builder();
 
   for (final WindowedRow row : result) {
-    filterAndTransform(row.windowedKey(), row.value(), row.rowTime())
+    filterAndTransform(row.windowedKey(), getIntermediateRow(row), row.rowTime())
Contributor Author

Unsure if we even need to append the extra columns for windowed rows. Windowed state stores might need more testing

Contributor

Could you elaborate a bit more here? Is that because the windowed key may have the needed columns for filtering?

Contributor Author

@cprasad1 cprasad1 Apr 1, 2021

@guozhangwang my understanding is that windowed tables always need an aggregation, so they have a state store backing the aggregated table. If that is the case, then we don't need to append this extra metadata as KsqlMaterialization is sophisticated enough to handle those cases (we have test cases for that). I noticed that we follow a similar pattern for windowed rows in ProjectOperator and SelectOperator of generating intermediate rows on which filters and transformations can be applied. I added these fields as a hedge against potential cases that we might miss (obviously it comes at a cost). That being said, I have a couple of questions for you:

  1. Are there any types of windowed tables that are not queryable today that we want to be able to query?
  2. Can Windowed tables be derived without doing any aggregations? (Specifically, GROUP BY aggregations)

Member

I haven't tested it, but try a windowed query with a GROUP BY and a HAVING clause. I think that might trigger KsqlMaterialization. In that case, presumably you could say HAVING WINDOWSTART > 20 and would therefore need to have those columns.

Contributor

+1 to @AlanConfluent. I think that today, if you have a WINDOW BY + GROUP BY + HAVING, you would have a first windowed table from the windowBy+groupBy aggregation, and then a second windowed table from the HAVING as a filtering condition. So in that sense, not all windowed tables are generated by aggregations; they can also be generated by stateless table operators applied to other existing windowed tables.

Member

It would be good to have this as a RQTT test where the having clause mentions windowstart or windowend

{"row":{"columns":["F"]}}

]}
]
Contributor Author

@cprasad1 cprasad1 Apr 1, 2021

Currently, there are no additional tests for windowed materialization. Did we intend to make any new type of windowed table queryable with all these changes, @AlanConfluent?

Member

You could always test a windowed case like

"CREATE TABLE AGGREGATE AS SELECT ID, COUNT(1) AS COUNT FROM INPUT WINDOW TUMBLING(SIZE 1 SECOND) GROUP BY ID HAVING COUNT(1) > 2;",

We don't have a lot of tests with the having clause for pull queries and I think it might trigger the KsqlMaterialization transform logic to apply the filtering. Since they can apply for group bys I assume they work for windowed tables as well. Normal where clauses presumably don't touch KsqlMaterialization when a group by is in place since the materialization happens after the filter has been applied, so that might not need additional testing.

But in general, you're right that a lot of windowed logic has already been tested fairly well in https://github.com/confluentinc/ksql/blob/master/ksqldb-functional-tests/src/test/resources/rest-query-validation-tests/pull-queries-against-materialized-aggregates.json since windows require a group by.
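
To make the semantics of the suggested statement concrete, here is a rough sketch of a 1-second tumbling-window count followed by a HAVING-style filter over the aggregated rows. It is plain Java rather than ksqlDB code; `TumblingHavingSketch`, `windowStart`, `countHaving`, and the `id@windowStart` key encoding are hypothetical, and timestamps are assumed to be non-negative milliseconds.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of WINDOW TUMBLING + GROUP BY + HAVING semantics.
final class TumblingHavingSketch {

  // Start of the tumbling window containing the timestamp (non-negative timestamps assumed).
  static long windowStart(long timestampMs, long sizeMs) {
    return timestampMs - (timestampMs % sizeMs);
  }

  // Counts events per "id@windowStart", then drops counts that are not above the threshold.
  static Map<String, Long> countHaving(
      String[] ids, long[] timestampsMs, long sizeMs, long minExclusive) {
    final Map<String, Long> counts = new LinkedHashMap<>();
    for (int i = 0; i < ids.length; i++) {
      final String windowedKey = ids[i] + "@" + windowStart(timestampsMs[i], sizeMs);
      counts.merge(windowedKey, 1L, Long::sum);
    }
    // The HAVING step: a stateless filter applied to rows that are already aggregated.
    counts.values().removeIf(count -> count <= minExclusive);
    return counts;
  }

  public static void main(String[] args) {
    final String[] ids = {"a", "a", "a", "a", "b"};
    final long[] timestamps = {0L, 100L, 900L, 1000L, 200L};
    // Only key "a" in the window starting at 0 has more than 2 events.
    System.out.println(countHaving(ids, timestamps, 1000L, 2L));
  }
}
```

The point of the sketch is that the HAVING condition is evaluated after the aggregation, on the aggregated row, which is why it may exercise filtering logic that a plain WHERE clause would not.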

Member

These are good cases. It would be good to do a windowed table + group by + having clause mentioning windowstart. That's one last case I don't see.

Contributor Author

We don't allow WINDOWSTART there at the moment, so it's not a testable case. The specific error message is: "Window bounds column WINDOWSTART can only be used in the SELECT clause of windowed aggregations and can not be passed to aggregate functions."

Contributor Author

Related: #4397

Contributor

I'm still wondering whether @AlanConfluent's query is possible:

CREATE TABLE AGGREGATE AS SELECT ID, COUNT(1) AS COUNT FROM INPUT WINDOW TUMBLING(SIZE 1 SECOND) GROUP BY ID HAVING COUNT(1) > 2;

Note that it does not try to group by window start/end, yet we should still be able to pull-query it with conditions on other columns?

Contributor

@guozhangwang guozhangwang left a comment

Just wondering what's the difference for windowed stores, for my own education. Otherwise the fix lgtm.

 materialization = new KsqlMaterialization(
     inner,
     SCHEMA,
-    ImmutableList.of(project, filter)
+    ImmutableList.of(filter, project)
Member

Did you swap these because this is a more realistic ordering?

Contributor Author

yes
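
One way to see why the order of the two transforms matters, sketched with hypothetical names (`TransformOrderSketch`, `Step`, `FILTER`, `PROJECT`) rather than the test's actual mocks: if a projection runs first and drops a column, a later filter on that column has nothing to evaluate. Whether this is exactly the situation the reordered test guards against is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical illustration; none of these names come from ksqlDB.
final class TransformOrderSketch {

  // One transform step: returns the (possibly changed) row, or empty to drop it.
  interface Step extends Function<Map<String, Object>, Optional<Map<String, Object>>> {}

  // Keeps rows whose COUNT column is an integer greater than 1; a missing column fails.
  static final Step FILTER = row -> {
    final Object count = row.get("COUNT");
    if (count instanceof Integer && (Integer) count > 1) {
      return Optional.of(row);
    }
    return Optional.empty();
  };

  // Projects the row down to the ID column only.
  static final Step PROJECT = row -> {
    final Map<String, Object> out = new LinkedHashMap<>();
    out.put("ID", row.get("ID"));
    return Optional.of(out);
  };

  // Applies the steps in order, stopping as soon as a step drops the row.
  static Optional<Map<String, Object>> apply(Map<String, Object> row, List<Step> steps) {
    Optional<Map<String, Object>> current = Optional.of(row);
    for (final Step step : steps) {
      current = current.flatMap(step);
    }
    return current;
  }

  public static void main(String[] args) {
    final Map<String, Object> row = new LinkedHashMap<>();
    row.put("ID", "a");
    row.put("COUNT", 3);

    // Project first: COUNT is gone before the filter runs, so the row is dropped.
    System.out.println(apply(row, List.of(PROJECT, FILTER)).isPresent());
    // Filter first: the row passes and is then projected down to ID.
    System.out.println(apply(row, List.of(FILTER, PROJECT)));
  }
}
```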

Member

@AlanConfluent AlanConfluent left a comment

Nice PR and good test cases

Contributor

@guozhangwang guozhangwang left a comment

Thanks for the added testing coverage!

@cprasad1 cprasad1 merged commit f8a4609 into confluentinc:master Apr 5, 2021
@cprasad1 cprasad1 deleted the pull_filter_fix branch April 5, 2021 18:40
]
},
{
"name": "persistent query with KEY filter and projection +++ pull query table scan and single key lookup ***FAILURE***",
Contributor

nit: should we explicitly state the root cause of the error here? Otherwise the name is exactly the same as the one above, except that we say it is a ***FAILURE*** case.

cprasad1 added a commit to cprasad1/ksql that referenced this pull request Apr 5, 2021
…ava` to fix pull query filtering (confluentinc#7336)

* passes unit tests

* semantic

* start adding mores RQTT

* start adding mores RQTT 2

* start adding mores RQTT 3 all greeeeen

* start adding mores RQTT 3 all greeeeen MAX COMPLEX

* FINISH RQTT

* added windowed table tests

* modified tests

* add small comment

* add small comment fix checkstyle

Co-authored-by: Chittaranjan Prasad <>
"name": "windowed - select star with HAVING filter",
"statements": [
"CREATE STREAM INPUT (ID STRING KEY, IGNORED INT) WITH (kafka_topic='test_topic', value_format='JSON');",
"CREATE TABLE AGGREGATE AS SELECT ID, COUNT(1) AS COUNT FROM INPUT WINDOW TUMBLING(SIZE 1 SECOND) GROUP BY ID HAVING COUNT(1) > 1;",
Contributor Author

@guozhangwang is this the test similar to what you are interested in?

Contributor

Ah yes, thanks!

cprasad1 added a commit that referenced this pull request Apr 5, 2021
…ava` to fix pull query filtering (#7336) (#7342)

Successfully merging this pull request may close these issues.

Pull queries return 0 rows when there is a key filter node in the persistent query topology
3 participants