
KAFKA-3902 #1556

Closed · wants to merge 22 commits into trunk from phderome/DEROME-3902

Conversation

@phderome (Contributor)

The contribution is my original work and I license the work to the project under the project's open source license.

Contributors: Guozhang Wang, Phil Derome
@guozhangwang

Added checkEmpty to validate that the processor does nothing, and added an inhibit check for filter to fix the issue.
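
For context, a rough sketch of what such a checkEmpty helper could look like in KTableFilterTest (the helper name comes from this description; the body below is an assumption, relying on the processed list that org.apache.kafka.test.MockProcessorSupplier exposes):

    // Hypothetical helper: assert that the downstream mock processor
    // received nothing, i.e. the filter forwarded no records at all.
    private <K, V> void checkEmpty(MockProcessorSupplier<K, V> proc) {
        assertEquals(0, proc.processed.size());
    }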

dguy and others added 4 commits June 24, 2016 08:09
Added checkEmpty to validate processor does nothing and added an inhibit check for filter to fix the issue.
Added checkEmpty to validate processor does nothing and added an inhibit check for filter to fix the issue.
Fixed some kstream internals unit tests.
@phderome (Contributor Author)

Added modifications to integration/internals unit tests.

@phderome (Contributor Author)

Merged apache:kafka trunk into my changes in this PR so that it merges properly with trunk.

@@ -28,7 +28,7 @@
     private final Predicate<K, V> predicate;
     private final boolean filterNot;

-    private boolean sendOldValues = false;
+    private boolean sendOldValues = true;
Contributor

My concern is that the overhead of requesting the source KTable to be materialized (i.e. creating a state store, and sending the {old -> new} pair instead of the new value only) may be overwhelming compared with its potential benefit of reducing the downstream traffic.

Contributor Author

I am looking for your guidance here. On the one hand, I think your point has merit, and I have difficulty assessing the tradeoff given my limited knowledge of app usage and app internals; on the other hand, I'd be very curious to hear what comparable projects do in a similar situation (I read Samza's demonstration of materialized tables and its user-clicks example, so perhaps you have familiarity with Samza's or others' design decisions, if they are in sync with Kafka Streams; you might have worked on that project as well). Specifically, I wonder whether the rationale you presented for case 2 in the JIRA statement (sendOldValues=false) really requires sending keys with nulls when the filter evaluates to false (is that how Samza would do it too?). If case 2's rationale can be challenged, then this ticket can have a meaningful outcome; otherwise it seems that this ticket cannot lead to a behaviour change.

I'd be happy to reset this to false if you find that more advisable. But then, I don't see any hook at a higher level of abstraction to enable sending old values so that, for instance, the RegionView example from Confluent would work without a final filter to exclude nulls on the Long deserializer.

Contributor

Samza does not have a high-level DSL interface, so it does not have this issue.

This issue is related to the more general question of when we should materialize a KTable object with a state store; currently the rule is that stateless operators like filters should not necessarily materialize it. That being said, if a further downstream operator is stateful, it will cause backward propagation of setting enableSendOldValues to true up to the source KTable, as I mentioned in the email thread.
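
A simplified sketch of that backward propagation, as an illustration only (the names below are invented; the real mechanism lives in KTableImpl's internal enableSendingOldValues, which is not exposed to DSL users):

    // A stateless node such as a filter cannot compute old values itself,
    // so a request from a stateful downstream operator is propagated up
    // the chain until it reaches the source KTable, which then
    // materializes a state store and starts emitting {old -> new} pairs.
    interface TableNode {
        void enableSendingOldValues();
    }

    class FilterNode implements TableNode {
        private final TableNode parent;
        private boolean sendOldValues = false;

        FilterNode(TableNode parent) {
            this.parent = parent;
        }

        @Override
        public void enableSendingOldValues() {
            parent.enableSendingOldValues(); // propagate up toward the source
            sendOldValues = true;            // start consuming {old -> new} pairs
        }
    }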

phderome added 2 commits June 27, 2016 21:09
…ssed. We could/should add unit test cases when aggregation is used and see that nulls are suppressed when filtering.
@phderome (Contributor Author)

If you like it as is so far, we could also consider more unit testing for aggregation triggering the suppression of nulls, and possibly bringing a non-lambda version of the UserRegion example from Confluent into Apache (here), assuming it would work as expected.

…avoiding unpleasant possible side effects due to state when running multiple tests together.
@phderome (Contributor Author)

I don't understand why the last commit causes a regression. On my local host the same test, PlaintextProducerSendTest, passes fine. Also, a "git pull https://github.com/apache/kafka trunk" informs me that I am up to date.

@guozhangwang (Contributor)

Yeah, it would be a good idea to add more test cases in KTableFilterTest for this scenario.

If you think it is a transient failure, you can close / re-open this PR to trigger another Jenkins build and see if it passes.

phderome added 3 commits June 28, 2016 22:56
…ingful materialization filter test to avoid setting enableSendingOldValues, which is not available to DSL API user.
…ingful materialization filter test to avoid setting enableSendingOldValues, which is not available to DSL API user.
@phderome (Contributor Author)

I am keeping your second fix of if (change.oldValue == null && change.newValue == null), as in my view it suppresses unnecessary nulls when there is no materialization happening. I kept comments saying "Guozhang's second fix" specifically to help your review, but I will remove those comments once you have seen them. I also added an aggregation materialization unit case.

@phderome (Contributor Author)

New unit test as of this morning.

Thanks for the transient-failure tip.

phil

@@ -77,7 +77,9 @@ public void process(K key, Change<V> change) {
     V newValue = computeValue(key, change.newValue);
     V oldValue = sendOldValues ? computeValue(key, change.oldValue) : null;

-    if (sendOldValues && oldValue == null && newValue == null) return; // unnecessary to forward here.
+    if ((sendOldValues && oldValue == null && newValue == null) ||
+        (change.oldValue == null && change.newValue == null)) return; // unnecessary to forward here.
Contributor

Actually, I think we do not need the second condition here, since a better fix would be in the mapValues function: do not forward if both are null, instead of checking them here.

Also, a minor comment: use a new line for the return for better debugging, and move the comment on top of line 80 as "if both the new and old values are null after the filtering, do not need to forward downstream anymore".
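
A minimal sketch of the suggested shape, assuming the surrounding KTableFilter.process code from the diffs above (the forward call is illustrative):

    // if both the new and old values are null after the filtering,
    // do not need to forward downstream anymore.
    if (oldValue == null && newValue == null)
        return;

    context().forward(key, new Change<>(newValue, oldValue));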

Contributor

It seems the minor comment is not addressed in your latest PR.

Contributor Author

I confirm that you are absolutely right: as your last mail pointed out, trying to remove sendOldValues from the test was an error, and I had misunderstood the implication. I also appreciate the debug-friendly syntax of putting return on the next line; I'll adjust that. I'll also remove the change test, as per your comment above spelling out that you want to handle that in the mapValues function. Hopefully things will be sorted out within an hour or so.


@phderome (Contributor Author)

There is no longer a test on change.oldValue/newValue being null, and no longer a test on sendOldValues to filter out a null (so the same logic applies to cases #2 and #3 instead of just #3); accordingly, the unit tests now filter out nulls more aggressively on tables that use filter.

@@ -77,6 +77,7 @@ public void process(K key, Change<V> change) {
     V newValue = computeValue(key, change.newValue);
     V oldValue = sendOldValues ? computeValue(key, change.oldValue) : null;

+    if (oldValue == null && newValue == null) return; // unnecessary to forward here.
Contributor

I'm not sure if we can optimize the case when sendOldValues is false as well. Following your example:

KTable<String, Integer> T1 = builder.table("source-topic");
KTable<String, Integer> T2 = T1.filter((key, value) -> value > 2);
T2.to("sink-topic");

And suppose the "source-topic" is piping the messages to T1 as: {a: 3}, {b: 5}, {a: 1}...

When {a: 3} is passed from T1 to T2, the filter will pass and hence it is forwarded to downstream operators already; so when {a: 1} is later passed from T1 to T2, meaning "modifying the value with key {a} from 3 to 1", the filter will not pass anymore, and hence in this case we need to forward an {a: null} record downstream in order to indicate that the previously forwarded {a: 3} has now been deleted in T2, right? Otherwise the sink topic will have the following messages:

{a: 3}, {b: 5}, ...

whereas the right sequence should be

{a: 3}, {b: 5}, {a: null}, ...
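
A self-contained sketch of that tombstone rule in plain Java, simulating the filter outside the Streams runtime (all names here are invented for illustration):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class FilterTombstoneSketch {
        public static void main(String[] args) {
            Map<String, Integer> passing = new HashMap<>(); // keys currently passing the filter
            List<String> sink = new ArrayList<>();

            String[] keys = {"a", "b", "a"};
            int[] values = {3, 5, 1};

            for (int i = 0; i < keys.length; i++) {
                if (values[i] > 2) { // the filter predicate
                    passing.put(keys[i], values[i]);
                    sink.add("{" + keys[i] + ": " + values[i] + "}");
                } else if (passing.remove(keys[i]) != null) {
                    // the key passed before but fails now: emit a tombstone
                    sink.add("{" + keys[i] + ": null}");
                }
            }
            System.out.println(sink); // prints [{a: 3}, {b: 5}, {a: null}]
        }
    }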

Contributor Author

A clear explanation, which I validated with a temporary test that proved it was indeed a wrong idea: with my earlier commit, the deletion failed to occur when the old value was not copied.

So, it's back to the original solution, with return on the 2nd line, and a unit test demonstrating the usage of sendOldValues indirectly via the materialization caused by aggregation.

@phderome (Contributor Author)

phderome commented Jul 1, 2016

@guozhangwang your suggestions have now been integrated as is, as far as I can judge. The most recent build is to bring my branch DEROME-3902 up to date with recent trunk commits.

new Predicate<String, String>() {
    @Override
    public boolean test(String key, String value) {
        return value.compareToIgnoreCase("accept") == 0;
Contributor

Could use equalsIgnoreCase directly.
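
That is, the predicate body becomes the one-liner that the updated test below adopts:

    return value.equalsIgnoreCase("accept");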

Contributor Author

Will do.

@guozhangwang (Contributor)

LGTM overall except some comments on the unit tests.

@guozhangwang (Contributor)

@phderome I may have been misleading in my previous comment regarding the mock; we can actually use org.apache.kafka.test.NoOpKeyValueMapper here to reduce duplicated code. Does that sound good to you?

@phderome (Contributor Author)

phderome commented Jul 1, 2016

@guozhangwang regarding NoOpKeyValueMapper for groupBy, I am favourable, as it leads to a simpler and more concise example. I am not sure the reduce can be simplified further. See for instance:

KTableImpl<String, String, String> table1 =
        (KTableImpl<String, String, String>) builder.table(stringSerde, stringSerde, topic1);
KTableImpl<String, String, String> table2 = (KTableImpl<String, String, String>) table1.filter(
        new Predicate<String, String>() {
            @Override
            public boolean test(String key, String value) {
                return value.equalsIgnoreCase("accept");
            }
        }).groupBy(MockKeyValueMapper.<String, String>NoOpKeyValueMapper())
        .reduce(MockReducer.STRING_ADDER, MockReducer.STRING_REMOVER, "mock-result");

asfgit pushed a commit that referenced this pull request Jul 1, 2016
… if both old and new values are null

The contribution is my original work and I license the work to the project under the project's open source license.

Contributors: Guozhang Wang, Phil Derome
guozhangwang

Added checkEmpty to validate processor does nothing and added an inhibit check for filter to fix the issue.

Author: Philippe Derome <[email protected]>
Author: Phil Derome <[email protected]>
Author: Damian Guy <[email protected]>

Reviewers: Guozhang Wang <[email protected]>

Closes #1556 from phderome/DEROME-3902

(cherry picked from commit 2098529)
Signed-off-by: Guozhang Wang <[email protected]>
@asfgit closed this in 2098529 on Jul 1, 2016
@guozhangwang (Contributor)

The latest patch LGTM; merged to both trunk and 0.10.0. Thanks @phderome!

@phderome (Contributor Author)

phderome commented Jul 2, 2016

Thanks @guozhangwang for all the patient help. This is my first open source commit!

@@ -77,6 +77,9 @@ public void process(K key, Change<V> change) {
     V newValue = computeValue(key, change.newValue);
     V oldValue = sendOldValues ? computeValue(key, change.oldValue) : null;

+    if (sendOldValues && oldValue == null && newValue == null)
@miguno (Contributor) · Jul 4, 2016

Just to double-check, because the JIRA ticket is a bit difficult to follow: is this single condition sufficient to cover the following discussion/flow in the JIRA ticket?

  1. If "send old value" is enabled, then there are a couple of cases we can consider:

a. If old value is <key: null> and new value is <key: not-null>, and the filter predicate return false for the new value, then in this case it is safe to optimize and not returning anything to the downstream operator, since in this case we know there is no value for the key previously anyways; otherwise we send the original pair.

b. If old value is <key: not-null> and new value is <key: null>, indicating to delete this key, and the filter predicate return false for the old value, then in this case it is safe to optimize and not returning anything to the downstream operator, since we know that the old value has already been filtered in a previous message; otherwise we send the original pair.

c. If both old and new values are not null, and:

  1. predicate return true on both, send the original pair;
  2. predicate return false on both, we can optimize and do not send anything;
  3. predicate return true on old and false on new, send the key: {old -> null};
  4. predicate return false on old and true on new, send the key: {null -> new};

Contributor

With /cc @guozhangwang

Contributor

Yes, it covers all the cases above, which can be summarized as "if both new and old values are null".
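
A hedged sketch of why the cases collapse to that single check, mirroring the null-safe filtering of computeValue from the diffs above (the forward call is illustrative, and filterNot is ignored for brevity):

    // Each side is filtered independently: a null input stays null, and a
    // non-null value that fails the predicate becomes null.
    V newValue = (change.newValue != null && predicate.test(key, change.newValue))
            ? change.newValue : null;
    V oldValue = (sendOldValues && change.oldValue != null && predicate.test(key, change.oldValue))
            ? change.oldValue : null;

    if (oldValue == null && newValue == null)
        return; // covers cases a, b, and c.2 above

    context().forward(key, new Change<>(newValue, oldValue)); // cases c.1, c.3, c.4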
