
KAFKA-6234: Increased timeout value for lowWatermark response to avoid test failing occasionally #4238

Merged: 4 commits merged into apache:trunk on Apr 12, 2018

Conversation

@soenkeliebau (Contributor)

Increase timeout to fix flaky integration test testLogStartOffsetCheckpoint.

@asfgit commented Nov 20, 2017

SUCCESS: 8083 tests run, 5 skipped, 0 failed.

@asfgit commented Nov 20, 2017

FAILURE: 7975 tests run, 5 skipped, 1 failed.

@asfgit commented Nov 20, 2017

SUCCESS: 8083 tests run, 5 skipped, 0 failed.

@ijuma (Member) left a comment

Thanks for the PR, left one comment.

@@ -759,7 +759,7 @@ class AdminClientIntegrationTest extends IntegrationTestHarness with Logging {

      val future = result.lowWatermarks().get(topicPartition)
      try {
-       lowWatermark = future.get(1000L, TimeUnit.MILLISECONDS).lowWatermark()
+       lowWatermark = future.get(5000L, TimeUnit.MILLISECONDS).lowWatermark()
Member:

We should use the get method without a timeout. waitUntilTrue has its own timeout, so we don't need to have another one here.
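
For illustration, a minimal, self-contained sketch of the pattern being suggested: let the surrounding wait loop own the deadline and call `get()` with no per-call timeout. The `waitUntilTrue` below is a simplified stand-in for `kafka.utils.TestUtils.waitUntilTrue`, and the `CompletableFuture` is a hypothetical stand-in for the `DeleteRecordsResult` future; neither is the test's actual code.

```scala
import java.util.concurrent.CompletableFuture

object GetWithoutTimeoutSketch extends App {
  // Simplified stand-in for kafka.utils.TestUtils.waitUntilTrue, not the real utility.
  def waitUntilTrue(condition: () => Boolean, msg: => String,
                    waitTimeMs: Long = 15000L, pauseMs: Long = 100L): Unit = {
    val deadline = System.currentTimeMillis() + waitTimeMs
    while (!condition()) {
      if (System.currentTimeMillis() > deadline) throw new AssertionError(msg)
      Thread.sleep(pauseMs)
    }
  }

  // Hypothetical future standing in for result.lowWatermarks().get(topicPartition).
  val future = CompletableFuture.completedFuture(java.lang.Long.valueOf(5L))

  waitUntilTrue(() => {
    // get() with no per-call timeout: the surrounding wait loop is what bounds the test.
    future.get().longValue() == 5L
  }, "low watermark was never reported as 5")
}
```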

@soenkeliebau (Contributor, Author)

Thanks for the comment, Ismael, that does indeed make much more sense. I have updated the PR.

@asfgit commented Nov 20, 2017

FAILURE: 7975 tests run, 5 skipped, 2 failed.

@asfgit commented Nov 20, 2017

SUCCESS: 8083 tests run, 5 skipped, 0 failed.

1 similar comment
@asfgit commented Nov 20, 2017

SUCCESS: 8083 tests run, 5 skipped, 0 failed.

@soenkeliebau (Contributor, Author)

Not sure how that test failure came to pass; the exception should have been caught and retried. Apparently the test has transient failures for more than one scenario. Could this be a mix-up of the Scala and Java versions of LeaderNotAvailableException, so that the catch is ignored?

@ijuma (Member) commented Nov 20, 2017

The exception type is ExecutionException and LeaderNotAvailableException is the cause. The test needs to be fixed.

@@ -759,7 +759,7 @@ class AdminClientIntegrationTest extends IntegrationTestHarness with Logging {

      val future = result.lowWatermarks().get(topicPartition)
      try {
-       lowWatermark = future.get(1000L, TimeUnit.MILLISECONDS).lowWatermark()
+       lowWatermark = future.get().lowWatermark()
Contributor:

Thanks for looking into this @soenkeliebau. I agree with your reasoning on the JIRA. But as for the fix, I think it is better to still set a timeout equal to the default value of waitUntilTrue (15 seconds), because otherwise we may be blocked for longer than the limit specified in waitUntilTrue, and then still catch the TimeoutException internally and return false.

cc @cmccabe

Member:

Where do we catch the TimeoutException internally? Using a timeout in get is a bit of an anti-pattern; set the request timeout in AdminClient appropriately.

Contributor:

1. Regarding setting the timeout: that's right, as long as we set the timeout config on the AdminClient that should be sufficient; future.get() without a parameter would still throw after the configured timeout (see the sketch below).
2. We should still catch the TimeoutException from future.get() and return false.
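
For illustration, a hedged sketch of the "set the request timeout on the AdminClient" idea, so that even a bare `future.get()` is bounded. The bootstrap address and the 20-second value are assumptions for the example, not taken from the test harness.

```scala
import java.util.Properties
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig}

object AdminClientTimeoutSketch extends App {
  val props = new Properties()
  props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // assumed address
  // Bound how long a bare future.get() on an admin request can block (20 s here).
  props.put(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG, "20000")
  val client = AdminClient.create(props)
  client.close()
}
```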

Contributor (Author):

I've added some code to unnest the LeaderNotAvailableException and rethrow if anything else is nested, and to catch the TimeoutException and return false.
As far as I can tell, the timeout is set to 20 seconds in AdminClient and defaults to 15 seconds for waitUntilTrue, so the timeout should never occur and we are fine; if it does, we handle it.
I'm not sure whether it wouldn't be better to state the timeouts explicitly in this case, though, to make the purpose of the test easier to understand without digging through default timeout values. Then again, I could just add a comment to that effect :) I'll wait for your input before I push more commits.

Member:

I don't think we want to catch the TimeoutException, since its timeout is longer than waitUntilTrue's and it's a different failure mode than waitUntilTrue timing out because LeaderNotAvailableException happened for too long.

Contributor (Author):

OK, I've removed the check for TimeoutException, since Guozhang, who was advocating catching it, hasn't weighed in again.

@@ -759,7 +759,7 @@ class AdminClientIntegrationTest extends IntegrationTestHarness with Logging {

      val future = result.lowWatermarks().get(topicPartition)
      try {
-       lowWatermark = future.get(1000L, TimeUnit.MILLISECONDS).lowWatermark()
+       lowWatermark = future.get().lowWatermark()
Member:

I don't think we want to catch the TimeoutException, since its timeout is longer than waitUntilTrue's and it's a different failure mode than waitUntilTrue timing out because LeaderNotAvailableException happened for too long.

          lowWatermark == 5L
        } catch {
          case e: LeaderNotAvailableException => false
          case e: TimeoutException => false
          case e: ExecutionException => {
Member:

You can do this as:

case e: ExecutionException if e.getCause == LeaderNotAvailableException => false

Contributor (Author):

I've refactored as per your suggestion; however, the == didn't work as a comparison for me since "LeaderNotAvailableException is not a value", so I changed it to isInstanceOf[].
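
A minimal sketch of the guard described here; the `isInstanceOf` check is needed because `LeaderNotAvailableException` names a type, not a value, so `==` cannot be used against it. The helper name and its parameter are illustrative, not the test's actual code.

```scala
import java.util.concurrent.ExecutionException
import org.apache.kafka.common.errors.LeaderNotAvailableException

object UnwrapCauseSketch {
  // Returns false (i.e. "retry") when the future failed because the leader was
  // not yet available; any other failure propagates and fails the test.
  def leaderOffsetsDeleted(check: () => Boolean): Boolean =
    try check()
    catch {
      case e: ExecutionException if e.getCause.isInstanceOf[LeaderNotAvailableException] => false
    }
}
```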

@ijuma (Member) commented Nov 23, 2017

retest this please

1 similar comment
@ijuma (Member) commented Nov 23, 2017

retest this please

@soenkeliebau (Contributor, Author)

I had high hopes this time :)

@ijuma (Member) commented Nov 23, 2017

retest this please

-         case e: LeaderNotAvailableException => false
+         case e: ExecutionException if e.getCause.isInstanceOf[LeaderNotAvailableException] => false
        }
Member:

The test is failing when the cause is NotLeaderForPartitionException. Is that the exception that you meant to use? Or do we need to catch both?

Contributor (Author):

We've also seen it fail with LeaderNotAvailableException in the very beginning, so we probably need to catch both exceptions here. To be honest, I never really questioned the original code that caught only LeaderNotAvailableException and simply fixed the unnesting of the exception.

Perhaps it would make sense to check for RetriableException instead?
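
For reference, a small sketch of what the RetriableException-based check floated here could look like (the PR ultimately kept the narrower leader-related check); the helper name is hypothetical.

```scala
import java.util.concurrent.ExecutionException
import org.apache.kafka.common.errors.RetriableException

object RetriableCheckSketch {
  // Broader alternative: treat any retriable cause as "try again".
  def shouldRetry(t: Throwable): Boolean = t match {
    case e: ExecutionException => e.getCause.isInstanceOf[RetriableException]
    case _: RetriableException => true
    case _                     => false
  }
}
```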

Contributor (Author):

I've changed the code to test for a retriable exception, and as far as I can tell it does what it is supposed to do. This should also cover all other relevant cases that allow us to resubmit the delete request. For anything else, I think it is fine to fail the test.

I've also rebased and squashed.

Member:

I think we should stick to checking the two leader related exceptions since those are the only ones we expect to be thrown in this case.

Contributor (Author):

I'm happy to change of course. I do wonder whether we might make the test too narrow though. If we encounter a retriable exception, retry and get the correct result, should we really fail the test?

Member:

Yes, if we are throwing a retriable exception that is unexpected, the test should fail. We can then check if we made a mistake in the test or the code.

@soenkeliebau (Contributor, Author)

Not sure what happened with the JDK 7 test failure; it was this test case failing, but it didn't look like an exception issue when I glanced at it. Jenkins currently won't speak to me, so I can't look at it in detail; I will check back later.

@ijuma (Member) commented Nov 24, 2017

The latest error was because the low watermark was not 5. We should include the actual watermark in the error message. I'm going to do a separate PR to disable this test while we try to figure it out.

ijuma added a commit to ijuma/kafka that referenced this pull request on Nov 24, 2017:

It's failing often and it seems like there are multiple reasons. PR apache#4238 will re-enable it.
@soenkeliebau (Contributor, Author)

Looking through the log, I was unsure whether we actually got a different low watermark or just ran into the timeout for waitUntilTrue because exceptions kept us from ever getting to the actual check.

I'll include the returned value in the error message. We probably need to reset it before the second run, as it is still set to 5 from the first test; if we run into the timeout due to exceptions, the message would read "expected 5 but got 5", which might not help much :)
I'll look into setting an uninitialized value so we can test for that and print a specific error message that distinguishes between "got a wrong value" and "got no value at all".

@ijuma (Member) commented Nov 24, 2017

Take a look at TestUtils.computeUntilTrue if waitUntilTrue makes it difficult to include the value in the error.
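
Roughly, the appeal of computeUntilTrue is that it hands back the last computed value along with the success flag, so the value can go straight into the failure message. The block below is a simplified stand-in for the idea, not the actual kafka.utils.TestUtils.computeUntilTrue.

```scala
object ComputeUntilTrueSketch {
  // Simplified stand-in: repeatedly evaluate `compute` until the predicate holds
  // or the deadline passes, and return the last value together with the outcome.
  def computeUntilTrue[T](compute: => T, waitMs: Long = 15000L, pauseMs: Long = 100L)
                         (predicate: T => Boolean): (T, Boolean) = {
    val deadline = System.currentTimeMillis() + waitMs
    var result = compute
    while (!predicate(result) && System.currentTimeMillis() < deadline) {
      Thread.sleep(pauseMs)
      result = compute
    }
    (result, predicate(result))
  }
}
```

Usage would then look something like `val (lowWatermark, reached) = computeUntilTrue(readLowWatermark())(_ == 5L)`, with `readLowWatermark` a hypothetical helper wrapping the deleteRecords future, and the assertion message interpolating `$lowWatermark`.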

@soenkeliebau (Contributor, Author)

I've added some detail to the error message and reverted the catch back to just the two leader-related exceptions.
It's not what I'd call beautiful code; there is probably some nice Scala-y way of doing the check for two exceptions, but to be honest I couldn't come up with anything nicer after googling for a while.
Let me know what you think of the extended error message.

asfgit pushed a commit that referenced this pull request on Nov 24, 2017:

It's failing often and it seems like there are multiple reasons. PR #4238 will re-enable it.

Author: Ismael Juma <[email protected]>
Reviewers: Rajini Sivaram <[email protected]>

Closes #4262 from ijuma/temporarily-disable-test-log-start-offset-checkpoint
@ijuma (Member) commented Dec 20, 2017

retest this please

1 similar comment
@soenkeliebau (Contributor, Author)

retest this please

@soenkeliebau (Contributor, Author)

Hmm, it worked fine for two consecutive runs now, but a little while ago the JDK 7 test consistently failed with the same code. Not sure whether something else was changed (I couldn't find anything, but I only spent five minutes looking, so I may very well have missed it) or whether this was caused by something else entirely. I could never reproduce the test failure locally, but back when it failed it looked like SSL issues in the test log. Could this have been caused by load on the build server and some component taking too long to become ready?

@soenkeliebau (Contributor, Author)

retest this please

@soenkeliebau (Contributor, Author)

It seems to be back in working order for now, unless someone has an idea of what might have caused the JDK 7 tests to consistently fail back in November.
Are there any other comments on the PR itself? Otherwise I'd suggest merging and keeping an eye on whether the test starts acting up again.

@soenkeliebau (Contributor, Author)

retest this please

3 similar comments
@hachikuji

retest this please

@soenkeliebau (Contributor, Author)

retest this please

@soenkeliebau (Contributor, Author)

retest this please

@hachikuji left a comment

Sorry for the delay on review. Left a couple minor comments. Once addressed, I'm inclined to merge and see if the test case stabilizes.

@@ -759,15 +758,17 @@ class AdminClientIntegrationTest extends IntegrationTestHarness with Logging {
        // Need to retry if leader is not available for the partition
        result = client.deleteRecords(Map(topicPartition -> RecordsToDelete.beforeOffset(0L)).asJava)
+       lowWatermark = Long.MinValue


nit: I think it would be cleaner to use an Option
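
A sketch of the Option-based shape this nit is asking for, assuming a hypothetical `readLowWatermark()` standing in for `result.lowWatermarks().get(topicPartition).get().lowWatermark()`; this is an illustration, not the test's final code.

```scala
import java.util.concurrent.ExecutionException
import org.apache.kafka.common.errors.{LeaderNotAvailableException, NotLeaderForPartitionException}

object OptionWatermarkSketch {
  // Holds the last observed low watermark, or None if we never got one.
  var lowWatermark: Option[Long] = None

  def reachedTarget(readLowWatermark: () => Long): Boolean =
    try {
      lowWatermark = Some(readLowWatermark())
      lowWatermark.contains(5L)
    } catch {
      case e: ExecutionException if e.getCause.isInstanceOf[LeaderNotAvailableException] ||
                                    e.getCause.isInstanceOf[NotLeaderForPartitionException] => false
    }
}
```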

        val future = result.lowWatermarks().get(topicPartition)
        try {
-         lowWatermark = future.get(1000L, TimeUnit.MILLISECONDS).lowWatermark()
+         lowWatermark = future.get().lowWatermark()


I wonder if it's still useful to pass a timeout to get() to avoid blocking longer than the timeout passed to waitUntil. We could use JTestUtils.DEFAULT_MAX_WAIT_MS or manually set a specific timeout.

nit: no need for parenthesis after lowWatermark.
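
For illustration, what a bounded get() along the lines of this comment might look like. The CompletableFuture is a stand-in for the DeleteRecordsResult future, and JTestUtils is assumed to be the usual alias for org.apache.kafka.test.TestUtils, whose DEFAULT_MAX_WAIT_MS is the 15-second default waitUntilTrue also uses.

```scala
import java.util.concurrent.{CompletableFuture, TimeUnit}
import org.apache.kafka.test.{TestUtils => JTestUtils}

object BoundedGetSketch extends App {
  // Stand-in for the DeleteRecordsResult future from the diff above.
  val future = CompletableFuture.completedFuture(java.lang.Long.valueOf(5L))
  // Bound the blocking call by the same ceiling waitUntilTrue uses by default.
  val lowWatermark = future.get(JTestUtils.DEFAULT_MAX_WAIT_MS, TimeUnit.MILLISECONDS)
  println(lowWatermark)
}
```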

Member:

Note that requestTimeout is 20 seconds, so I think get is OK.

          case e: ExecutionException if e.getCause.isInstanceOf[LeaderNotAvailableException] ||
            e.getCause.isInstanceOf[NotLeaderForPartitionException] => false
        }
      }, "Expected low watermark of the partition to be 5 but got ".concat(


nit: it's a bit more idiomatic to use string interpolation. If we make lowWatermark an Option, then we can simplify this to just

s"Expected low watermark of the partition to be 5 but got $lowWatermark"

@soenkeliebau (Contributor, Author)

Thanks for your comments, @hachikuji.
I have hopefully addressed them in the last commit. I've changed your proposed log message a little, but I think it still looks cleaner than before.

-     var lowWatermark = result.lowWatermarks().get(topicPartition).get().lowWatermark()
-     assertEquals(5L, lowWatermark)
+     var lowWatermark = Option(result.lowWatermarks.get(topicPartition).get.lowWatermark)
+     assertTrue(lowWatermark.contains(5L))
Member:

It's a bit better to use assertEquals(Some(5), lowWatermark) so that you get a good error message.

Contributor (Author):

Ok, I'll change that once Jason has chimed in so that I can address his comments as well if he has any.

@hachikuji left a comment

Thanks, LGTM. Just had a minor nitpick.

@@ -743,8 +742,8 @@ class AdminClientIntegrationTest extends IntegrationTestHarness with Logging {

      sendRecords(producers.head, 10, topicPartition)
      var result = client.deleteRecords(Map(topicPartition -> RecordsToDelete.beforeOffset(5L)).asJava)
-     var lowWatermark = result.lowWatermarks().get(topicPartition).get().lowWatermark()
-     assertEquals(5L, lowWatermark)
+     var lowWatermark = Option(result.lowWatermarks.get(topicPartition).get.lowWatermark)


nit: Could we use Some? Same below.

@soenkeliebau (Contributor, Author) commented Apr 12, 2018

I initially used Some(..) but subsequently learned that Some(null) is not None, whereas Option(null) is None and should be preferred, so I thought it better to use Option here (not that I'd expect lowWatermark to return null, but you never really know).

But I only read about this today, so this might be misguided :)
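
A quick REPL illustration of the Some(null) versus Option(null) distinction being described:

```
scala> Some(null)
res0: Some[Null] = Some(null)

scala> Option(null)
res1: Option[Null] = None
```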

Reviewer:

Yeah, that makes sense, but in this case lowWatermark is a long which cannot be null.

Contributor (Author):

Fair point, I'll push a commit with this fixed and @ijuma's concern addressed.

@hachikuji

Thanks for the updates. I'll merge once the builds complete (assuming no problems).

@hachikuji merged commit 886daf5 into apache:trunk on Apr 12, 2018
ying-zheng pushed a commit to ying-zheng/kafka that referenced this pull request Jul 6, 2018
…transient failures (apache#4238)

Removed timeout from get call that caused the test to fail occasionally, this will instead fall back to the wrapping waitUntilTrue timeout. Also added unnesting of exceptions from ExecutionException that was originally missing and put the retrieved value for lowWatermark in the fail message for better readability in case of test failure.

Reviewers: Ismael Juma <[email protected]>, Jason Gustafson <[email protected]>
mkedwards pushed a commit to bitpusherllc/kafka that referenced this pull request Feb 11, 2019
…transient failures (apache#4238)
mkedwards pushed a commit to nexiahome-orig/kafka that referenced this pull request Jul 9, 2019
…transient failures (apache#4238)