-
Notifications
You must be signed in to change notification settings - Fork 14.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KAFKA-6234: Increased timeout value for lowWatermark response to avoid test failing occasionally #4238
Conversation
SUCCESS |
FAILURE |
SUCCESS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR, left one comment.
@@ -759,7 +759,7 @@ class AdminClientIntegrationTest extends IntegrationTestHarness with Logging { | |||
|
|||
val future = result.lowWatermarks().get(topicPartition) | |||
try { | |||
lowWatermark = future.get(1000L, TimeUnit.MILLISECONDS).lowWatermark() | |||
lowWatermark = future.get(5000L, TimeUnit.MILLISECONDS).lowWatermark() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should use the get
method without a timeout. waitUntilTrue
has its own timeout, so we don't need to have another one here.
06eb877
to
a218fe9
Compare
Thanks for the comment Ismael, that does indeed make much more sense. Have updated the PR. |
FAILURE |
SUCCESS |
1 similar comment
SUCCESS |
Not sure how that test failure came to pass, the exception should have been caught and retried. Apparently the test has transient failures for more than one scenario. Is this potentially a mixup of the scala and java versions of LeaderNotAvailableExceptions and thus the catch is ignored? |
The exception type is |
@@ -759,7 +759,7 @@ class AdminClientIntegrationTest extends IntegrationTestHarness with Logging { | |||
|
|||
val future = result.lowWatermarks().get(topicPartition) | |||
try { | |||
lowWatermark = future.get(1000L, TimeUnit.MILLISECONDS).lowWatermark() | |||
lowWatermark = future.get().lowWatermark() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for looking into this @soenkeliebau . I agree with your reasoning on the JIRA. But I think as for the fix it is better to still set the timeout which is equal to the default value of waitUntilTrue
(it is 15secs), because otherwise we may be blocked more than what is specified as the limit in waitUntilTrue
, and then still catch the TimeoutException
internally and return false.
cc @cmccabe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where do we catch the TimeoutException
internally? Using a timeout in get
is a bit of an anti-pattern, set the request timeout in AdminClient
appropriately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Regarding set the timeout: that's right, as long as we set the timeout config in
adminClient
that should be sufficient, thatfuture.get()
without parameter would still throw after configured timeout. - We should still catch
TimeoutException
infuture.get()
and return false;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added some code to unnest the LeaderNotAvailableException and rethrow if anything else is nested. Also catch the TimeoutException and return false.
As far as I can tell, timeouts are set to 20 in AdminClient and defaults to 15 for waitUntilTrue, so the Timeout should never occur and we are fine - if it does, we handle it.
I'm not sure if in this case it wouldn't be better to explicitly state the timeouts though, to make the purpose of the test easier to understand without digging through default timeout values. Then again, I could just add a comment to that effect :) I'll wait for your input before I push more commits..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we want to catch the TimeoutException
since it's longer than waitUntilTrue
and it's a different failure mode than if waitUntilTrue
times out because LeaderNotAvailableException
happened for too long.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I've removed the check for TimeoutException, since Guozhang hasn't weighed in again and he was advocationg for catching it.
@@ -759,7 +759,7 @@ class AdminClientIntegrationTest extends IntegrationTestHarness with Logging { | |||
|
|||
val future = result.lowWatermarks().get(topicPartition) | |||
try { | |||
lowWatermark = future.get(1000L, TimeUnit.MILLISECONDS).lowWatermark() | |||
lowWatermark = future.get().lowWatermark() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we want to catch the TimeoutException
since it's longer than waitUntilTrue
and it's a different failure mode than if waitUntilTrue
times out because LeaderNotAvailableException
happened for too long.
lowWatermark == 5L | ||
} catch { | ||
case e: LeaderNotAvailableException => false | ||
case e: TimeoutException => false | ||
case e: ExecutionException => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can do this as:
case e: ExecutionException if e.getCause == LeaderNotAvailableException => false
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've refactored as per your suggestion, however the == didn't work as comparison for me since "LeaderNotAvailableException is not a value", so I changed it to isInstanceOf[].
retest this please |
1 similar comment
retest this please |
I had high hopes this time :) |
retest this please |
case e: LeaderNotAvailableException => false | ||
} | ||
|
||
case e: ExecutionException if e.getCause.isInstanceOf[LeaderNotAvailableException] => false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test is failing when the cause is NotLeaderForPartitionException
. Is that the exception that you meant to use? Or do we need to catch both?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've also seen it fail with LeaderNotAvailableException in the very beginning, so we probably need to catch both exceptions here. To be honest I never really questioned the original code that caught only LeaderNotAvailableException and simply fixed unnesting the exception.
Perhaps it would make sense to check for RetriableException instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed the code to test for a retriable exception and as far as I can tell it does what it is supposed to do. This should also cover all other relevant cases that allow us to resubmit the delete request I think. For anything else it is ok to fail the test I think.
I've also rebased and squashed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should stick to checking the two leader related exceptions since those are the only ones we expect to be thrown in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy to change of course. I do wonder whether we might make the test too narrow though. If we encounter a retriable exception, retry and get the correct result, should we really fail the test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, if we are throwing a retriable exception that is unexpected, the test should fail. We can then check if we made a mistake in the test or the code.
fd88669
to
54e6640
Compare
Not sure what happened with the jdk7 test failure, it was this test case failing, but didn't look like an exception issue when I glanced at it, currently jenkins won't speak to me, so can't look in detail, will check back later. |
The latest error was because the low watermark was not 5. We should include the actual watermark in the error message. I'm going to do a separate PR to disable this test while we try to figure it out. |
It's failing often and it seems like there are multiple reasons. PR apache#4238 will re-enable it.
Looking through the log I was unsure whether we actually got a different low watermark or just ran into the timeout for waitUntilTrue because Exceptions kept us from ever getting to the actual check. I'll include the returned value in the error message - we probably need to reset it before the second run, as it is still set to 5 from the first test - so if we run into the timeout due to exceptions the message would read "expected 5 but got 5" which might not help much :) |
Take a look at |
54e6640
to
b5b53ee
Compare
I've added some detail to the error message and reverted the catch back to just the two leader related exceptions. |
It's failing often and it seems like there are multiple reasons. PR #4238 will re-enable it. Author: Ismael Juma <[email protected]> Reviewers: Rajini Sivaram <[email protected]> Closes #4262 from ijuma/temporarily-disable-test-log-start-offset-checkpoint
…l occasionally, this will instead fall back to the wrapping waitUntilTrue timeout. Also added unnesting of exceptions from ExecutionException that was originally missing and put the retrieved value for lowWatermark in the fail message for better readability in case of test failure.
b5b53ee
to
5bd2cdc
Compare
retest this please |
1 similar comment
retest this please |
Hmm..it worked fine for two consecutive runs now, but a little while ago the JDK7 test consistently failed with the same code. Not sure if something else was changed (couldn't find anything, but only spent five minutes looking, so may very well have missed it) or if this was caused by something else entirely. I could never reproduce the test failure locally but it looked like ssl issues in the test log back when it failed, could this have been caused by load on the build server and some component taking too long to become ready? |
retest this please |
It seems to be back in working order for now. Unless someone has any idea of what might have caused the jdk7 tests to consistently fail back in November. |
retest this please |
3 similar comments
retest this please |
retest this please |
retest this please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay on review. Left a couple minor comments. Once addressed, I'm inclined to merge and see if the test case stabilizes.
@@ -759,15 +758,17 @@ class AdminClientIntegrationTest extends IntegrationTestHarness with Logging { | |||
// Need to retry if leader is not available for the partition | |||
result = client.deleteRecords(Map(topicPartition -> RecordsToDelete.beforeOffset(0L)).asJava) | |||
|
|||
lowWatermark = Long.MinValue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think it would be cleaner to use an Option
val future = result.lowWatermarks().get(topicPartition) | ||
try { | ||
lowWatermark = future.get(1000L, TimeUnit.MILLISECONDS).lowWatermark() | ||
lowWatermark = future.get().lowWatermark() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it's still useful to pass a timeout to get()
to avoid blocking longer than the timeout passed to waitUntil
. We could use JTestUtils.DEFAULT_MAX_WAIT_MS
or manually set a specific timeout.
nit: no need for parenthesis after lowWatermark
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that requestTimeout
is 20 seconds, so I think get
is OK.
case e: ExecutionException if e.getCause.isInstanceOf[LeaderNotAvailableException] || | ||
e.getCause.isInstanceOf[NotLeaderForPartitionException] => false | ||
} | ||
}, "Expected low watermark of the partition to be 5 but got ".concat( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: a bit more idiomatic to use string interpolation. If we make lowWatermark
an option, then we can simplify this to just
s"Expected low watermark of the partition to be 5 but got $lowWatermark"
…removed unnecessary () from function calls.
Thanks for your comments @hachikuji |
var lowWatermark = result.lowWatermarks().get(topicPartition).get().lowWatermark() | ||
assertEquals(5L, lowWatermark) | ||
var lowWatermark = Option(result.lowWatermarks.get(topicPartition).get.lowWatermark) | ||
assertTrue(lowWatermark.contains(5L)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit better to use assertEquald(Some(5), lowWatermark)
so that you get a good error message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'll change that once Jason has chimed in so that I can address his comments as well if he has any.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, LGTM. Just had a minor nitpick.
@@ -743,8 +742,8 @@ class AdminClientIntegrationTest extends IntegrationTestHarness with Logging { | |||
|
|||
sendRecords(producers.head, 10, topicPartition) | |||
var result = client.deleteRecords(Map(topicPartition -> RecordsToDelete.beforeOffset(5L)).asJava) | |||
var lowWatermark = result.lowWatermarks().get(topicPartition).get().lowWatermark() | |||
assertEquals(5L, lowWatermark) | |||
var lowWatermark = Option(result.lowWatermarks.get(topicPartition).get.lowWatermark) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Could we use Some
? Same below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I initially used Some(..) but subsequently learned that Some(null) is not None, however Option(null) is None and should be preferred so thought that it is better to use Option here (not that I'd expect lowWatermark to return null, but you never really know).
But I really just read about this today, so this might be misguided :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that makes sense, but in this case lowWatermark
is a long
which cannot be null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point, I'll push a commit with this fixed and @ijuma 's concern addressed.
…me(..) in Option Assignments.
Thanks for the updates. I'll merge once the builds complete (assuming no problems). |
…transient failures (apache#4238) Removed timeout from get call that caused the test to fail occasionally, this will instead fall back to the wrapping waitUntilTrue timeout. Also added unnesting of exceptions from ExecutionException that was originally missing and put the retrieved value for lowWatermark in the fail message for better readability in case of test failure. Reviewers: Ismael Juma <[email protected]>, Jason Gustafson <[email protected]>
…transient failures (apache#4238) Removed timeout from get call that caused the test to fail occasionally, this will instead fall back to the wrapping waitUntilTrue timeout. Also added unnesting of exceptions from ExecutionException that was originally missing and put the retrieved value for lowWatermark in the fail message for better readability in case of test failure. Reviewers: Ismael Juma <[email protected]>, Jason Gustafson <[email protected]>
…transient failures (apache#4238) Removed timeout from get call that caused the test to fail occasionally, this will instead fall back to the wrapping waitUntilTrue timeout. Also added unnesting of exceptions from ExecutionException that was originally missing and put the retrieved value for lowWatermark in the fail message for better readability in case of test failure. Reviewers: Ismael Juma <[email protected]>, Jason Gustafson <[email protected]>
Increase timeout to fix flaky integration test testLogStartOffsetCheckpoint.