-
Notifications
You must be signed in to change notification settings - Fork 908
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix Flaky-test: BookieZKExpireTest.testBookieServerZKSessionExpireBehaviour #3418
base: master
Are you sure you want to change the base?
fix Flaky-test: BookieZKExpireTest.testBookieServerZKSessionExpireBehaviour #3418
Conversation
…aviour by disable retry session expired
|
||
if (!retryExpired) { | ||
logger.warn("ZooKeeper session expired retries have been disabled"); | ||
return; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recreating new zookeeper instances is prohibited here.
// allow client.waitForConnection() timeout | ||
Thread.sleep(10000); | ||
assertFalse("Bookie should shutdown on losing zk session", server.isBookieRunning()); | ||
assertFalse("Bookie Server should shutdown on losing zk session", server.isRunning()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The increase from 3 seconds to 10 seconds is because creating a bookie temporary node needs to wait for the old zk client to establish a connection with the server. There is a timeout period of conf.getZkTimeout()=6 seconds. Once the timeout period expires, the bookie service will Closed due to session timeout, the extra 4 seconds is to consolidate the stability of the test and give the bookie service a little extra time to perform the shutdown.
ping @dlg99 @hangc0276 @Shoothzj @eolivelli PTAL. |
@StevenLuMT PTAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a better way to fix it?
@@ -165,7 +165,8 @@ public String getScheme() { | |||
protected void initialize(AbstractConfiguration<?> conf, | |||
StatsLogger statsLogger, | |||
RetryPolicy zkRetryPolicy, | |||
Optional<Object> optionalCtx) throws MetadataException { | |||
Optional<Object> optionalCtx, | |||
boolean zkRetryExpired) throws MetadataException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this check(zkRetryExpired) is difficult to understand, what scene is set to true and what scene is set to false
can you predictably repro the problem? Maybe my fix was not good enough :( but I only could repro it under docker with limited number of CPUs and after the fix it did not fail after running in the loop 150 times. |
fix old workflow,please see #3455 for detail |
When I run the test in my local, it is very flaky |
@poorbarcode There is a race condition describe in the motivation. |
…4144) ### Motivation The `BookieZKExpireTest.testBookieServerZKSessionExpireBehaviour` is a flaky test and the root cause is described in #3418. There is a race condition in the zookeeper reconnection and listener and it is an expected behavior in Bookie. We created one issue to track this #4142. We can skip this flaky test to unblock the pending PRs.
…4144) ### Motivation The `BookieZKExpireTest.testBookieServerZKSessionExpireBehaviour` is a flaky test and the root cause is described in #3418. There is a race condition in the zookeeper reconnection and listener and it is an expected behavior in Bookie. We created one issue to track this #4142. We can skip this flaky test to unblock the pending PRs. (cherry picked from commit 11ccebb)
…pache#4144) ### Motivation The `BookieZKExpireTest.testBookieServerZKSessionExpireBehaviour` is a flaky test and the root cause is described in apache#3418. There is a race condition in the zookeeper reconnection and listener and it is an expected behavior in Bookie. We created one issue to track this apache#4142. We can skip this flaky test to unblock the pending PRs.
…pache#4144) ### Motivation The `BookieZKExpireTest.testBookieServerZKSessionExpireBehaviour` is a flaky test and the root cause is described in apache#3418. There is a race condition in the zookeeper reconnection and listener and it is an expected behavior in Bookie. We created one issue to track this apache#4142. We can skip this flaky test to unblock the pending PRs. (cherry picked from commit 11ccebb)
…pache#4144) ### Motivation The `BookieZKExpireTest.testBookieServerZKSessionExpireBehaviour` is a flaky test and the root cause is described in apache#3418. There is a race condition in the zookeeper reconnection and listener and it is an expected behavior in Bookie. We created one issue to track this apache#4142. We can skip this flaky test to unblock the pending PRs.
Motivation
fix Flaky-test:
BookieZKExpireTest.testBookieServerZKSessionExpireBehaviour
by disable retry session expired.issue #3206 was fixed by PR #3415, but it still introduced another flaky test:
BookieZKExpireTest.testBookieServerZKSessionExpireBehaviour
According to the investigation, the reason is:
ZookeeperClient
still creates a newzookeeper
instance when the old zookeeper client session time out.Due to the asynchronous execution of two threads executing bookie temporary node re-registration and zk instance re-creation, the test program sometimes succeeds and sometimes fails.
When the temporary node re-registration is performed before the zk re-instantiation, the temporary node creation will use the old zk instance, which will cause a session timeout error, the bookie service will be shutdown, and the test will be successful;
When the zk re-instantiation precedes the re-registration of the temporary node, the temporary node creation will use the new re-instantiated zk instance, then the temporary node will be successfully created, the bookie service is running normally, and the test fails.
Changes
Add a
retryExpired
flag to indicate whether to run the zk instance and retry to create a new instance after the session times out.Set this flag to false for
ZKMetadataBookieDriver
;Other ZookeeperClient and normal ZookeeperClient applications will generate the default value true or set to true, which is consistent with the original behavior.
Test the behavior of this PR:
Before this PR:
Executed the test 10 times, all failed.
After this PR:
Executed the test 10 times, all successful.