Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix][test] Fix flaky testCreateTopicWithZombieReplicatorCursor #20037

Merged
merged 1 commit into from
Apr 11, 2023

Conversation

BewareMyPower
Copy link
Contributor

@BewareMyPower BewareMyPower commented Apr 7, 2023

Fixes #20010

Motivation

PersistentTopicTest.testCreateTopicWithZombieReplicatorCursor is flaky because onPoliciesUpdate is asynchronous, while testCreateTopicWithZombieReplicatorCursor updates the namespace policy nearly the same time, so there is a race with the order of updating AbstractTopic#topicPolicies.

Sometimes the policies update might fail because the topic might be deleted in PersistentTopic#checkReplication:

Deleting topic [xxx] because local cluster is not part of global namespace repl list [remote]

Modifications

  • Sleep 100ms between two calls of updating the replication clusters
  • Add the local cluster to the replication cluster list
  • Add the retry logic for initialize

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: BewareMyPower#24

@BewareMyPower
Copy link
Contributor Author

Before this patch:
20230407103236

After this patch:
20230407111940

Example logs of updating replication clusters:

2023-04-07T10:11:27,615 - INFO  - [pulsar-3575-5:BrokerService@2361] - [prop/ns-abc] updating with Policies(auth_policies=AuthPoliciesImpl(..., replication_clusters=[remote]
2023-04-07T10:11:27,619 - INFO  - [pulsar-3575-4:BrokerService@2361] - [prop/ns-abc] updating with Policies(auth_policies=AuthPoliciesImpl(..., replication_clusters=[],

We can see the interval is only 4 milliseconds but they happened in different threads (pulsar-3575-5 and pulsar-3575-4). Sometimes the 2nd update might happen first.

@BewareMyPower BewareMyPower changed the title [fix][broker] Check replication cluster before starting the replicator [fix][broker] Check replication cluster before starting the replicator when policies are updated Apr 7, 2023
@Technoboy- Technoboy- added this to the 3.0.0 milestone Apr 7, 2023
@Technoboy- Technoboy- closed this Apr 7, 2023
@Technoboy- Technoboy- reopened this Apr 7, 2023
@@ -1542,7 +1542,13 @@ public CompletableFuture<Void> checkReplication() {
continue;
}
if (!replicators.containsKey(cluster)) {
futures.add(startReplicator(cluster));
futures.add(checkReplicationCluster(cluster).thenCompose(clusterExists -> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method checkReplicationCluster is guarantee that the remote cluster is still in the collection topicPolicies. But the line-1524 and line-1540 also did this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's not related to the flaky test. I've reverted it and changed the PR description. The tests could still pass after that.
image

Fixes apache#20010

### Motivation

`PersistentTopicTest.testCreateTopicWithZombieReplicatorCursor` is flaky
because `onPoliciesUpdate` is asynchronous, while
`testCreateTopicWithZombieReplicatorCursor` updates the namespace policy
nearly the same time, so there is a race with the order of updating
`AbstractTopic#topicPolicies`.

Sometimes the policies update might fail because the topic might be
deleted in `PersistentTopic#checkReplication`:

> Deleting topic [xxx] because local cluster is not part of global namespace repl list [remote]

### Modifications

- Sleep 100ms between two calls of updating the replication clusters
- Add the local cluster to the replication cluster list
- Add the retry logic for `initialize`
@BewareMyPower BewareMyPower changed the title [fix][broker] Check replication cluster before starting the replicator when policies are updated [fix][test] Fix flaky testCreateTopicWithZombieReplicatorCursor Apr 11, 2023
@BewareMyPower BewareMyPower force-pushed the bewaremypower/issue-20010 branch from f8a5806 to bb254f1 Compare April 11, 2023 04:12
@codecov-commenter
Copy link

Codecov Report

Merging #20037 (bb254f1) into master (dd05408) will increase coverage by 38.28%.
The diff coverage is 81.70%.

Impacted file tree graph

@@              Coverage Diff              @@
##             master   #20037       +/-   ##
=============================================
+ Coverage     34.65%   72.94%   +38.28%     
- Complexity    12429    31840    +19411     
=============================================
  Files          1606     1865      +259     
  Lines        125026   138040    +13014     
  Branches      13667    15167     +1500     
=============================================
+ Hits          43332   100693    +57361     
+ Misses        76081    29364    -46717     
- Partials       5613     7983     +2370     
Flag Coverage Δ
inttests 24.19% <0.00%> (-0.20%) ⬇️
systests 24.96% <0.00%> (?)
unittests 72.23% <81.70%> (+39.13%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...uth/KubernetesServiceAccountTokenAuthProvider.java 77.27% <77.27%> (ø)
...ervice/AbstractDispatcherSingleActiveConsumer.java 92.37% <100.00%> (+13.74%) ⬆️
...tent/NonPersistentDispatcherMultipleConsumers.java 71.60% <100.00%> (+32.09%) ⬆️
...ersistentStickyKeyDispatcherMultipleConsumers.java 74.32% <100.00%> (+16.97%) ⬆️
...ersistentStickyKeyDispatcherMultipleConsumers.java 85.44% <100.00%> (+3.67%) ⬆️
...functions/auth/KubernetesFunctionAuthProvider.java 80.00% <100.00%> (+80.00%) ⬆️
...s/runtime/kubernetes/KubernetesRuntimeFactory.java 80.42% <100.00%> (+80.42%) ⬆️
...ime/kubernetes/KubernetesRuntimeFactoryConfig.java 97.36% <100.00%> (+97.36%) ⬆️

... and 1514 files with indirect coverage changes

@cbornet cbornet requested a review from poorbarcode April 11, 2023 13:49
@michaeljmarshall
Copy link
Member

As discussed on the mailing list https://lists.apache.org/thread/w4jzk27qhtosgsz7l9bmhf1t7o9mxjhp, there is no plan to release 2.9.6, so I am going to remove the release/2.9.6 label

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flaky-test: PersistentTopicTest.testCreateTopicWithZombieReplicatorCursor
7 participants