-
Notifications
You must be signed in to change notification settings - Fork 13.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FLINK-34007][k8s] Adds workaround that fixes the deadlock when renewing the leadership lease fails #24132
Conversation
d903e8a
to
bb55379
Compare
ci failure is caused by some infrastructure instability in the compile step of the connect stage:
|
@flinkbot run azure |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for creating this PR. It definitely fixed the issue we have already discussed in FLINK-34007. I like the newly added ITCase and just left two minor comments
...est/java/org/apache/flink/kubernetes/kubeclient/resources/KubernetesLeaderElectorITCase.java
Show resolved
Hide resolved
.../src/main/java/org/apache/flink/kubernetes/kubeclient/resources/KubernetesLeaderElector.java
Outdated
Show resolved
Hide resolved
e499f87
to
694f261
Compare
.../src/main/java/org/apache/flink/kubernetes/kubeclient/resources/KubernetesLeaderElector.java
Outdated
Show resolved
Hide resolved
.../src/main/java/org/apache/flink/kubernetes/kubeclient/resources/KubernetesLeaderElector.java
Outdated
Show resolved
Hide resolved
@flinkbot run azure |
hm, the CI timeout should be related to the k8s client version update. I'm not able to reproduce it locally, though (after 3000 repetitions 🤔 ). I will check the release notes tomorrow. |
@flinkbot run azure |
Ok, looks like the timeout was caused by some changes to the MockServer (which was moved into the The mocked GET request contains some parameters. The order of the parameters matters when mocking the requests (...which is annyoying. I didn't find a way to make this order-agnostic). Anyway, updating the GET path makes the test succeed again. For 1.19, I'm gonna go ahead and upgrade Any objections from your side, @gyfora (on the decision to only upgrade to v6.9.2). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good @XComp thanks for taking the time to fix this!
Thanks, @gyfora . I squashed the branch into proper commits. @wangyang0918 do you have anything to add here? |
…@AfterEach methods
…e rely on a BlockingQueue This allows us to wait for tasks to "arrive".
…calls This way, we can use FlinkAssertions#assertThatFuture and use assertion messages instead of comments.
…#run() call v5.12.4 allowed us to reuse the LeaderElector. With v6.6.2 (fabric8io/kubernetes-client#4125) this behavior changed. One LeaderElector can only be used until the leadership is lost. An ITCase is added to cover the scenario where the leadership is lost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for updating this PR. LGTM.
+1 for merging.
What is the purpose of the change
fabric8io k8s client v6.6.2 was updated from v5.12.4 in Flink 1.18.0 (FLINK-31997). This major version upgrade brought a change of behavior in the
LeaderElector
. The old implementation allowed us to reuse theLeaderElector
instance. The new implementation expects to use one instance per leadership lifecycle (see https://github.com/fabric8io/kubernetes-client/pull/4125/files).See FLINK-34007 for further details.
Brief change log
Verifying this change
The test can be verified locally by running
minikube
locally and setting theITCASE_KUBECONFIG
environment variable to the kube config ofminikube
.Does this pull request potentially affect one of the following parts:
@Public(Evolving)
: noDocumentation