[FLINK-34007][k8s] Adds workaround that fixes the deadlock when renewing the leadership lease fails #24132

XComp · 2024-01-18T13:42:30Z

What is the purpose of the change

The fabric8io k8s client was upgraded from v5.12.4 to v6.6.2 in Flink 1.18.0 (FLINK-31997). This major version upgrade changed the behavior of the LeaderElector. The old implementation allowed us to reuse a LeaderElector instance. The new implementation expects one instance per leadership lifecycle (see https://github.com/fabric8io/kubernetes-client/pull/4125/files).

See FLINK-34007 for further details.

Brief change log

Switches from reusing the fabric8io LeaderElector to re-instantiating it
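
The idea behind this change can be illustrated with a minimal sketch. It does not use the fabric8io API: `SingleUseElector`, `ToyElector`, and `ReinstantiatingElector` are hypothetical names introduced only for illustration, and the toy elector throws on a second `run()` purely to make the "one instance per leadership lifecycle" contract visible (the real client may behave differently when reused).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

class ReinstantiatingElectorSketch {

    /** Hypothetical stand-in for a single-use elector; NOT the fabric8io API. */
    interface SingleUseElector {
        void run();
    }

    /** Toy elector that may only be run once (assumption made for this sketch). */
    static final class ToyElector implements SingleUseElector {
        private final List<String> log;
        private boolean used;

        ToyElector(List<String> log) {
            this.log = log;
        }

        @Override
        public void run() {
            if (used) {
                throw new IllegalStateException("elector instance was already used");
            }
            used = true;
            log.add("leadership lifecycle started");
        }
    }

    /** Wrapper that creates a fresh elector for every run() call instead of reusing one. */
    static final class ReinstantiatingElector {
        private final Supplier<SingleUseElector> factory;

        ReinstantiatingElector(Supplier<SingleUseElector> factory) {
            this.factory = factory;
        }

        void run() {
            // A new instance per call, so each leadership lifecycle gets its own elector.
            factory.get().run();
        }
    }

    /** Runs two lifecycles through the wrapper and returns how many were started. */
    static int runTwice() {
        List<String> log = new ArrayList<>();
        ReinstantiatingElector elector = new ReinstantiatingElector(() -> new ToyElector(log));
        elector.run();
        elector.run(); // would throw if the same ToyElector instance were reused
        return log.size();
    }

    public static void main(String[] args) {
        System.out.println(runTwice());
    }
}
```

Passing a factory rather than an instance keeps the caller's code unchanged: it still calls `run()` on one long-lived object, while the single-use object underneath is replaced each time.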

Verifying this change

An ITCase was added to cover the scenario of FLINK-34007. It fails without the fix in this PR and succeeds with it.

The test can be verified locally by running minikube and setting the ITCASE_KUBECONFIG environment variable to minikube's kube config.

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): no
The public API, i.e., is any changed class annotated with @Public(Evolving): no
The serializers: no
The runtime per-record code paths (performance sensitive): no
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: yes
The S3 file system connector: no

Documentation

Does this pull request introduce a new feature? no
If yes, how is the feature documented? not applicable

flinkbot · 2024-01-18T13:51:23Z

CI report:

3fd82f5 Azure: SUCCESS

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run azure: re-run the last Azure build

XComp · 2024-01-24T07:45:32Z

The CI failure is caused by some infrastructure instability in the compile step of the connect stage:

Jan 23 18:59:09 18:59:09.769 [WARNING] locking FileBasedConfig[/home/agent01_azpcontainer/.config/jgit/config] failed after 5 retries

XComp · 2024-01-24T07:45:44Z

@flinkbot run azure

wangyang0918

Thanks for creating this PR. It definitely fixes the issue we already discussed in FLINK-34007. I like the newly added ITCase and just left two minor comments.

...est/java/org/apache/flink/kubernetes/kubeclient/resources/KubernetesLeaderElectorITCase.java

.../src/main/java/org/apache/flink/kubernetes/kubeclient/resources/KubernetesLeaderElector.java

XComp · 2024-01-29T14:00:49Z

@flinkbot run azure

XComp · 2024-01-29T23:05:43Z

hm, the CI timeout should be related to the k8s client version update. I'm not able to reproduce it locally, though (after 3000 repetitions 🤔 ). I will check the release notes tomorrow.

XComp · 2024-01-29T23:05:49Z

@flinkbot run azure

XComp · 2024-01-30T13:26:26Z

Ok, looks like the timeout was caused by some changes to the MockServer (which was moved into the kubernetes-client repository).

The mocked GET request contains some parameters. The order of the parameters matters when mocking the requests (...which is annoying; I didn't find a way to make this order-agnostic). Anyway, updating the GET path makes the test succeed again.
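
To illustrate why the parameter order matters, here is a small sketch. It does not use the MockServer API, and the request paths are hypothetical. It only shows that a literal comparison of path plus query string is order-sensitive, and what a normalizing comparison (sorting the query parameters) would look like if a custom matcher were available.

```java
import java.util.Arrays;

class QueryOrderSketch {

    /** Returns the path with its query parameters sorted, so parameter order no longer matters. */
    static String normalize(String pathAndQuery) {
        int queryStart = pathAndQuery.indexOf('?');
        if (queryStart < 0) {
            return pathAndQuery; // no query string, nothing to normalize
        }
        String[] params = pathAndQuery.substring(queryStart + 1).split("&");
        Arrays.sort(params);
        return pathAndQuery.substring(0, queryStart) + "?" + String.join("&", params);
    }

    public static void main(String[] args) {
        // Hypothetical paths for illustration only.
        String mocked = "/api/v1/namespaces/test/configmaps?labelSelector=app%3Dflink&watch=true";
        String actual = "/api/v1/namespaces/test/configmaps?watch=true&labelSelector=app%3Dflink";

        // A literal comparison is order-sensitive, so the mocked path would not match.
        System.out.println(mocked.equals(actual));
        // After normalization both paths compare equal.
        System.out.println(normalize(mocked).equals(normalize(actual)));
    }
}
```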

For 1.19, I'm gonna go ahead and upgrade kubernetes-client to v6.9.2 to include all LeaderElector-related bug fixes. I don't see a reason to upgrade to v6.10.0 (release notes). There are no compelling changes and I'd rather keep the version bump small: with v6.10.0, we would have to apply more code changes on our side (due to fabric8io/kubernetes-client@fd50fa96#diff-375ebadb4285b8214fe6209c8e1758b3cd21f17f9637fb1206173026c0c033d3R65 ).

Any objections from your side, @gyfora (on the decision to only upgrade to v6.9.2)?

gyfora

Looks good @XComp thanks for taking the time to fix this!

XComp · 2024-01-31T08:09:45Z

Thanks, @gyfora . I squashed the branch into proper commits. @wangyang0918 do you have anything to add here?

…#run() call v5.12.4 allowed us to reuse the LeaderElector. With v6.6.2 (fabric8io/kubernetes-client#4125) this behavior changed. One LeaderElector can only be used until the leadership is lost. An ITCase is added to cover the scenario where the leadership is lost.

wangyang0918

Thanks for updating this PR. LGTM.

+1 for merging.

XComp force-pushed the FLINK-34007 branch 5 times, most recently from d903e8a to bb55379 Compare January 23, 2024 16:56

XComp marked this pull request as ready for review January 23, 2024 16:56

XComp force-pushed the FLINK-34007 branch from bb55379 to f1f440c Compare January 23, 2024 17:25

XComp requested a review from wangyang0918 January 24, 2024 08:37

wangyang0918 reviewed Jan 26, 2024

...est/java/org/apache/flink/kubernetes/kubeclient/resources/KubernetesLeaderElectorITCase.java

.../src/main/java/org/apache/flink/kubernetes/kubeclient/resources/KubernetesLeaderElector.java (outdated)

XComp force-pushed the FLINK-34007 branch 3 times, most recently from e499f87 to 694f261 Compare January 26, 2024 12:57

wangyang0918 reviewed Jan 29, 2024

.../src/main/java/org/apache/flink/kubernetes/kubeclient/resources/KubernetesLeaderElector.java (outdated)

.../src/main/java/org/apache/flink/kubernetes/kubeclient/resources/KubernetesLeaderElector.java (outdated)

XComp force-pushed the FLINK-34007 branch from 3855035 to 9ed1a44 Compare January 29, 2024 13:08

XComp force-pushed the FLINK-34007 branch from 9ed1a44 to bb0e1fd Compare January 29, 2024 15:01

XComp force-pushed the FLINK-34007 branch from bb0e1fd to a1c6570 Compare January 30, 2024 13:05

gyfora approved these changes Jan 30, 2024

XComp force-pushed the FLINK-34007 branch from a1c6570 to 0719ea9 Compare January 31, 2024 08:08

XComp added 4 commits January 31, 2024 16:44

[hotfix][test] Moves ConfigMap lifecycle management into @BeforeEach/…

ef18ce1

…@AfterEach methods

[hotfix][test] Makes ManuallyTriggeredScheduledExecutorService#execut…

5f30bb1

…e rely on a BlockingQueue This allows us to wait for tasks to "arrive".

[hotfix][test] Refactors TestingLeaderCallbackHandler to allow async …

ae62b5d

…calls This way, we can use FlinkAssertions#assertThatFuture and use assertion messages instead of comments.

[FLINK-34007][k8s] Upgrade k8s client to v6.9.2 to cover client issue a…

3fd82f5

…pache#5464 fabric8io/kubernetes-client#5463

XComp force-pushed the FLINK-34007 branch from 0719ea9 to 3fd82f5 Compare January 31, 2024 15:49

wangyang0918 approved these changes Feb 1, 2024

XComp merged commit 95417a4 into apache:master Feb 1, 2024

flinkbot added the component=Runtime/Coordination label Apr 4, 2024
