[FLINK-34007][k8s] Adds workaround that fixes the deadlock when renewing the leadership lease fails #24132

Merged (5 commits) on Feb 1, 2024

Conversation

XComp
Contributor

@XComp XComp commented Jan 18, 2024

What is the purpose of the change

The fabric8io k8s client was upgraded from v5.12.4 to v6.6.2 in Flink 1.18.0 (FLINK-31997). This major version upgrade changed the behavior of the LeaderElector. The old implementation allowed us to reuse the LeaderElector instance. The new implementation expects one instance per leadership lifecycle (see https://github.com/fabric8io/kubernetes-client/pull/4125/files).

See FLINK-34007 for further details.

Brief change log

  • Switches from reusing the fabric8io LeaderElector to re-instantiating it
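
The change above can be illustrated with a minimal sketch. It does not use the fabric8io API: `SingleUseElector` and `ReinstantiatingElectorRunner` are hypothetical names, and modelling the "one instance per leadership lifecycle" constraint with an exception is an assumption of this sketch (FLINK-34007 describes the actual failure mode as a deadlock).

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical stand-in for an elector that may only be run once. */
final class SingleUseElector {
    private final AtomicBoolean started = new AtomicBoolean(false);

    /** Runs one leadership lifecycle; a second call on the same instance is rejected. */
    void run() {
        if (!started.compareAndSet(false, true)) {
            throw new IllegalStateException("elector instance was already used");
        }
        // A real elector would try to acquire the lease here and
        // return once the leadership is lost.
    }
}

/** Creates a fresh elector for every lifecycle instead of reusing one instance. */
final class ReinstantiatingElectorRunner {
    private final Supplier<SingleUseElector> electorFactory;
    private final AtomicInteger lifecycles = new AtomicInteger();

    ReinstantiatingElectorRunner(Supplier<SingleUseElector> electorFactory) {
        this.electorFactory = electorFactory;
    }

    /** Starts a new leadership lifecycle with a newly created elector. */
    void runLifecycle() {
        SingleUseElector elector = electorFactory.get();
        elector.run();
        lifecycles.incrementAndGet();
    }

    /** Number of lifecycles that ran to completion. */
    int completedLifecycles() {
        return lifecycles.get();
    }
}
```

Passing a factory (here a `Supplier`) rather than an instance keeps the caller from accidentally holding on to an elector across lifecycles.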

Verifying this change

  • An ITCase was added to cover the scenario of FLINK-34007. It fails without the fix in this PR and succeeds with it.

The test can be verified locally by running minikube and setting the ITCASE_KUBECONFIG environment variable to minikube's kube config.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: yes
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

@flinkbot
Collaborator

flinkbot commented Jan 18, 2024

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

@XComp XComp force-pushed the FLINK-34007 branch 5 times, most recently from d903e8a to bb55379 Compare January 23, 2024 16:56
@XComp XComp marked this pull request as ready for review January 23, 2024 16:56
@XComp
Contributor Author

XComp commented Jan 24, 2024

The CI failure is caused by some infrastructure instability in the compile step of the connect stage:

Jan 23 18:59:09 18:59:09.769 [WARNING] locking FileBasedConfig[/home/agent01_azpcontainer/.config/jgit/config] failed after 5 retries

@XComp
Contributor Author

XComp commented Jan 24, 2024

@flinkbot run azure

@XComp XComp requested a review from wangyang0918 January 24, 2024 08:37
Contributor

@wangyang0918 wangyang0918 left a comment

Thanks for creating this PR. It definitely fixes the issue we already discussed in FLINK-34007. I like the newly added ITCase and just left two minor comments.

@XComp XComp force-pushed the FLINK-34007 branch 3 times, most recently from e499f87 to 694f261 Compare January 26, 2024 12:57
@XComp
Contributor Author

XComp commented Jan 29, 2024

@flinkbot run azure

@XComp
Contributor Author

XComp commented Jan 29, 2024

hm, the CI timeout should be related to the k8s client version update. I'm not able to reproduce it locally, though (after 3000 repetitions 🤔 ). I will check the release notes tomorrow.

@XComp
Contributor Author

XComp commented Jan 29, 2024

@flinkbot run azure

@XComp
Contributor Author

XComp commented Jan 30, 2024

Ok, looks like the timeout was caused by some changes to the MockServer (which was moved into the kubernetes-client repository).

The mocked GET request contains some parameters. The order of the parameters matters when mocking the requests (...which is annoying. I didn't find a way to make this order-agnostic). Anyway, updating the GET path makes the test succeed again.
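
To illustrate why the parameter order matters, here is a small sketch that contrasts an exact path comparison with an order-agnostic one. It is not the MockServer API: `QueryOrderSketch` and the example paths are hypothetical, and the parsing is simplified (repeated keys and URL encoding are ignored).

```java
import java.util.Map;
import java.util.TreeMap;

final class QueryOrderSketch {

    /** Exact string comparison, as a path-based mock would do: parameter order matters. */
    static boolean matchesExactly(String mockedPath, String requestedPath) {
        return mockedPath.equals(requestedPath);
    }

    /** Order-agnostic comparison: same base path and same key/value pairs. */
    static boolean matchesIgnoringOrder(String mockedPath, String requestedPath) {
        return basePath(mockedPath).equals(basePath(requestedPath))
                && queryParams(mockedPath).equals(queryParams(requestedPath));
    }

    /** Returns everything before the first '?'. */
    private static String basePath(String path) {
        int idx = path.indexOf('?');
        return idx < 0 ? path : path.substring(0, idx);
    }

    /** Parses "k1=v1&k2=v2" into a map; a key without '=' maps to the empty string. */
    private static Map<String, String> queryParams(String path) {
        Map<String, String> params = new TreeMap<>();
        int idx = path.indexOf('?');
        if (idx < 0 || idx == path.length() - 1) {
            return params;
        }
        for (String pair : path.substring(idx + 1).split("&")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                params.put(pair, "");
            } else {
                params.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return params;
    }
}
```

With a mocked path of `/resources?watch=true&allowWatchBookmarks=true` and a request for `/resources?allowWatchBookmarks=true&watch=true`, `matchesExactly` returns `false` while `matchesIgnoringOrder` returns `true`.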

For 1.19, I'm gonna go ahead and upgrade kubernetes-client to v6.9.2 to include all LeaderElector-related bug fixes. I don't see a reason to upgrade to v6.10.0 (release notes). There are no compelling changes and I'd rather keep the version bump small: we would have to apply more code changes on our side (due to fabric8io/kubernetes-client@fd50fa96#diff-375ebadb4285b8214fe6209c8e1758b3cd21f17f9637fb1206173026c0c033d3R65 ).

Any objections from your side, @gyfora (on the decision to only upgrade to v6.9.2)?

Contributor

@gyfora gyfora left a comment

Looks good, @XComp. Thanks for taking the time to fix this!

@XComp
Contributor Author

XComp commented Jan 31, 2024

Thanks, @gyfora . I squashed the branch into proper commits. @wangyang0918 do you have anything to add here?

XComp added 4 commits January 31, 2024 16:44
…e rely on a BlockingQueue

This allows us to wait for tasks to "arrive".
…calls

This way, we can use FlinkAssertions#assertThatFuture and use assertion messages instead of comments.
…#run() call

v5.12.4 allowed us to reuse the LeaderElector. With v6.6.2 (fabric8io/kubernetes-client#4125) this behavior changed. One LeaderElector can only be used until the leadership is lost.
An ITCase is added to cover the scenario where the leadership is lost.
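
The idea from the first commit above, relying on a BlockingQueue so that a test can wait for tasks to "arrive", can be sketched as follows. `QueueingExecutor` is a hypothetical name and not the class touched by the commit; the sketch only shows the general technique.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executor;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical test executor that queues tasks instead of running them right away. */
final class QueueingExecutor implements Executor {
    private final BlockingQueue<Runnable> tasks = new LinkedBlockingQueue<>();

    @Override
    public void execute(Runnable task) {
        tasks.add(task);
    }

    /**
     * Waits until a task arrives or the timeout elapses, then runs the task.
     * Returns false if no task arrived in time or the wait was interrupted.
     */
    boolean runNext(long timeout, TimeUnit unit) {
        try {
            Runnable task = tasks.poll(timeout, unit);
            if (task == null) {
                return false;
            }
            task.run();
            return true;
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can react to it.
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Compared with polling a plain list in a loop, the blocking `poll` lets the test thread sleep until another thread actually submits a task.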
Contributor

@wangyang0918 wangyang0918 left a comment

Thanks for updating this PR. LGTM.

+1 for merging.

@XComp XComp merged commit 95417a4 into apache:master Feb 1, 2024

4 participants