
Leader election mechanism does not always call the stopLeading action #2056

Closed
ashangit opened this issue Sep 14, 2023 · 5 comments · Fixed by #2059

ashangit commented Sep 14, 2023

Bug Report

What did you do?

We are currently running the Apache Flink operator, which relies on java-operator-sdk v4.3.5.
We have configured it to run 3 operator replicas relying on the leader election mechanism based on a k8s lease.
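For reference, a minimal sketch of how such a setup is wired with java-operator-sdk; the lease name and namespace are hypothetical and the exact constructor overloads may differ across 4.x versions:

```java
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.javaoperatorsdk.operator.Operator;
import io.javaoperatorsdk.operator.api.config.LeaderElectionConfiguration;

public class OperatorMain {
  public static void main(String[] args) {
    // Lease-based leader election so only one of the 3 replicas reconciles at a time.
    // "flink-operator-lease" and "flink-operator-ns" are hypothetical values.
    Operator operator = new Operator(
        new KubernetesClientBuilder().build(),
        overrider -> overrider.withLeaderElectionConfiguration(
            new LeaderElectionConfiguration("flink-operator-lease", "flink-operator-ns")));
    // register reconcilers here, then:
    operator.start();
  }
}
```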

During a period of slowness on a k8s cluster we reached a point where 2 of the running operators were acting as leader.
From the logs we can see that a new leader acquired the lock while the "old" leader was not killed as expected by the stopLeading callback.

What did you expect to see?

I would expect the "old" leader to be killed by reaching the System.exit(1) in the stopLeading callback.

What did you see instead? Under which circumstances?

The old leader was not stopped and continued to act as a leader, taking decisions on events for the Apache Flink operator while a new one was also running.
Also, from the k8s audit logs we can see that the "old" leader is no longer checking the lease (no get requests issued on the lease).

Environment

Kubernetes cluster type:

vanilla

java-operator-sdk version (from pom.xml file)

v4.3.5

$ java -version

openjdk version "11.0.20.1" 2023-08-22 LTS
OpenJDK Runtime Environment Corretto-11.0.20.9.1 (build 11.0.20.1+9-LTS)
OpenJDK 64-Bit Server VM Corretto-11.0.20.9.1 (build 11.0.20.1+9-LTS, mixed mode)

$ kubectl version

N/A

Possible Solution

From the LeaderElectionManager source code, the LeaderElector is built with releaseOnCancel set to true.
This option was indeed not used in io.fabric8:kubernetes-client-api v6.0.0.

But with the version now in use (io.fabric8:kubernetes-client-api v6.7.2), this parameter changes the stop-leading behaviour by calling the release method instead of directly calling the stopLeading callback.

This release method first gets the current leader from the lease and checks whether the local instance is still the leader.
If it is not, it simply returns.
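Roughly, the behaviour described above corresponds to the sketch below (not the actual fabric8 code; readLeaseHolder, clearLease and onStopLeading are hypothetical stand-ins, only the control flow matters):

```java
// Simplified sketch of the release-on-cancel behaviour described above.
// readLeaseHolder(), clearLease() and onStopLeading() are hypothetical stand-ins
// for the fabric8 lock internals.
final class ReleaseSketch {

  private final String localIdentity;

  ReleaseSketch(String localIdentity) {
    this.localIdentity = localIdentity;
  }

  void release() {
    String currentHolder = readLeaseHolder(); // extra get request on the lease
    if (currentHolder == null || !localIdentity.equals(currentHolder)) {
      // Another instance already holds the lease: return without running the
      // stop-leading callback, so System.exit(1) is never reached.
      return;
    }
    clearLease();     // only the current holder clears the lease...
    onStopLeading();  // ...and runs the stop-leading callback
  }

  private String readLeaseHolder() { return null; } // placeholder: read holderIdentity from the Lease
  private void clearLease() {}                      // placeholder: patch the Lease to release it
  private void onStopLeading() { System.exit(1); }  // what was expected to be reached
}
```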

During the slowness on the k8s cluster, the get and patch requests were slow or timing out.
We reached a point where:

  • the old leader took too much time to patch the lease
  • the new leader took the lead and patched the lease
  • the old leader got a 409 HTTP code while patching (expected, as the lease had already been updated)
  • the old leader then reached the renew deadline
  • the old leader executed the release method, which got the leader from the lease and checked whether it was still the leader (which was no longer the case); it then just stopped the leader election mechanism and kept the pod up

One solution would be to force releaseOnCancel to false.
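For illustration, here is a minimal sketch of a fabric8 LeaderElector built that way; the lease namespace/name and identity are hypothetical, and this is written against the 6.x client API rather than java-operator-sdk's exact internal wiring:

```java
import java.time.Duration;

import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.extended.leaderelection.LeaderCallbacks;
import io.fabric8.kubernetes.client.extended.leaderelection.LeaderElectionConfigBuilder;
import io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.LeaseLock;

public class LeaderElectionSketch {
  public static void main(String[] args) {
    try (KubernetesClient client = new KubernetesClientBuilder().build()) {
      // "operator-ns", "operator-lease" and "pod-a" are hypothetical values.
      client.leaderElector()
          .withConfig(new LeaderElectionConfigBuilder()
              .withLock(new LeaseLock("operator-ns", "operator-lease", "pod-a"))
              .withLeaseDuration(Duration.ofSeconds(15))
              .withRenewDeadline(Duration.ofSeconds(10))
              .withRetryPeriod(Duration.ofSeconds(2))
              // With releaseOnCancel(false), losing the lease goes straight to the
              // stop-leading callback instead of going through release() first.
              .withReleaseOnCancel(false)
              .withLeaderCallbacks(new LeaderCallbacks(
                  () -> System.out.println("started leading"),
                  () -> System.exit(1), // stopLeading: terminate the old leader
                  newLeader -> System.out.println("new leader: " + newLeader)))
              .build())
          .build()
          .run(); // blocks while participating in the election
    }
  }
}
```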

Additional context

csviri self-assigned this Sep 15, 2023
csviri linked a pull request Sep 18, 2023 that will close this issue

csviri commented Sep 18, 2023

sounds right @ashangit , made a PR accordingly. Thank you!


csviri commented Sep 19, 2023

cc @shawkins

shawkins commented

There is an upstream fix for this now.


csviri commented Sep 19, 2023

thx @shawkins , I think either way this can still stay false.

shawkins commented

> thx @shawkins , I think either way this can still stay false.

If you are not using cancel, then yes it won't make a difference.
