-
Notifications
You must be signed in to change notification settings - Fork 218
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Leader election mechanism does not call all the time the stopLeading action #2056
Comments
sounds right @ashangit , made a PR accordingly. Thank you! |
cc @shawkins |
There is an upstream fix for this now. |
thx @shawkins , I think either way this can still stay false. |
If you are not using cancel, then yes it won't make a difference. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Bug Report
What did you do?
We are currently running the flink apache operator which rely on the java-operator-sdk v4.3.5
We have configured it to run 3 operators relying on the leader mechanism based on k8s lease.
During some slownesses faced on a k8s cluster we have reached a point where 2 of the running operator were acting as leader.
From the log we can see that a new leader has acquired the lock while the "old" leader has not been killed as expected from the stopLeading callback
What did you expect to see?
I would expect to see the "old" leader to be killed reaching the
System.exit(1)
in stopLeading callbackWhat did you see instead? Under which circumstances?
The leader has not been stopped and has continue to act as a leader taking decision on events for the flink apache operator while a new one was also running.
Also from k8s audit logs we can see that the "old" leader is not anymore checking the lease (no get request issue on the lease)
Environment
Kubernetes cluster type:
vanilla
$ Mention java-operator-sdk version from pom.xml file
v4.3.5
$ java -version
openjdk version "11.0.20.1" 2023-08-22 LTS
OpenJDK Runtime Environment Corretto-11.0.20.9.1 (build 11.0.20.1+9-LTS)
OpenJDK 64-Bit Server VM Corretto-11.0.20.9.1 (build 11.0.20.1+9-LTS, mixed mode)
$ kubectl version
N/A
Possible Solution
From
LeaderElectionManager
source code the the LeaderElector is build with releaseOnCancel at true.This option was indeed not used on fabric.io:kubernetes-client-api v6.0.0
But with the version which is now use(fabric.io:kubernetes-client-api v6.7.2) this parameter change the behaviour of the stop leading by calling the release method instead of directly call the stopLeading callback.
This
release
method first get the current leader and check if it is the current leaderIf it is not it just return.
During the slownesses faced on the k8s cluster the get and patch request where slower/timeout.
We reach a point where:
renew deadline reached after
release
method which get the leader from lease and check if it is the leader (which is not the case anymore). Finally just stop the leader election mechanism and kept the pod upOne solution would be to enforce
releaseOnCancel
to falseAdditional context
The text was updated successfully, but these errors were encountered: