
fix: multiple controllers run concurrently after leadership is lost #2309

Merged · 2 commits merged into minio:master from fix-multi-replica on Sep 11, 2024

Conversation

drivebyer
Contributor

Reproduction Steps:

  1. Start two operators locally, named minio-operator-1 and minio-operator-2.
  2. minio-operator-1 holds the lease for minio-operator-lock.
  3. Manually change the HOLDER of minio-operator-lock to minio-operator-2 (one way to do this is sketched after this list).
  4. The leader is now minio-operator-2.
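
For step 3, one way to hand the lease over is to edit the Lease object directly. The sketch below is illustrative only and not part of this PR; it assumes the lock is a coordination.k8s.io/v1 Lease named minio-operator-lock in the minio-operator namespace and that a local kubeconfig is available.

```go
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// Hypothetical helper for step 3: point the minio-operator-lock Lease at
// minio-operator-2. Assumes the lock is a coordination.k8s.io/v1 Lease in
// the minio-operator namespace; adjust names for your cluster.
func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	ctx := context.Background()
	leases := client.CoordinationV1().Leases("minio-operator")

	// Fetch the current lease and rewrite its holder identity.
	lease, err := leases.Get(ctx, "minio-operator-lock", metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}
	newHolder := "minio-operator-2"
	lease.Spec.HolderIdentity = &newHolder

	if _, err := leases.Update(ctx, lease, metav1.UpdateOptions{}); err != nil {
		log.Fatal(err)
	}
	log.Printf("lease holder changed to %s", newHolder)
}
```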

At this point, the logs for minio-operator-1 show the following:
[Screenshot 2024-09-03 16:34:51: minio-operator-1 logs]

The main-controller continues to run even after losing leadership.

After the PR

Follow the same steps as above.

Now, the logs for minio-operator-1 show the following:
[Screenshot 2024-09-03 16:57:17: minio-operator-1 logs]

minio-operator-1 stops and exits. If we run it in Kubernetes, the container will restart and attempt to acquire the lease. It will then detect that minio-operator-2 has become the leader and will remain inactive with no controller running. That's what we want.
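
For reference, this is the standard client-go leader-election pattern that produces the behavior described above: controllers run only inside OnStartedLeading, and OnStoppedLeading exits the process so the container restarts and re-enters the election. This is a minimal sketch with illustrative names, not the operator's actual code.

```go
package main

import (
	"context"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// Minimal sketch: run controllers only while holding the lease, and exit
// the process when leadership is lost so the container restarts and
// re-joins the election. Names are illustrative.
func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	id, _ := os.Hostname()
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Name:      "minio-operator-lock",
			Namespace: "minio-operator",
		},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true,
		LeaseDuration:   60 * time.Second,
		RenewDeadline:   15 * time.Second,
		RetryPeriod:     5 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Start sync handlers and work queues here; stop them
				// when ctx is cancelled.
				<-ctx.Done()
			},
			OnStoppedLeading: func() {
				// Exit instead of keeping controllers running; the
				// container restart re-enters the election.
				log.Println("leadership lost, exiting")
				os.Exit(1)
			},
		},
	})
}
```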

@harshavardhana
Member

Why would you manually make another operator leader?

@drivebyer
Contributor Author

@harshavardhana, that's to simulate an unstable k8s apiserver.

@drivebyer
Contributor Author

I believe leadership changes happen frequently when the Kubernetes API server is unstable; this has occurred a few times in my environment. @harshavardhana

@drivebyer drivebyer requested a review from pjuarezd September 10, 2024 02:12
drivebyer and others added 2 commits September 10, 2024 23:22
* stop sync handler and queue when no longer leader
* bugfix: the `for select` loop was treating a lost leadership as an error from the upgrade and sts webservers (it is not an error, just a lost leadership) and tried to restart them, which resulted in an NPE.

```
goroutine 30 [running]:
github.com/minio/operator/pkg/controller.leaderRun({0x1f1ec28, 0x4000a8a370}, 0x4000958b00, 0x2, 0x40005b4380, 0x40005b4ee0)
    github.com/minio/operator/pkg/controller/main-controller.go:495 +0x6c4
github.com/minio/operator/pkg/controller.(*Controller).Start.func2({0x1f1ec28?, 0x4000a8a370?})
    github.com/minio/operator/pkg/controller/main-controller.go:584 +0x34
created by k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run in goroutine 1
    k8s.io/[email protected]/tools/leaderelection/leaderelection.go:213 +0xe4
Stream closed EOF for minio-operator/minio-operator-787db96fd4-cfd4b (minio-operator)
```

Signed-off-by: pjuarezd <[email protected]>

lint: we don't need an additional method called `shutdown` to invoke SIGTERM

Signed-off-by: pjuarezd <[email protected]>
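
A minimal sketch of the select-loop behavior the bugfix commit describes: a lost leadership is handled as a stop signal, while genuine webserver errors still go through the restart path. Channel and function names here are hypothetical, not the operator's actual code.

```go
package main

import (
	"errors"
	"log"
	"time"
)

// leaderRun is a hypothetical sketch of the loop described in the bugfix
// commit: losing leadership is a stop signal, not a webserver failure,
// so it must not be fed into the restart path (which is what panicked).
func leaderRun(leaderLost <-chan struct{}, upgradeErr, stsErr <-chan error) {
	for {
		select {
		case <-leaderLost:
			// Leadership lost: stop the sync handlers and work queues
			// and return, so the process can exit and be restarted.
			log.Println("leadership lost, stopping controller")
			return
		case err := <-upgradeErr:
			// Genuine upgrade-webserver failure: restart only that server.
			log.Printf("upgrade server error, restarting: %v", err)
		case err := <-stsErr:
			// Genuine STS-webserver failure: restart only that server.
			log.Printf("sts server error, restarting: %v", err)
		}
	}
}

func main() {
	leaderLost := make(chan struct{})
	upgradeErr := make(chan error, 1)
	stsErr := make(chan error)

	// Simulate one real webserver error, then a lost leadership.
	upgradeErr <- errors.New("listener closed unexpectedly")
	go func() {
		time.Sleep(100 * time.Millisecond)
		close(leaderLost)
	}()

	leaderRun(leaderLost, upgradeErr, stsErr)
}
```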
@harshavardhana harshavardhana merged commit fee9a79 into minio:master Sep 11, 2024
21 checks passed
@drivebyer drivebyer deleted the fix-multi-replica branch September 11, 2024 08:17