
[BUG] Should we restart all pods to resync cache after vapp was updated? #11

Open

Eikykun opened this issue Jan 19, 2022 · 3 comments

@Eikykun (Member) commented Jan 19, 2022

What happened:
If the vapp specHash changes, the proxy container blocks leader election in order to restart the main container. But only the leader pod restarts; the proxy containers in the other pods keep waiting for their main containers to restart.

Why it happened:
In the controller-runtime LeaderElector, leader election runs in two loops:

  1. acquire() the lock
  2. renew() the lock

Only loop 2 panics after catching an error. The leader pod is in loop 2, but the other pods are still in loop 1, so they never see a fatal failure and never restart.
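
For context, the flow in k8s.io/client-go/tools/leaderelection (which controller-runtime delegates to) looks roughly like this; it is a simplified paraphrase of the upstream Run method, not the exact source:

```go
// Simplified sketch of LeaderElector.Run from
// k8s.io/client-go/tools/leaderelection.
func (le *LeaderElector) Run(ctx context.Context) {
	defer le.config.Callbacks.OnStoppedLeading()

	// Loop 1: poll Get/Create on the lock object until we become
	// leader. Non-leader pods block here indefinitely; errors are
	// only retried, never fatal.
	if !le.acquire(ctx) {
		return // ctx was cancelled
	}

	ctx, cancel := context.WithCancel(ctx)
	defer cancel()
	go le.config.Callbacks.OnStartedLeading(ctx)

	// Loop 2: periodically renew the lock. Only a renewal failure
	// here makes Run return, after which the manager exits.
	le.renew(ctx)
}
```

So a proxy that makes lock requests fail can only kill the current leader; the followers just keep retrying acquire().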

What you expected to happen:
My expectation is that all pods restart after the vapp specHash changes...

@FillZpp (Member) commented Jan 20, 2022

@Eikykun Yeah, that would be a serious problem. Maybe a tricky solution:

  1. Return StatusNotFound for the leaderelection Get.
  2. Return a mocked success for the leaderelection Create, without actually sending it to the apiserver. At this point the controller will assume it has become the leader.
  3. Return StatusNotAcceptable for the next Get, so that the controller fails to renew and exits.

The problem is that the controller will consider itself the leader for 1~2s this way. I'm not sure whether that is acceptable, or whether there is a better solution.
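
A hypothetical sketch of that interception in the proxy; every name here is invented for illustration, and a real mock would also have to return a well-formed lock object (e.g. a Lease) in the Create response body so client-go can decode it:

```go
package proxy

import "net/http"

// fakeElectionProxy stands in for the ctrlmesh-proxy interception
// logic described above; it is not the actual project code.
type fakeElectionProxy struct {
	phase int // 0: fake 404 next, 1: fake create next, 2: fail renewals
}

func (p *fakeElectionProxy) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	switch {
	case r.Method == http.MethodGet && p.phase == 0:
		// Step 1: pretend the lock object does not exist.
		p.phase = 1
		w.WriteHeader(http.StatusNotFound)
	case r.Method == http.MethodPost && p.phase == 1:
		// Step 2: report a successful Create without forwarding it to
		// the apiserver; the controller now believes it is the leader.
		// (A real implementation must also write a valid object body.)
		p.phase = 2
		w.WriteHeader(http.StatusCreated)
	default:
		// Step 3: fail every later request on the lock, so the renew
		// loop errors out and the controller container exits.
		w.WriteHeader(http.StatusNotAcceptable)
	}
}
```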

@Eikykun (Member, Author) commented Jan 20, 2022

> @Eikykun Yeah, that would be a serious problem. Maybe a tricky solution:
>
>   1. Return StatusNotFound for the leaderelection Get.
>   2. Return a mocked success for the leaderelection Create, without actually sending it to the apiserver. At this point the controller will assume it has become the leader.
>   3. Return StatusNotAcceptable for the next Get, so that the controller fails to renew and exits.
>
> The problem is that the controller will consider itself the leader for 1~2s this way. I'm not sure whether that is acceptable, or whether there is a better solution.

Only one pod is allowed to be the leader at a time, for safety...
But wouldn't all pods become leaders under this scheme? Maybe we need to design a process that restarts the pods sequentially.

@FillZpp (Member) commented Jan 20, 2022

I'm just thinking... How about triggering the controller container restart with a specific liveness probe?

When the ctrlmesh webhook injects the ctrlmesh-init and ctrlmesh-proxy containers into a pod, it can also set a liveness probe on the original controller container. For example, the probe could check for a file in a shared volume, and ctrlmesh-proxy could then trigger a restart of that container by deleting the file...
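
Roughly what that injected probe could look like, using the corev1 types; the sentinel path, timings, and helper name are all made up for illustration:

```go
package webhook

import (
	corev1 "k8s.io/api/core/v1"
)

// addRestartProbe is an illustrative helper, not ctrlmesh code. It
// attaches a liveness probe that passes while a sentinel file exists;
// ctrlmesh-proxy would delete the file to make kubelet restart the
// container. The container also needs the shared volume mounted.
func addRestartProbe(c *corev1.Container) {
	c.LivenessProbe = &corev1.Probe{
		// k8s.io/api v0.23+; older versions name this field Handler.
		ProbeHandler: corev1.ProbeHandler{
			Exec: &corev1.ExecAction{
				Command: []string{"sh", "-c", "test -f /ctrlmesh/alive"},
			},
		},
		PeriodSeconds:    5,
		FailureThreshold: 1,
	}
}
```

A nice property of this approach: kubelet restarts just the controller container in place, so the pod and ctrlmesh-proxy keep running, and no pod ever has to fake leadership as in the earlier scheme.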
