Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Guard against possible race condition in new CNI #4560

Closed
slonka opened this issue Jul 7, 2022 · 2 comments · Fixed by #4650
Closed

Guard against possible race condition in new CNI #4560

slonka opened this issue Jul 7, 2022 · 2 comments · Fixed by #4650
Assignees
Labels
kind/feature New feature triage/accepted The issue was reviewed and is complete enough to start working on it

Comments

@slonka
Copy link
Contributor

slonka commented Jul 7, 2022

Description

TLDR:

If an application pod starts before CNI installer installs CNI, Kubelet has no knowledge of the CNI plugin. The result is that the application pod comes up without iptables rules.

The race condition is best described in the Istio design doc

Istio first solved the issue by labeling the broken pods with a specific label and then killing them (PR). However this approach causes additional churning during a node start. It also produces a number of errors in customer logs and lacks portability. So they switched to the taint controller approach implemented in this PR.

Possible solutions:

  1. Reimplement the taint controller
  2. At DP startup check if transparent proxy is set and fail the start if it's not.
  3. Try to revive "KEP: Node Readiness Gates #1003"
  4. Base this on pod healthcheck

WDYT?

@slonka slonka added triage/pending This issue will be looked at on the next triage meeting kind/feature New feature labels Jul 7, 2022
@jakubdyszkiewicz
Copy link
Contributor

Triage: Let's go with option 1. implemented in CP. The extra K8S RBAC (update node) should be only set when CNI is enabled.

@jakubdyszkiewicz jakubdyszkiewicz added triage/accepted The issue was reviewed and is complete enough to start working on it and removed triage/pending This issue will be looked at on the next triage meeting labels Jul 11, 2022
@slonka
Copy link
Contributor Author

slonka commented Jul 21, 2022

I started implementing this and it seems that if we had a default probe (like testserver in the test framework has it and demo-client does not) that will not go healthy if the traffic is not redirected to Envoy that would prevent workloads from being accessible. If we couple that with having the CNI on startup getting a list of pods and injecting the rules inside those pods then we should be fine against this race condition. WDYT? decided to stick with the controller

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature New feature triage/accepted The issue was reviewed and is complete enough to start working on it
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants