Abstract
In the current implementation, kubernetes_host only takes a string type, as we can see in the schema, so we can only pass a single Kubernetes API server address. Reaching the API through one stable address is normally achieved via a service/LB in front of the master nodes in the cluster, but that requires a lot of extra work in a typical infrastructure. People also may not want to take on that overhead just to add some resiliency for the Vault Kubernetes auth method.
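As a minimal sketch of what the proposed schema change could look like in the plugin's config path, using the Vault SDK's framework package (the description text here is paraphrased, not the plugin's exact source):

```go
// Sketch only: the proposed BREAKING change to the kubernetes_host field,
// switching it from a single string to a comma-separated string slice.
package kubeauth

import "github.com/hashicorp/vault/sdk/framework"

func configFields() map[string]*framework.FieldSchema {
	return map[string]*framework.FieldSchema{
		"kubernetes_host": {
			// Current type is framework.TypeString (a single address).
			// Proposed: a slice, so several API server addresses can be set.
			Type:        framework.TypeCommaStringSlice,
			Description: "One or more addresses of the Kubernetes API server, tried in order.",
		},
	}
}
```

A nice side effect of framework.TypeCommaStringSlice is that a single comma-free value would still parse, which could soften the migration even though the type change itself is still breaking.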
Problem
We (@yilmazo, @erkanzileli, @developer-guy) filed this issue because one of our master nodes (the one we set as the kubernetes_host variable) went down and caused an incident within a short time window. We don't have an LB in front of our master nodes; if we had one, we probably wouldn't have hit this issue.
But what if we had an LB and the LB itself went down? That scenario still wouldn't be covered.
Solution
We should provide a solution that covers the following two scenarios:
for consumers who don't use an LB
for consumers who already use an LB
We should implement a fallback mechanism and provide some resiliency methods:
BREAKING!: change kubernetes_host from a string to a string array: []string
If we get a 5xx or similar error from host[0], fall back to host[1]: try host[index] -> host[index+1] until the last one.
Resiliency: we should retry while talking to the Kubernetes API.
For example, we can use a retryable HTTP client when calling the API in the token review function.
Both of the ideas above are essential for a highly resilient system, since we use the Vault Kubernetes auth method with a production Kubernetes cluster; a rough sketch combining both follows below.
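A rough sketch of how the two ideas could fit together, using hashicorp/go-retryablehttp for per-host retries and a plain loop for the host fallback. The helper name, the /healthz path, and the addresses are illustrative assumptions, not the plugin's actual token review code:

```go
// Sketch only: try each configured host in order (host[index] -> host[index+1]),
// retrying transient failures against each host, and falling back on network
// errors or 5xx responses.
package main

import (
	"fmt"
	"net/http"

	retryablehttp "github.com/hashicorp/go-retryablehttp"
)

// callWithFallback is a hypothetical helper; the real change would live in the
// plugin's token review call.
func callWithFallback(hosts []string, path string) (*http.Response, error) {
	rc := retryablehttp.NewClient()
	rc.RetryMax = 3               // per-host retries (resiliency idea)
	client := rc.StandardClient() // *http.Client with retry/backoff built in

	var lastErr error
	for _, host := range hosts {
		resp, err := client.Get(host + path)
		if err != nil {
			lastErr = err // network-level failure: fall back to the next host
			continue
		}
		if resp.StatusCode >= 500 {
			resp.Body.Close()
			lastErr = fmt.Errorf("%s returned %d", host, resp.StatusCode)
			continue
		}
		return resp, nil // first healthy host wins
	}
	return nil, fmt.Errorf("all kubernetes_host entries failed, last error: %v", lastErr)
}

func main() {
	hosts := []string{"https://master-1:6443", "https://master-2:6443"}
	resp, err := callWithFallback(hosts, "/healthz")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```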
Alternative Solution
create TCP load balancer infrastructure from scratch, put it in front of the master nodes, and ignore this issue
create a Kubernetes operator from scratch that watches shared informers; if any change is observed across the master nodes (i.e., if one of them goes down), call the Vault API to update the kubernetes_host key in the auth/kubernetes/config path [2] with another master node that is running and healthy (a rough sketch of this follows below)
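For completeness, a very rough sketch of what such an operator could look like, built on client-go shared informers and the Vault API client. The control-plane label, API server port, and address selection are assumptions rather than anything specified here:

```go
// Sketch only: a tiny controller that watches Node objects and, when a
// control-plane node stops being Ready, points auth/kubernetes/config at
// another healthy control-plane node.
package main

import (
	"log"

	vaultapi "github.com/hashicorp/vault/api"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

const controlPlaneLabel = "node-role.kubernetes.io/control-plane" // assumed label

func ready(n *corev1.Node) bool {
	for _, c := range n.Status.Conditions {
		if c.Type == corev1.NodeReady {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}

func apiURL(n *corev1.Node) string {
	for _, a := range n.Status.Addresses {
		if a.Type == corev1.NodeInternalIP {
			return "https://" + a.Address + ":6443" // assumed API server port
		}
	}
	return ""
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	vault, err := vaultapi.NewClient(vaultapi.DefaultConfig()) // reads VAULT_ADDR/VAULT_TOKEN
	if err != nil {
		log.Fatal(err)
	}

	stop := make(chan struct{})
	factory := informers.NewSharedInformerFactory(clientset, 0)
	nodes := factory.Core().V1().Nodes()

	nodes.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(_, newObj interface{}) {
			node := newObj.(*corev1.Node)
			if _, isCP := node.Labels[controlPlaneLabel]; !isCP || ready(node) {
				return
			}
			// A control-plane node went unhealthy: pick another Ready one.
			candidates, err := nodes.Lister().List(labels.Set{controlPlaneLabel: ""}.AsSelector())
			if err != nil {
				return
			}
			for _, c := range candidates {
				if c.Name != node.Name && ready(c) && apiURL(c) != "" {
					if _, err := vault.Logical().Write("auth/kubernetes/config", map[string]interface{}{
						"kubernetes_host": apiURL(c),
					}); err != nil {
						log.Printf("updating kubernetes_host: %v", err)
					}
					return
				}
			}
		},
	})

	factory.Start(stop)
	cache.WaitForCacheSync(stop, nodes.Informer().HasSynced)
	<-stop // run until killed
}
```

Compared with building and operating something like this from scratch, the in-plugin fallback sketched earlier keeps the resiliency concern inside the auth method itself.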
A similar problem was discussed in hashicorp/vault#5408 [1] almost 3 years ago; since that issue still hasn't been resolved, we came up with this new proposal.
cc @briankassouf @catsby @jefferai fyi @mitchellmaler @m1kola
Footnotes
1. https://github.com/hashicorp/vault/issues/5408#issuecomment-640946258
2. https://github.com/hashicorp/vault/issues/6987