-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Promote STS minReadySeconds to beta #2824
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,5 @@ | ||
kep-number: 2599 | ||
alpha: | ||
approver: "@ehashman" | ||
beta: | ||
approver: "@ehashman" |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -403,16 +403,23 @@ This section must be completed when targeting beta to a release. | |
Try to be as paranoid as possible - e.g., what if some components will restart | ||
mid-rollout? | ||
--> | ||
It shouldn't impact already running workloads. This is an opt-in feature since | ||
users need to explicitly set the minReadySeconds parameter in the StatefulSet spec i.e `.spec.minReadySeconds` field. | ||
If the feature is disabled the field is preserved. If it was already set in the persisted StatefulSet object, otherwise it is silently dropped. | ||
|
||
###### What specific metrics should inform a rollback? | ||
|
||
<!-- | ||
What signals should users be paying attention to when the feature is young | ||
that might indicate a serious problem? | ||
--> | ||
We have a metric called `kube_statefulset_status_replicas_available` | ||
which we added recently to track the number of available replicas. The cluster-admin could use | ||
this metric to track the problems. If the value is immediately equal to the value of `Ready` replicas or if it is `0`, it can be considered as a feature failure. | ||
|
||
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? | ||
|
||
Manually tested. No issues were found when we enabled the feature gate -> disabled it -> | ||
re-enabled the feature gate. We still need to test upgrade -> downgrade -> upgrade scenario. | ||
<!-- | ||
Describe manual testing that was done and the outcomes. | ||
Longer term, we may want to require automated upgrade/rollback tests, but we | ||
|
@@ -424,7 +431,7 @@ are missing a bunch of machinery and tooling and can't do that now. | |
<!-- | ||
Even if applying deprecation policies, they may still surprise some users. | ||
--> | ||
|
||
None | ||
### Monitoring Requirements | ||
|
||
<!-- | ||
|
@@ -438,19 +445,21 @@ Ideally, this should be a metric. Operations against the Kubernetes API (e.g., | |
checking if there are objects with field X set) may be a last resort. Avoid | ||
logs or events for this purpose. | ||
--> | ||
By checking the `kube_statefulset_status_replicas_available` metric. If all the `Ready` replicas are accounted for in `kube_statefulset_status_replicas_available` after waiting for `minReadySeconds`, we can consider the feature to be in use by workloads. | ||
|
||
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? | ||
ravisantoshgudimetla marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
<!-- | ||
Pick one more of these and delete the rest. | ||
--> | ||
|
||
- [ ] Metrics | ||
- Metric name: | ||
- [x] Metrics | ||
- Metric name: `kube_statefulset_status_replicas_available` | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe availability ratio will be a better option for this one? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think we can get the same ratio number using ready and available metrics. I don't have a strong opinion one or the other. Having said that, the current metrics are coming from kube_state_metrics and if we decide to go with this metric, we have to add the same metric for all the controllers. We can also revisit this during implementation. |
||
- [Optional] Aggregation method: | ||
- Components exposing the metric: | ||
- [ ] Other (treat as last resort) | ||
- Details: | ||
- Components exposing the metric: kube-controller-manager via kube_state_metrics. [PR which adds the metric](https://github.com/kubernetes/kube-state-metrics/pull/1532) | ||
|
||
The `kube_statefulset_status_replicas_available` gives the number of replicas available. Since the | ||
ehashman marked this conversation as resolved.
Show resolved
Hide resolved
|
||
`kube_statefulset_status_replicas_available` metric tracks available replicas, comparing it with `kube_statefulset_status_replicas_ready` metric should give us an understanding of the health of the feature. There should be certain times where `kube_statefulset_status_replicas_available` lags behind `kube_statefulset_status_replicas_ready` for a duration of minReadySeconds. This lag defines the correctness of the functionality. | ||
|
||
###### What are the reasonable SLOs (Service Level Objectives) for the above SLIs? | ||
|
||
|
@@ -463,6 +472,7 @@ high level (needs more precise definitions) those may be things like: | |
job creation time) for cron job <= 10% | ||
- 99,9% of /health requests per day finish with 200 code | ||
--> | ||
All the `Available` pods created should be more than the time specified in `.spec.minReadySeconds` 99% of the time. | ||
|
||
###### Are there any missing metrics that would be useful to have to improve observability of this feature? | ||
|
||
|
@@ -493,6 +503,7 @@ and creating new ones, as well as about cluster-level services (e.g. DNS): | |
- Impact of its outage on the feature: | ||
- Impact of its degraded performance or high-error rates on the feature: | ||
--> | ||
None. It is part of the StatefulSet controller. | ||
|
||
### Scalability | ||
|
||
|
@@ -589,6 +600,8 @@ details). For now, we leave it here. | |
|
||
###### How does this feature react if the API server and/or etcd is unavailable? | ||
|
||
The controller won't be able to make progress, all currently queued resources are re-queued. This feature does not change current behavior of the controller in this regard. | ||
|
||
###### What are other known failure modes? | ||
|
||
<!-- | ||
|
@@ -603,11 +616,23 @@ For each of them, fill in the following information by copying the below templat | |
Not required until feature graduated to beta. | ||
- Testing: Are there any tests for failure mode? If not, describe why. | ||
--> | ||
- `minReadySeconds` not respected and all the pods are shown `Available` immediately | ||
ravisantoshgudimetla marked this conversation as resolved.
Show resolved
Hide resolved
|
||
- Detection: Looking at `kube_statefulset_status_replicas_available` metric | ||
- Mitigations: Disable the `StatefulSetMinReadySeconds` feature flag | ||
- Diagnostics: Controller-manager when starting at log-level 4 and above | ||
- Testing: Yes, e2e tests are already in place | ||
- `minReadySeconds` not respected and none of the pods are shown as `Available` after `minReadySeconds` | ||
- Detection: Looking at `kube_statefulset_status_replicas_available`. None of the pods will be shown available | ||
- Mitigations: Disable the `StatefulSetMinReadySeconds` feature flag | ||
- Diagnostics: Controller-manager when starting at log-level 4 and above | ||
- Testing: Yes, e2e tests are already in place | ||
|
||
###### What steps should be taken if SLOs are not being met to determine the problem? | ||
|
||
## Implementation History | ||
|
||
- 2021-04-29: Initial KEP merged | ||
- 2021-06-15: Initial implementation PR merged | ||
- 2021-07-14: Graduate the feature to Beta proposed | ||
<!-- | ||
Major milestones in the lifecycle of a KEP should be tracked in this section. | ||
Major milestones might include: | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ravisantoshgudimetla let's focus just on the metrics, it'll be simpler and the question goal is to help cluster-admin identify problems with the feature (when enabled) and not necessarily regular users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed it.