Promote STS minReadySeconds to beta
ravisantoshgudimetla committed Sep 9, 2021
1 parent 18d3f20 commit ef9d1a8
Showing 3 changed files with 37 additions and 10 deletions.
2 changes: 2 additions & 0 deletions keps/prod-readiness/sig-apps/2599.yaml
@@ -1,3 +1,5 @@
kep-number: 2599
alpha:
approver: "@ehashman"
beta:
approver: "@ehashman"
41 changes: 33 additions & 8 deletions keps/sig-apps/2599-minreadyseconds-for-statefulsets/README.md
@@ -403,16 +403,23 @@ This section must be completed when targeting beta to a release.
Try to be as paranoid as possible - e.g., what if some components will restart
mid-rollout?
-->
It shouldn't impact already running workloads. This is an opt-in feature, since
users need to explicitly set the `.spec.minReadySeconds` field in the StatefulSet spec.
If the feature is disabled, the field is preserved if it was already set in the persisted StatefulSet object; otherwise it is silently dropped.
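
The preserve-or-drop rule described above can be sketched as follows. This is a simplified model, not the actual API server code: the function name `persisted_min_ready_seconds` is hypothetical, and an unset field is modeled as `None`, which is an assumption made for illustration.

```python
from typing import Optional


def persisted_min_ready_seconds(
    new_value: Optional[int],
    old_value: Optional[int],
    gate_enabled: bool,
) -> Optional[int]:
    """Return the minReadySeconds value that would be stored on a write.

    Hypothetical helper: `None` stands for "field not set".
    """
    if gate_enabled:
        # Feature gate on: the field is always honored.
        return new_value
    if old_value is not None:
        # Gate off, but the persisted object already used the field:
        # preserve whatever the update carries.
        return new_value
    # Gate off and the field was never set before: silently drop it.
    return None
```

With the gate disabled, a create or update that introduces the field for the first time loses it, while an object that already carried the field keeps it.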

###### What specific metrics should inform a rollback?

<!--
What signals should users be paying attention to when the feature is young
that might indicate a serious problem?
-->
We recently added the `kube_statefulset_status_replicas_available` metric
to track the number of available replicas. The cluster admin can use
this metric to detect problems. If its value immediately equals the number of `Ready` replicas (that is, without waiting for `minReadySeconds`), or if it is `0`, this can be considered a feature failure.

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Manually tested. No issues were found when we enabled the feature gate -> disabled it ->
re-enabled the feature gate. We still need to test the upgrade -> downgrade -> upgrade scenario.
<!--
Describe manual testing that was done and the outcomes.
Longer term, we may want to require automated upgrade/rollback tests, but we
@@ -424,7 +431,7 @@ are missing a bunch of machinery and tooling and can't do that now.
<!--
Even if applying deprecation policies, they may still surprise some users.
-->

None
### Monitoring Requirements

<!--
@@ -438,19 +445,21 @@ Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
checking if there are objects with field X set) may be a last resort. Avoid
logs or events for this purpose.
-->
By checking the `kube_statefulset_status_replicas_available` metric. If all the `Ready` replicas are accounted for in `kube_statefulset_status_replicas_available` after waiting for `minReadySeconds`, we can consider the feature to be in use by workloads.

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

<!--
Pick one more of these and delete the rest.
-->

- [ ] Metrics
- Metric name:
- [x] Metrics
- Metric name: `kube_statefulset_status_replicas_available`
- [Optional] Aggregation method:
- Components exposing the metric:
- [ ] Other (treat as last resort)
- Details:
- Components exposing the metric: kube-controller-manager via kube-state-metrics

The `kube_statefulset_status_replicas_available` metric gives the number of available replicas.
Comparing it with the `kube_statefulset_status_replicas_ready` metric should give us an understanding of the health of the feature. At times, `kube_statefulset_status_replicas_available` should lag behind `kube_statefulset_status_replicas_ready` for a duration of `minReadySeconds`. This lag indicates that the feature is working correctly.
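
One way to check this lag from scraped samples is sketched below. It assumes samples for a single StatefulSet arrive as time-ordered `(timestamp, ready, available)` tuples; the `Sample` type, both function names, and the scrape-interval tolerance are illustrative assumptions, not part of the KEP or of kube-state-metrics.

```python
from typing import Iterable, Optional, Tuple

# (unix_time_seconds, ready_replicas, available_replicas); hypothetical shape
Sample = Tuple[float, int, int]


def availability_lag(samples: Iterable[Sample], replicas: int) -> Optional[float]:
    """Seconds between all replicas being Ready and all being Available.

    Returns None if the samples never show all replicas Available.
    Samples are assumed to be sorted by timestamp.
    """
    ready_at: Optional[float] = None
    for ts, ready, available in samples:
        if ready_at is None and ready >= replicas:
            ready_at = ts
        if ready_at is not None and available >= replicas:
            return ts - ready_at
    return None


def lag_looks_healthy(
    lag: Optional[float], min_ready_seconds: float, scrape_interval: float
) -> bool:
    """Treat the lag as healthy if it is at least minReadySeconds,
    allowing one scrape interval of measurement error."""
    if lag is None:
        return False
    return lag >= min_ready_seconds - scrape_interval
```

A lag of `0` (replicas shown `Available` as soon as they are `Ready`) or a missing lag (replicas never shown `Available`) would both be reported as unhealthy, matching the two failure signals described for rollback.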

###### What are the reasonable SLOs (Service Level Objectives) for the above SLIs?

@@ -463,6 +472,7 @@ high level (needs more precise definitions) those may be things like:
job creation time) for cron job <= 10%
- 99,9% of /health requests per day finish with 200 code
-->
99% of the time, every pod counted as `Available` should have been `Ready` for at least the time specified in `.spec.minReadySeconds`.
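
As a rough illustration of how such an SLO could be evaluated, the sketch below takes, for each pod, how long it had been `Ready` at the moment it was first counted as `Available`, and computes the compliant fraction. The function names and the input shape are hypothetical; how these durations would be collected is not specified by the KEP.

```python
from typing import Sequence


def slo_compliance(
    ready_seconds_when_available: Sequence[float], min_ready_seconds: float
) -> float:
    """Fraction of pods that were Ready for at least minReadySeconds
    before being counted as Available. An empty input counts as compliant."""
    if not ready_seconds_when_available:
        return 1.0
    compliant = sum(
        1 for seconds in ready_seconds_when_available if seconds >= min_ready_seconds
    )
    return compliant / len(ready_seconds_when_available)


def meets_slo(
    ready_seconds_when_available: Sequence[float],
    min_ready_seconds: float,
    target: float = 0.99,
) -> bool:
    """True if the compliant fraction reaches the target (99% by default)."""
    return slo_compliance(ready_seconds_when_available, min_ready_seconds) >= target
```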

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

@@ -493,6 +503,7 @@ and creating new ones, as well as about cluster-level services (e.g. DNS):
- Impact of its outage on the feature:
- Impact of its degraded performance or high-error rates on the feature:
-->
None. It is part of the StatefulSet controller.

### Scalability

@@ -589,6 +600,8 @@ details). For now, we leave it here.

###### How does this feature react if the API server and/or etcd is unavailable?

The controller won't be able to make progress; all currently queued resources are re-queued. This feature does not change the current behavior of the controller in this regard.

###### What are other known failure modes?

<!--
@@ -603,11 +616,23 @@ For each of them, fill in the following information by copying the below template:
Not required until feature graduated to beta.
- Testing: Are there any tests for failure mode? If not, describe why.
-->
- `minReadySeconds` not respected and all the pods are shown as `Available` immediately
  - Detection: Look at the `kube_statefulset_status_replicas_available` metric
  - Mitigations: Disable the `StatefulSetMinReadySeconds` feature gate
  - Diagnostics: kube-controller-manager logs, when started at log level 4 or above
  - Testing: Yes, e2e tests are already in place
- `minReadySeconds` not respected and none of the pods are shown as `Available` after `minReadySeconds`
  - Detection: Look at the `kube_statefulset_status_replicas_available` metric. None of the pods will be shown as available
  - Mitigations: Disable the `StatefulSetMinReadySeconds` feature gate
  - Diagnostics: kube-controller-manager logs, when started at log level 4 or above
  - Testing: Yes, e2e tests are already in place

###### What steps should be taken if SLOs are not being met to determine the problem?

## Implementation History

- 2021-04-29: Initial KEP merged
- 2021-06-15: Initial implementation PR merged
- 2021-07-14: Graduation of the feature to beta proposed
<!--
Major milestones in the lifecycle of a KEP should be tracked in this section.
Major milestones might include:
4 changes: 2 additions & 2 deletions keps/sig-apps/2599-minreadyseconds-for-statefulsets/kep.yaml
@@ -19,12 +19,12 @@ see-also:


# The target maturity stage in the current dev cycle for this KEP.
stage: alpha
stage: beta

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.22"
latest-milestone: "v1.23"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone: