Updates the ttl-to-finished KEP to graduate the feature to Beta.
Enhancement issue: #592
ahg-g committed Jan 8, 2021
1 parent c92adb2 commit 8ae83c7
Showing 3 changed files with 175 additions and 12 deletions.
3 changes: 3 additions & 0 deletions keps/prod-readiness/sig-apps/592.yaml
@@ -0,0 +1,3 @@
kep-number: 592
beta:
approver: "@wojtek-t"
147 changes: 139 additions & 8 deletions keps/sig-apps/592-ttl-after-finish/README.md
@@ -20,6 +20,13 @@
- [Owner References](#owner-references)
- [Risks and Mitigations](#risks-and-mitigations)
- [Graduation Criteria](#graduation-criteria)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
- [Monitoring Requirements](#monitoring-requirements)
- [Dependencies](#dependencies)
- [Scalability](#scalability)
- [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
<!-- /toc -->

@@ -250,17 +257,141 @@ Mitigations:

## Graduation Criteria

We want to implement this feature for Pods/Jobs first to gather feedback, and
decide whether to generalize it to custom resources. This feature can be
promoted to beta after we finalize the decision for whether to generalize it or
not, and when it satisfies users' need for cleaning up finished resource
objects, without regressions.
- The feature is implemented for Jobs. As future work, it can be extended to Pods and perhaps custom resources, but that should happen under separate feature flags.
- Add necessary tests
- Graduate to Beta in v1.21
- Graduate to GA in v1.23

This will be promoted to GA once it's gone a sufficient amount of time as beta
with no changes.

[umbrella issues]: https://github.com/kubernetes/kubernetes/issues/42752

## Implementation History
## Production Readiness Review Questionnaire

### Feature Enablement and Rollback

* **How can this feature be enabled / disabled in a live cluster?**
- [x] Feature gate (also fill in values in `kep.yaml`)
- Feature gate name: TTLAfterFinished
- Components depending on the feature gate: kube-apiserver, kube-controller-manager
- [ ] Other
- Describe the mechanism:
- Will enabling / disabling the feature require downtime of the control
plane?
- Will enabling / disabling the feature require downtime or reprovisioning
of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).
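
As an illustration only, the gate could be set on both components of a kubeadm-managed cluster through `extraArgs`. This is a sketch that assumes the `kubeadm.k8s.io/v1beta2` configuration API; other installers pass `--feature-gates=TTLAfterFinished=true` to kube-apiserver and kube-controller-manager by their own means.

```yaml
# Sketch, assuming a kubeadm-managed control plane (not part of this KEP).
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
apiServer:
  extraArgs:
    # The API server must accept the TTL field on Job objects.
    feature-gates: "TTLAfterFinished=true"
controllerManager:
  extraArgs:
    # The TTL-after-finished controller runs inside kube-controller-manager.
    feature-gates: "TTLAfterFinished=true"
```

Setting the value to `TTLAfterFinished=false` on both components would disable the feature again.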

* **Does enabling the feature change any default behavior?**
No.

* **Can the feature be disabled once it has been enabled (i.e. can we roll back
the enablement)?**
Yes.

* **What happens if we reenable the feature if it was previously rolled back?**
It should work as expected.

* **Are there any tests for feature enablement/disablement?**
No.

### Rollout, Upgrade and Rollback Planning

* **How can a rollout fail? Can it impact already running workloads?**
It shouldn't impact already running workloads. This is an opt-in feature since users need to
explicitly set the TTLSecondsAfterFinished parameter in the job spec, which is accepted by the
api-server only if the feature is enabled.
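
For reference, opting in could look like the manifest below. In the serialized Job spec the field is spelled `ttlSecondsAfterFinished`; the Job name, image, command, and the 100-second value are placeholders chosen for illustration.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pi-with-ttl          # hypothetical name
spec:
  # Eligible for deletion 100 seconds after the Job finishes
  # (Complete or Failed). Omit the field to opt out of cleanup.
  ttlSecondsAfterFinished: 100
  template:
    spec:
      containers:
      - name: pi
        image: perl          # placeholder workload
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
```

A Job created without the field is never touched by the controller, which is why a rollout should not affect existing workloads.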

* **What specific metrics should inform a rollback?**
Unexpected restarts of kube-controller-manager

* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**
N/A

* **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
fields of API types, flags, etc.?**
No

### Monitoring Requirements

_This section must be completed when targeting beta graduation to a release._

* **How can an operator determine if the feature is in use by workloads?**
- Check the workqueue metrics
- When checking the job object in the api server, the TTLSecondsAfterFinished parameter should
be set if it was specified when the job was created.

* **What are the SLIs (Service Level Indicators) an operator can use to determine
the health of the service?**
- [x] Metrics
- Metric name: `ttl_after_finished_controller_rate_limiter_use`
- Metric name: `workqueue_adds_total`
- Metric name: `workqueue_depth`
- Metric name: `workqueue_queue_duration_seconds`
- Metric name: `workqueue_retries_total`
- Components exposing the metric: `kube-controller-manager`
- Metric name: `etcd_object_counts{resource="jobs.batch"}`
- Components exposing the metric: `kube-apiserver`.
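
One possible way to watch these metrics is a Prometheus alerting rule on the controller's workqueue depth. The sketch below makes several assumptions that should be verified against the cluster: that the controller's queue is exposed with the label `name="ttl_jobs_to_delete"`, and that the alert name, threshold, and duration (all hypothetical) suit the environment.

```yaml
# Sketch of a Prometheus rule file; label value and thresholds are assumptions.
groups:
- name: ttl-after-finished
  rules:
  - alert: TTLAfterFinishedQueueBacklog     # hypothetical alert name
    # Fires when finished Jobs keep piling up in the cleanup queue.
    expr: workqueue_depth{name="ttl_jobs_to_delete"} > 100
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "TTL-after-finished controller is not keeping up with Job cleanup"
```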

* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**

99% of the jobs that need cleanup are deleted within X minutes.
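
A rough proxy for this objective could be the 99th percentile of the time items wait in the controller's workqueue. The recording rule below is only a sketch: it assumes the queue label `name="ttl_jobs_to_delete"`, uses a hypothetical rule name, and measures queue wait time rather than end-to-end time from TTL expiry to deletion. No threshold is set because X is not yet defined.

```yaml
# Sketch of a Prometheus recording rule; names and label value are assumptions.
groups:
- name: ttl-after-finished-slo
  rules:
  - record: workqueue:ttl_jobs_to_delete_queue_duration_seconds:p99
    # p99 of time spent waiting in the queue, over a 5-minute window.
    expr: |
      histogram_quantile(
        0.99,
        sum by (le) (
          rate(workqueue_queue_duration_seconds_bucket{name="ttl_jobs_to_delete"}[5m])
        )
      )
```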

* **Are there any missing metrics that would be useful to have to improve observability
of this feature?**

No

### Dependencies

_This section must be completed when targeting beta graduation to a release._

* **Does this feature depend on any specific services running in the cluster?**
No.

### Scalability

* **Will enabling / using this feature result in any new API calls?**
- API call type: DELETE jobs
- Estimated throughput: depends on job creation and completion rate.
- originating component(s): kube-controller-manager

* **Will enabling / using this feature result in introducing new API types?**
No.

* **Will enabling / using this feature result in any new calls to the cloud
provider?**
No.

* **Will enabling / using this feature result in increasing size or count of
the existing API objects?**
No.

* **Will enabling / using this feature result in increasing time taken by any
operations covered by [existing SLIs/SLOs]?**
No.

* **Will enabling / using this feature result in non-negligible increase of
resource usage (CPU, RAM, disk, IO, ...) in any components?**
kube-controller-manager may consume more CPU depending on the number of jobs that require deletion in the system.

### Troubleshooting

_This section must be completed when targeting beta graduation to a release._

* **How does this feature react if the API server and/or etcd is unavailable?**
The controller will not be notified of job updates and it can't delete existing ones.

* **What are other known failure modes?**
None.

* **What steps should be taken if SLOs are not being met to determine the problem?**
TBD

[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos

## Implementation History
- 2018-08-16: Initial KEP
- 2021-01-08: KEP updated to
- indicate that the feature will be graduated for Jobs, and that Pods will be done as future work under a separate flag
- add production readiness questionnaire
- mark the feature for Beta graduation for jobs.
37 changes: 33 additions & 4 deletions keps/sig-apps/592-ttl-after-finish/kep.yaml
@@ -5,18 +5,47 @@ authors:
owning-sig: sig-apps
participating-sigs:
- sig-api-machinery
status: implemented
creation-date: 2018-08-16
reviewers:
- "@enisoc"
- "@tnozicka"
approvers:
- "@kow3ns"
editor: TBD
creation-date: 2018-08-16
last-updated: 2018-08-16
status: provisional
prr-approvers:
- "@wojtek-t"
see-also:
- n/a
replaces:
- n/a
superseded-by:
- n/a

# The target maturity stage in the current dev cycle for this KEP.
stage: beta

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.21"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
alpha: "v1.12"
beta: "v1.21"
stable: "v1.23"

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled
feature-gates:
- name: TTLAfterFinished
components:
- kube-apiserver
- kube-controller-manager
disable-supported: true
metrics:
- "ttl_after_finished_controller_rate_limiter_use"
- "workqueue_adds_total"
- "workqueue_depth"
- "workqueue_queue_duration_seconds"
- "workqueue_retries_total"
