-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Graduate KEP-2879 PodsReady to stable #4237
Changes from 2 commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -38,17 +38,17 @@ Items marked with (R) are required *prior to targeting to a milestone / release* | |
- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) | ||
- [x] (R) KEP approvers have approved the KEP status as `implementable` | ||
- [x] (R) Design details are appropriately documented | ||
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) | ||
- [ ] e2e Tests for all Beta API Operations (endpoints) | ||
- [ ] (R) Ensure GA e2e tests for meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) | ||
- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) | ||
- [x] e2e Tests for all Beta API Operations (endpoints) | ||
- [ ] (R) Ensure GA e2e tests for meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) | ||
mimowo marked this conversation as resolved.
Show resolved
Hide resolved
|
||
- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free | ||
- [ ] (R) Graduation criteria is in place | ||
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) | ||
- [x] (R) Graduation criteria is in place | ||
- [x] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) | ||
- [x] (R) Production readiness review completed | ||
- [x] (R) Production readiness review approved | ||
- [ ] "Implementation History" section is up-to-date for milestone | ||
- [x] "Implementation History" section is up-to-date for milestone | ||
- [x] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] | ||
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes | ||
- [x] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes | ||
|
||
[kubernetes.io]: https://kubernetes.io/ | ||
[kubernetes/enhancements]: https://git.k8s.io/enhancements | ||
|
@@ -125,6 +125,7 @@ pods that have the `Ready` condition. | |
- Count of ready pods. | ||
- Feature gate disablement. | ||
- Verify passing existing E2E and conformance tests for Job. | ||
- Added e2e test for the count of ready pods. | ||
|
||
### Graduation Criteria | ||
|
||
|
@@ -157,14 +158,14 @@ pods that have the `Ready` condition. | |
#### GA | ||
|
||
- Every bug report is fixed. | ||
- Explore setting different batch periods for regular pod updates versus | ||
finished pod updates, so we can do less pod readiness updates without | ||
compromising how fast we can declare a job finished. | ||
- The job controller ignores the feature gate. | ||
- E2e test for the count of ready pods. | ||
- Lock the feature-gate and document deprecation of the feature-gate | ||
|
||
#### Deprecation | ||
|
||
N/A | ||
In GA+2 release: | ||
- Remove the feature gate definition | ||
- Job controller ignores the feature gate | ||
|
||
### Upgrade / Downgrade Strategy | ||
|
||
|
@@ -210,7 +211,9 @@ The Job controller will start populating the field again. | |
|
||
###### Are there any tests for feature enablement/disablement? | ||
|
||
Yes, there are tests at unit and [integration] level. | ||
Yes, there are unit tests (see [link](https://github.com/kubernetes/kubernetes/blob/e8abe1af8dcb36f65ef7aa7135d4664b3db90e89/pkg/controller/job/job_controller_test.go#L236)). | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. These are nice feature tests, but (please correct me if I'm wrong), these aren't enablement/disablement tests. Enablement/disablement tests are tests that effectively change the FG in the middle of the test to see what happens if I enable or disable the feature. Do you have any tests like these? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think we have (also could not find). I've only done manual testing of reenabling as described here "Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?". FYI @alculquicondor I can add it to requirements, but I'm not sure the test will work on GA as we plan to lock the feature gate. So, the options I see:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's too late now, exactly because what you wrote (the tets won't work after locking FG). Let's just update the text here to something: There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks, added a note on this along the lines. |
||
There are also integration tests (see [link](https://github.com/kubernetes/kubernetes/blob/e8abe1af8dcb36f65ef7aa7135d4664b3db90e89/test/integration/job/job_test.go#L1364) and | ||
[link](https://github.com/kubernetes/kubernetes/blob/e8abe1af8dcb36f65ef7aa7135d4664b3db90e89/test/integration/job/job_test.go#L1517)). | ||
|
||
### Rollout, Upgrade and Rollback Planning | ||
|
||
|
@@ -221,20 +224,20 @@ The field is only informative, it doesn't affect running workloads. | |
###### What specific metrics should inform a rollback? | ||
|
||
- An increase in `job_sync_duration_seconds`. | ||
- A reduction in `job_sync_num`. | ||
- A reduction in `job_syncs_total`. | ||
|
||
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? | ||
|
||
A manual test will be performed, as follows: | ||
A manual test on Beta was performed, as follows: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This testing scenario is updated. One slight change compared to the scenario was point |
||
|
||
1. Create a cluster in 1.23. | ||
1. Upgrade to 1.24. | ||
1. Create long running Job A, ensure that the ready field is populated. | ||
1. Downgrade to 1.23. | ||
1. Verify that ready field in Job A is not lost, but also not updated. | ||
1. Create long running Job B, ensure that ready field is not populated. | ||
1. Upgrade to 1.24. | ||
1. Verify that Job A and B ready field is tracked again. | ||
1. Create a cluster in 1.28 with the `JobReadyPods` disabled (`=false`). | ||
2. Simulate upgrade by modifying control-plane manifests to enable `JobReadyPods`. | ||
3. Create long running Job A, ensure that the ready field is populated. | ||
4. Simulate downgrade by modifying control-plane manifests to disable `JobReadyPods`. | ||
5. Verify that ready field in Job A is cleaned up shortly after the startup of the Job controller completes. | ||
6. Create long running Job B, ensure that ready field is not populated. | ||
7. Simulate upgrade by modifying control-plane manifests to enable `JobReadyPods`. | ||
8. Verify that Job A and B ready field is tracked again shortly after the startup of the Job controller completes. | ||
|
||
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? | ||
|
||
|
@@ -259,7 +262,7 @@ the controller doesn't create new Pods or tracks finishing Pods. | |
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? | ||
|
||
- [x] Metrics | ||
- Metric name: `job_sync_duration_seconds`, `job_sync_total`. | ||
- Metric name: `job_sync_duration_seconds`, `job_syncs_total`. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Adjusted the metric name to the actual value: https://github.com/kubernetes/kubernetes/blob/e8abe1af8dcb36f65ef7aa7135d4664b3db90e89/pkg/controller/job/metrics/metrics.go#L52 |
||
- Components exposing the metric: `kube-controller-manager` | ||
|
||
###### Are there any missing metrics that would be useful to have to improve observability of this feature? | ||
|
@@ -281,7 +284,7 @@ No. | |
|
||
Estimated throughput: at most one API call for each Job Pod reaching Ready | ||
condition. | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Please describe the rate-limitting mechanism (that you mentioned in the comment) here - it's a very important mitigation that has to be mentioned here. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Added description, let me know if this is enough. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Looks good - thanks. |
||
Originating component: job-controller | ||
|
||
###### Will enabling / using this feature result in introducing new API types? | ||
|
@@ -306,6 +309,10 @@ No. | |
|
||
No. | ||
|
||
###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? | ||
wojtek-t marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
No. | ||
|
||
### Troubleshooting | ||
|
||
###### How does this feature react if the API server and/or etcd is unavailable? | ||
|
@@ -314,8 +321,7 @@ No change from existing behavior of the Job controller. | |
|
||
###### What are other known failure modes? | ||
|
||
- When the cluster has apiservers with skewed versions, the `Job.status.ready` | ||
might remain zero. | ||
No. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not actually aware of any failure modes currently. I'm also not sure why the field would be zero if the API server version is skewed. I assume, for older API servers, which don't know the field we would have |
||
|
||
###### What steps should be taken if SLOs are not being met to determine the problem? | ||
|
||
|
@@ -332,6 +338,7 @@ No change from existing behavior of the Job controller. | |
|
||
- 2021-08-19: Proposed KEP starting in alpha status, including full PRR questionnaire. | ||
- 2022-01-05: Proposed graduation to beta. | ||
- 2022-03-20: Merged [PR#107476](https://github.com/kubernetes/kubernetes/pull/107476) with beta implementation | ||
|
||
## Drawbacks | ||
|
||
|
@@ -346,6 +353,17 @@ Pod created. | |
to accept connections. On the other hand, the `Ready` condition is | ||
configurable through a readiness probe. If the Pod doesn't have a readiness | ||
probe configured, the `Ready` condition is equivalent to the `Running` phase. | ||
In other words, `Job.status.active` provides as the same behavior as | ||
|
||
In other words, `Job.status.ready` provides as the same behavior as | ||
`Job.status.running` with the advantage of it being configurable. | ||
|
||
- We considered exploring different batch periods for regular pod updates versus | ||
finished pod updates, so we can do less pod readiness updates without | ||
compromising how fast we can declare a job finished. | ||
|
||
However, the feature has been on for a long time already the there were no | ||
bugs or requests raised around the choice of batch period. Moreover, the | ||
introduced batch period was considered an important element of the Job | ||
controller, and is now not guarded by the feature gate since the | ||
[PR#118615](https://github.com/kubernetes/kubernetes/pull/118615) which is | ||
already released in 1.28. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to comments by @jeremyrickard in:
#4237 (comment)
I have some more:
Can you link them?
Do you have any rate-limitting (e.g. we don't update a job more than once per second or sth like that)?
If so - can you please describe. If not, we definitely should add something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comments @wojtek-t . There are unit and integration tests, now linked.
As for the rate-limiting we have a mechanism, batching all Job reconciliations. The
syncJob
invocations are batched by time of 1s, and deduplicated as we use theAddAfter
to queue the syncs: https://github.com/kubernetes/kubernetes/blob/e8abe1af8dcb36f65ef7aa7135d4664b3db90e89/pkg/controller/job/job_controller.go#L560.