
Graduate KEP-2879 PodsReady to stable #4237

Merged 4 commits on Oct 3, 2023.
4 changes: 3 additions & 1 deletion keps/prod-readiness/sig-apps/2879.yaml
@@ -2,4 +2,6 @@ kep-number: 2879
alpha:
approver: "@wojtek-t"
beta:
approver: "@wojtek-t"
approver: "@wojtek-t"
stable:
Member (@wojtek-t):
In addition to comments by @jeremyrickard in:
#4237 (comment)

I have some more:

Are there any tests for feature enablement/disablement?

Can you link them?

  1. High-level question: the additional load on the apiserver is potentially concerning [in GA we will no longer be able to disable the feature].
     Do you have any rate-limiting (e.g. we don't update a Job more than once per second or something like that)?

If so - can you please describe. If not, we definitely should add something.

Contributor Author (@mimowo):

Thanks for the comments @wojtek-t. There are unit and integration tests, now linked.

As for the rate-limiting, we have a mechanism batching all Job reconciliations. The syncJob invocations are batched within a 1s window and deduplicated, as we use AddAfter to queue the syncs: https://github.com/kubernetes/kubernetes/blob/e8abe1af8dcb36f65ef7aa7135d4664b3db90e89/pkg/controller/job/job_controller.go#L560.
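The dedup-by-AddAfter batching described above can be sketched with a self-contained stand-in. The `batchQueue` type below is hypothetical, a simplified model of client-go's delaying workqueue, not the actual controller code:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// batchQueue is a simplified, hypothetical stand-in for client-go's
// delaying workqueue: re-adding a key that is already pending is a
// no-op, so many rapid events for one Job collapse into a single sync.
type batchQueue struct {
	mu      sync.Mutex
	pending map[string]bool
	out     chan string
}

func newBatchQueue() *batchQueue {
	return &batchQueue{pending: map[string]bool{}, out: make(chan string, 16)}
}

// AddAfter schedules key for delivery after delay, deduplicating keys
// that are already waiting (mirroring the 1s syncJob batching).
func (q *batchQueue) AddAfter(key string, delay time.Duration) {
	q.mu.Lock()
	if q.pending[key] {
		q.mu.Unlock()
		return // already scheduled; this event joins the pending batch
	}
	q.pending[key] = true
	q.mu.Unlock()

	time.AfterFunc(delay, func() {
		q.mu.Lock()
		delete(q.pending, key)
		q.mu.Unlock()
		q.out <- key
	})
}

// runScenario simulates ten pods of one Job turning Ready in quick
// succession and counts how many syncs are actually triggered.
func runScenario() (string, int) {
	q := newBatchQueue()
	for i := 0; i < 10; i++ {
		q.AddAfter("default/job-a", 50*time.Millisecond)
	}
	key := <-q.out
	syncs := 1
	select {
	case <-q.out:
		syncs++
	case <-time.After(200 * time.Millisecond):
	}
	return key, syncs
}

func main() {
	key, syncs := runScenario()
	fmt.Println(key, syncs) // default/job-a 1
}
```

With the real workqueue the controller instead calls AddAfter with its batch period on the Job key; the effect is the same: apiserver traffic is bounded to roughly one status update per Job per batch window, no matter how many pods become Ready in that window.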

approver: "@wojtek-t"
76 changes: 47 additions & 29 deletions keps/sig-apps/2879-ready-pods-job-status/README.md
@@ -38,17 +38,17 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [x] (R) KEP approvers have approved the KEP status as `implementable`
- [x] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- [ ] e2e Tests for all Beta API Operations (endpoints)
- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- [x] e2e Tests for all Beta API Operations (endpoints)
- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
- [ ] (R) Graduation criteria is in place
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [x] (R) Graduation criteria is in place
- [x] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [x] (R) Production readiness review completed
- [x] (R) Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [x] "Implementation History" section is up-to-date for milestone
- [x] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
- [x] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://git.k8s.io/enhancements
@@ -125,6 +125,7 @@ pods that have the `Ready` condition.
- Count of ready pods.
- Feature gate disablement.
- Verify passing existing E2E and conformance tests for Job.
- Added e2e test for the count of ready pods.

### Graduation Criteria

@@ -157,14 +158,14 @@ pods that have the `Ready` condition.
#### GA

- Every bug report is fixed.
- Explore setting different batch periods for regular pod updates versus
finished pod updates, so we can do less pod readiness updates without
compromising how fast we can declare a job finished.
- The job controller ignores the feature gate.
- E2e test for the count of ready pods.
- Lock the feature-gate and document its deprecation.

#### Deprecation

N/A
In GA+2 release:
- Remove the feature gate definition
- Job controller ignores the feature gate

### Upgrade / Downgrade Strategy

@@ -210,7 +211,9 @@ The Job controller will start populating the field again.

###### Are there any tests for feature enablement/disablement?

Yes, there are tests at unit and [integration] level.
Yes, there are unit tests (see [link](https://github.com/kubernetes/kubernetes/blob/e8abe1af8dcb36f65ef7aa7135d4664b3db90e89/pkg/controller/job/job_controller_test.go#L236)).
Member (@wojtek-t):
These are nice feature tests, but (please correct me if I'm wrong), these aren't enablement/disablement tests.

Enablement/disablement tests are tests that effectively change the FG in the middle of the test to see what happens if I enable or disable the feature.

Do you have any tests like these?
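For illustration, a gate-flip test of the kind described could look roughly like this. This is a self-contained sketch: `featureGates` and `jobReadyCount` are hypothetical stand-ins for `utilfeature.DefaultFeatureGate` and the controller's status logic, not the real Kubernetes APIs:

```go
package main

import "fmt"

// featureGates is a hypothetical stand-in for the component's feature
// gate registry (utilfeature.DefaultFeatureGate in Kubernetes).
var featureGates = map[string]bool{"JobReadyPods": true}

// jobReadyCount models the controller behavior under test: the ready
// count is only populated while the gate is enabled (a simplification).
func jobReadyCount(readyPods int32) *int32 {
	if !featureGates["JobReadyPods"] {
		return nil
	}
	return &readyPods
}

// enablementDisablementScenario flips the gate in the middle of the
// run, the way an enablement/disablement test would, and reports what
// the controller produced in each state.
func enablementDisablementScenario() (enabled, disabled *int32) {
	featureGates["JobReadyPods"] = true
	enabled = jobReadyCount(3)
	featureGates["JobReadyPods"] = false // flip the gate mid-test
	disabled = jobReadyCount(3)
	return enabled, disabled
}

func main() {
	on, off := enablementDisablementScenario()
	fmt.Println(on != nil && *on == 3, off == nil) // true true
}
```

The point of the pattern is that both behaviors are asserted in a single run, which is exactly what stops working once the gate is locked to its default.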

Contributor Author (@mimowo):

I don't think we have any (I also could not find one). I've only done manual testing of re-enabling, as described under "Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?". FYI @alculquicondor

I can add it to the requirements, but I'm not sure the test will work in GA, as we plan to lock the feature gate.

So, the options I see:

  1. resort to manual testing
  2. add the test, but remove it in the same release if it blocks locking of the feature gate (using LockToDefault)
  3. add the test, and delay graduation to GA if the test blocks locking of the feature gate

WDYT?

Member (@wojtek-t):

It's too late now, exactly because of what you wrote (the tests won't work after locking the FG).

Let's just update the text here to something like:
We have feature tests [links to what you linked here] and, due to an omission, we graduated to beta without enablement/disablement tests, with just a manual test run. With graduation to stable it's too late to add these tests, so we're sticking with just manual tests (see below).

Contributor Author (@mimowo, Oct 2, 2023):
Thanks, added a note on this along those lines.

There are also integration tests (see [link](https://github.com/kubernetes/kubernetes/blob/e8abe1af8dcb36f65ef7aa7135d4664b3db90e89/test/integration/job/job_test.go#L1364) and
[link](https://github.com/kubernetes/kubernetes/blob/e8abe1af8dcb36f65ef7aa7135d4664b3db90e89/test/integration/job/job_test.go#L1517)).

### Rollout, Upgrade and Rollback Planning

@@ -221,20 +224,20 @@ The field is only informative; it doesn't affect running workloads.
###### What specific metrics should inform a rollback?

- An increase in `job_sync_duration_seconds`.
- A reduction in `job_sync_num`.
- A reduction in `job_syncs_total`.

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

A manual test will be performed, as follows:
A manual test on Beta was performed, as follows:
Contributor Author (@mimowo):

This testing scenario is updated. One slight change compared to the scenario was the point "Verify that ready field in Job A is not lost, but also not updated.". In fact, the Job's ready field is cleaned up. This is because, after restart of the downgraded Job controller, all Jobs are synced due to the ADDED event.


1. Create a cluster in 1.23.
1. Upgrade to 1.24.
1. Create long running Job A, ensure that the ready field is populated.
1. Downgrade to 1.23.
1. Verify that ready field in Job A is not lost, but also not updated.
1. Create long running Job B, ensure that ready field is not populated.
1. Upgrade to 1.24.
1. Verify that Job A and B ready field is tracked again.
1. Create a cluster in 1.28 with the `JobReadyPods` disabled (`=false`).
2. Simulate upgrade by modifying control-plane manifests to enable `JobReadyPods`.
3. Create long running Job A, ensure that the ready field is populated.
4. Simulate downgrade by modifying control-plane manifests to disable `JobReadyPods`.
5. Verify that ready field in Job A is cleaned up shortly after the startup of the Job controller completes.
6. Create long running Job B, ensure that ready field is not populated.
7. Simulate upgrade by modifying control-plane manifests to enable `JobReadyPods`.
8. Verify that Job A and B ready field is tracked again shortly after the startup of the Job controller completes.

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

@@ -259,7 +262,7 @@ the controller doesn't create new Pods or track finishing Pods.
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

- [x] Metrics
- Metric name: `job_sync_duration_seconds`, `job_sync_total`.
- Metric name: `job_sync_duration_seconds`, `job_syncs_total`.
- Components exposing the metric: `kube-controller-manager`

###### Are there any missing metrics that would be useful to have to improve observability of this feature?
@@ -281,7 +284,7 @@ No.

Estimated throughput: at most one API call for each Job Pod reaching Ready
condition.

Member (@wojtek-t):
Please describe the rate-limiting mechanism (that you mentioned in the comment) here - it's a very important mitigation that has to be documented here.

Contributor Author (@mimowo):

Added a description; let me know if this is enough.

Member (@wojtek-t):
Looks good - thanks.

Originating component: job-controller

###### Will enabling / using this feature result in introducing new API types?
@@ -306,6 +309,10 @@ No.

No.

###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

### Troubleshooting

###### How does this feature react if the API server and/or etcd is unavailable?
@@ -314,8 +321,7 @@ No change from existing behavior of the Job controller.

###### What are other known failure modes?

- When the cluster has apiservers with skewed versions, the `Job.status.ready`
might remain zero.
No.
Contributor Author (@mimowo):
I'm not actually aware of any failure modes currently. I'm also not sure why the field would be zero if the API server version is skewed. I assume that, for older API servers which don't know the field, we would have nil. @alculquicondor for insights.


###### What steps should be taken if SLOs are not being met to determine the problem?

@@ -332,6 +338,7 @@ No change from existing behavior of the Job controller.

- 2021-08-19: Proposed KEP starting in alpha status, including full PRR questionnaire.
- 2022-01-05: Proposed graduation to beta.
- 2022-03-20: Merged [PR#107476](https://github.com/kubernetes/kubernetes/pull/107476) with the beta implementation

## Drawbacks

@@ -346,6 +353,17 @@ Pod created.
to accept connections. On the other hand, the `Ready` condition is
configurable through a readiness probe. If the Pod doesn't have a readiness
probe configured, the `Ready` condition is equivalent to the `Running` phase.
In other words, `Job.status.active` provides as the same behavior as

In other words, `Job.status.ready` provides the same behavior as
`Job.status.running`, with the advantage of being configurable.

- We considered exploring different batch periods for regular pod updates versus
finished pod updates, so we can do fewer pod readiness updates without
compromising how fast we can declare a job finished.

However, the feature has been enabled for a long time already and there were no
bugs or requests raised around the choice of batch period. Moreover, the
introduced batch period was considered an important element of the Job
controller, and is no longer guarded by the feature gate since
[PR#118615](https://github.com/kubernetes/kubernetes/pull/118615), which is
already released in 1.28.
7 changes: 4 additions & 3 deletions keps/sig-apps/2879-ready-pods-job-status/kep.yaml
@@ -15,13 +15,14 @@ approvers:
see-also:
replaces:

stage: beta
stage: stable

latest-milestone: "v1.24"
latest-milestone: "v1.29"

milestone:
alpha: "v1.23"
beta: "v1.24"
stable: "v1.29"

feature-gates:
- name: JobReadyPods
@@ -32,4 +33,4 @@ disable-supported: true

metrics:
- job_sync_duration_seconds
- job_sync_total
- job_syncs_total