Graduate KEP-2879 PodsReady to stable #4237

mimowo · 2023-09-25T08:28:12Z

One-line PR description:

Issue link: Track Ready Pods in Job status #2879

Other comments:

mimowo · 2023-09-25T08:29:16Z

/assign @alculquicondor

kannon92 · 2023-09-25T14:50:16Z

Is this still WIP?

mimowo · 2023-09-25T14:53:36Z

Is this still WIP?

Just removed the prefix, ready for review :)

kannon92 · 2023-09-25T14:57:16Z

keps/sig-apps/2879-ready-pods-job-status/README.md

  `Job.status.running` with the advantage of it being configurable.
+
+- We considered explore different batch periods for regular pod updates versus


This section reads a bit weird. I can't figure out the right way to phrase it though..

So you are saying:

We considered testing different batch periods but decided that since there was no bugs or requests raised, we are going forward with the defaults?

Yes, I suggest to keep it simple. Proving that some configuration of delays is better than another in general will be very hard, I think results will depend on various factors. I think configurable delays might be something to consider in the future as a dedicated KEP, if needed.

Also, it seems to me that changing the delays would help the KEP in beta for another release to test how it works after the change. wdyt @alculquicondor @soltysh ?

keps/sig-apps/2879-ready-pods-job-status/README.md

kannon92 · 2023-09-25T15:01:04Z

Very small nits.
/lgtm

keps/sig-apps/2879-ready-pods-job-status/README.md

alculquicondor

/lgtm

keps/sig-apps/2879-ready-pods-job-status/README.md

alculquicondor · 2023-09-27T13:40:48Z

keps/sig-apps/2879-ready-pods-job-status/README.md


 #### Deprecation

-N/A
+In GA+1 release:
+- Job controller ignores the feature gate


This can be done in the GA version.

Yeah, I just wasn't sure what is the policy, if any.

If we ignore, then we can skip using the LockToDefault. I've noticed though it is used by a number of feature gates in https://github.com/kubernetes/kubernetes/blob/master/pkg/features/kube_features.go.

It's still useful to LockToDefault for anyone looking at the code.
Not sure if feature gates values are logged, but if they do, it would also be useful in that case.

Aldo is right, at GA we block the feature gate to always on, in GA+2 we drop the feature from the code.

So, IIUC, in GA we use LockToDefault, but keep the ifs in code. GA+1 we do nothing, then GA+2 we delete both the declaration and the ifs. @soltysh for confirmation.

That's the practice, it ensures that if we need to revert GA graduation we have all in place.

Ok, so one more small change - I have moved Job controller ignores the feature gate from GA+1 to GA+2. PTAL

alculquicondor · 2023-09-27T13:41:52Z

/assign @soltysh

soltysh

Minor nits, otherwise this is good as is

keps/sig-apps/2879-ready-pods-job-status/README.md

soltysh · 2023-09-29T12:02:18Z

keps/sig-apps/2879-ready-pods-job-status/README.md


 #### Deprecation

-N/A
+In GA+1 release:
+- Job controller ignores the feature gate


Aldo is right, at GA we block the feature gate to always on, in GA+2 we drop the feature from the code.

soltysh

/lgtm
/approve
from sig-apps pov

jeremyrickard · 2023-10-02T00:42:32Z

Hello 👋 I'm giving this a PRR review. It looks like you mostly filled this out for beta and it isn't changed here, and it mostly looks good but I have a few questions:

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

This spells out a manual test that was planned. I assume this was manually tested during beta? Could you verify that this worked sufficiently before this becomes enabled by default? Were there any unusual issues or errors during this testing?

What are other known failure modes?

When the cluster has apiservers with skewed versions, the Job.status.ready
might remain zero.

Is it possible for the controller to not update this status when a cluster isn't skewed? If so, how would an operator determine this is the case?

You mentioned a metric in the SLI section: 'job_sync_total, but it doesn't seem to be referenced anywhere else. Is that the same as job_sync_num` that is indicated in the rollback section (i..e when there is a reduction in that metric)? Is any reduction in that a key to rollback or is there a threshold that would be informative?

You're also missing one question from the scalability section that needs to be answered: https://github.com/kubernetes/enhancements/blob/master/keps/NNNN-kep-template/README.md#can-enabling--using-this-feature-result-in-resource-exhaustion-of-some-node-resources-pids-sockets-inodes-etc

wojtek-t · 2023-10-02T07:49:18Z

keps/prod-readiness/sig-apps/2879.yaml

+  approver: "@wojtek-t"
+stable:


In addition to comments by @jeremyrickard in:
#4237 (comment)

I have some more:

Are there any tests for feature enablement/disablement?

Can you link them?

High-level question : the additional load on apiserver is potentially concerning [in GA we will no longer be able to disable the feature].
Do you have any rate-limitting (e.g. we don't update a job more than once per second or sth like that)?

If so - can you please describe. If not, we definitely should add something.

Thanks for the comments @wojtek-t . There are unit and integration tests, now linked.

As for the rate-limiting we have a mechanism, batching all Job reconciliations. The syncJob invocations are batched by time of 1s, and deduplicated as we use the AddAfter to queue the syncs: https://github.com/kubernetes/kubernetes/blob/e8abe1af8dcb36f65ef7aa7135d4664b3db90e89/pkg/controller/job/job_controller.go#L560.

mimowo · 2023-10-02T08:52:33Z

@jeremyrickard thanks for reviewing! I have addressed the comments. See responses inlined.

mimowo · 2023-10-02T08:54:07Z

keps/sig-apps/2879-ready-pods-job-status/README.md


 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

-A manual test will be performed, as follows:
+A manual test on Beta was performed, as follows:


This testing scenario is updated. One slight change compared to the scenario was point Verify that ready field in Job A is not lost, but also not updated.. In fact the Job's ready field is cleaned up. This is because after restart of the downgraded Job controller, all jobs are synced, due to the ADDED event.

mimowo · 2023-10-02T08:54:59Z

keps/sig-apps/2879-ready-pods-job-status/README.md

@@ -259,7 +262,7 @@ the controller doesn't create new Pods or tracks finishing Pods.
 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

 - [x] Metrics
-  - Metric name: `job_sync_duration_seconds`, `job_sync_total`.
+  - Metric name: `job_sync_duration_seconds`, `job_syncs_total`.


Adjusted the metric name to the actual value: https://github.com/kubernetes/kubernetes/blob/e8abe1af8dcb36f65ef7aa7135d4664b3db90e89/pkg/controller/job/metrics/metrics.go#L52

mimowo · 2023-10-02T08:58:45Z

keps/sig-apps/2879-ready-pods-job-status/README.md

@@ -314,8 +321,7 @@ No change from existing behavior of the Job controller.

 ###### What are other known failure modes?

- When the cluster has apiservers with skewed versions, the `Job.status.ready`
-  might remain zero.
+No.


I'm not actually aware of any failure modes currently. I'm also not sure why the field would be zero if the API server version is skewed. I assume, for older API servers, which don't know the field we would have nil. @alculquicondor for insights.

keps/sig-apps/2879-ready-pods-job-status/README.md

wojtek-t · 2023-10-02T12:35:36Z

keps/sig-apps/2879-ready-pods-job-status/README.md

@@ -210,7 +211,9 @@ The Job controller will start populating the field again.

 ###### Are there any tests for feature enablement/disablement?

-Yes, there are tests at unit and [integration] level.
+Yes, there are unit tests (see [link](https://github.com/kubernetes/kubernetes/blob/e8abe1af8dcb36f65ef7aa7135d4664b3db90e89/pkg/controller/job/job_controller_test.go#L236)).


These are nice feature tests, but (please correct me if I'm wrong), these aren't enablement/disablement tests.

Enablement/disablement tests are tests that effectively change the FG in the middle of the test to see what happens if I enable or disable the feature.

Do you have any tests like these?

I don't think we have (also could not find). I've only done manual testing of reenabling as described here "Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?". FYI @alculquicondor

I can add it to requirements, but I'm not sure the test will work on GA as we plan to lock the feature gate.

So, the options I see:

resort to manual testing

add the test, but remove in the same release if it blocks locking of the feature gate (using LockToDefault)

add the test, and delay graduation to GA if the test blocks locking of the feature gate
WDYT?

It's too late now, exactly because what you wrote (the tets won't work after locking FG).

Let's just update the text here to something:
We have feature tests [links to what you linked here] and due to omission we graduated to beta without enablement/disablements tests with just manual test run. With graduation to stable it's too late to add these tests so we're sticking with just manual tests (see below).

Thanks, added a note on this along the lines.

wojtek-t · 2023-10-02T12:37:09Z

keps/sig-apps/2879-ready-pods-job-status/README.md

@@ -281,7 +284,7 @@ No.

  Estimated throughput: at most one API call for each Job Pod reaching Ready
  condition.
-  
+


Please describe the rate-limitting mechanism (that you mentioned in the comment) here - it's a very important mitigation that has to be mentioned here.

Added description, let me know if this is enough.

Looks good - thanks.

wojtek-t

One last comment from me - other than that LGTM.

wojtek-t · 2023-10-02T13:50:23Z

keps/sig-apps/2879-ready-pods-job-status/README.md

@@ -210,7 +211,9 @@ The Job controller will start populating the field again.

 ###### Are there any tests for feature enablement/disablement?

-Yes, there are tests at unit and [integration] level.
+Yes, there are unit tests (see [link](https://github.com/kubernetes/kubernetes/blob/e8abe1af8dcb36f65ef7aa7135d4664b3db90e89/pkg/controller/job/job_controller_test.go#L236)).


It's too late now, exactly because what you wrote (the tets won't work after locking FG).

Let's just update the text here to something:
We have feature tests [links to what you linked here] and due to omission we graduated to beta without enablement/disablements tests with just manual test run. With graduation to stable it's too late to add these tests so we're sticking with just manual tests (see below).

wojtek-t · 2023-10-02T13:50:53Z

keps/sig-apps/2879-ready-pods-job-status/README.md

@@ -281,7 +284,7 @@ No.

  Estimated throughput: at most one API call for each Job Pod reaching Ready
  condition.
-  
+


Looks good - thanks.

wojtek-t · 2023-10-03T07:30:55Z

/lgtm
/approve PRR

Thanks!

k8s-ci-robot · 2023-10-03T07:31:04Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mimowo, soltysh, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~keps/prod-readiness/OWNERS~~ [wojtek-t]
~~keps/sig-apps/OWNERS~~ [soltysh]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Sep 25, 2023

k8s-ci-robot requested a review from kow3ns September 25, 2023 08:28

k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Sep 25, 2023

k8s-ci-robot requested a review from soltysh September 25, 2023 08:28

k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 25, 2023

k8s-ci-robot assigned alculquicondor Sep 25, 2023

mimowo changed the title ~~WIP: Graduate KEP-2879 PodsReady to stable~~ Graduate KEP-2879 PodsReady to stable Sep 25, 2023

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 25, 2023

kannon92 reviewed Sep 25, 2023

View reviewed changes

keps/sig-apps/2879-ready-pods-job-status/README.md Outdated Show resolved Hide resolved

k8s-ci-robot assigned kannon92 Sep 25, 2023

k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Sep 25, 2023

mimowo commented Sep 27, 2023

View reviewed changes

keps/sig-apps/2879-ready-pods-job-status/README.md Show resolved Hide resolved

alculquicondor reviewed Sep 27, 2023

View reviewed changes

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 27, 2023

k8s-ci-robot assigned soltysh Sep 27, 2023

soltysh mentioned this pull request Sep 29, 2023

Track Ready Pods in Job status #2879

Closed

14 tasks

soltysh reviewed Sep 29, 2023

View reviewed changes

mimowo force-pushed the pods-ready-ga branch from 739b663 to 459233a Compare September 29, 2023 12:29

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 29, 2023

soltysh approved these changes Sep 29, 2023

View reviewed changes

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 29, 2023

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 29, 2023

mimowo force-pushed the pods-ready-ga branch from 658c0ef to f4de325 Compare September 29, 2023 17:07

Graduate KEP-2879 PodsReady to stable

0e13432

wojtek-t reviewed Oct 2, 2023

View reviewed changes

wojtek-t self-assigned this Oct 2, 2023

mimowo force-pushed the pods-ready-ga branch 2 times, most recently from 34d8b43 to 624d560 Compare October 2, 2023 08:43

PRR Remarks

fc9b1c8

mimowo force-pushed the pods-ready-ga branch from 624d560 to fc9b1c8 Compare October 2, 2023 08:51

mimowo commented Oct 2, 2023

View reviewed changes

keps/sig-apps/2879-ready-pods-job-status/README.md Show resolved Hide resolved

wojtek-t reviewed Oct 2, 2023

View reviewed changes

mimowo force-pushed the pods-ready-ga branch from 06b5b95 to d0a45b3 Compare October 2, 2023 13:06

PRR Remarks - batching mechanism

f4a0ad5

mimowo force-pushed the pods-ready-ga branch from d0a45b3 to f4a0ad5 Compare October 2, 2023 13:09

k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 2, 2023

wojtek-t reviewed Oct 2, 2023

View reviewed changes

manual tests

bed91e7

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 3, 2023

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 3, 2023

k8s-ci-robot merged commit 7fabb16 into kubernetes:master Oct 3, 2023

k8s-ci-robot added this to the v1.29 milestone Oct 3, 2023

alculquicondor mentioned this pull request Oct 4, 2023

Mark KEP for JobReadyPods as implemented #4271

Merged

		`Job.status.running` with the advantage of it being configurable.

		- We considered explore different batch periods for regular pod updates versus

		@@ -281,7 +284,7 @@ No.

		Estimated throughput: at most one API call for each Job Pod reaching Ready
		condition.

Graduate KEP-2879 PodsReady to stable #4237

Graduate KEP-2879 PodsReady to stable #4237

Conversation

mimowo commented Sep 25, 2023

mimowo commented Sep 25, 2023

kannon92 commented Sep 25, 2023

mimowo commented Sep 25, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kannon92 commented Sep 25, 2023

alculquicondor left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mimowo Sep 29, 2023 • edited Loading

Choose a reason for hiding this comment

alculquicondor commented Sep 27, 2023

soltysh left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

soltysh left a comment

Choose a reason for hiding this comment

jeremyrickard commented Oct 2, 2023

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

What are other known failure modes?

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mimowo commented Oct 2, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mimowo Oct 2, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wojtek-t left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wojtek-t commented Oct 3, 2023

k8s-ci-robot commented Oct 3, 2023

mimowo Sep 29, 2023 •

edited

Loading

mimowo Oct 2, 2023 •

edited

Loading