KEP-2879: Add count of ready Pods in Job status #2880
(force-pushed from 3516e4a to d114be1)
/label api-review
(force-pushed from 9bca8ec to afbd037)
(force-pushed from afbd037 to 0de12d1)
@gaocegege I would appreciate your feedback
Thanks for the proposal. I think it is helpful.
when the Pod doesn't define a readiness probe.
LGTM. Thanks for the enhancement.
/assign
API lgtm if you can not have the first / second version difference.
(force-pushed from b13d1c9 to 2684cc5)
(force-pushed from 858e302 to a38669a)
Mostly nits.
/lgtm
/approve
from sig-apps perspective
latest-milestone: "v1.23"
milestone:
  beta: "v1.23"
1.24
Oops. Fixed
(force-pushed from a38669a to 14358c4)
- The job controller is updating other status fields.
- The number of ready Pods equals `Job.spec.parallelism`.
- The increase of ready Pods is greater than or equal to 10% of
  `Job.spec.parallelism`.
Let's say that because of cluster capacity, 100% of my jobs will never have a place to run. But let's say 99% does.
With the policies above it may happen that we update ready to 90%, but we will never change that to 99%. So the system isn't eventually consistent, which I think is problematic.
I think that you need another rule, i.e. when something changes, it will be applied within X seconds/minutes (i.e. we batch updates for such a period).
[FWIW - such logic is super simple to implement, e.g. https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/endpoint/endpoints_controller.go#L224 ]
True, being eventually consistent should be a requirement.
Could the solution for the job controller be the same? We would delay/accumulate any sync coming from Pod creation/updates/deletions. This might actually be good for the overall performance of the controller. The delay for endpoint slices is configurable. Should we do the same? I'm proposing a 100ms window otherwise.
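To illustrate the batching idea being discussed, here is a minimal Go sketch. It is not the actual job controller code; the type, the helper name, and the 500ms window are assumptions for the example. It only shows the `AddAfter`-based re-queueing that collapses multiple Pod events into a single Job sync:

```go
package jobcontroller

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

// Hypothetical batching window; the discussion below settles on 500ms-1s.
const podUpdateBatchPeriod = 500 * time.Millisecond

type jobController struct {
	queue workqueue.RateLimitingInterface
}

// enqueueJobAfterPodUpdate would be called from the Pod event handlers.
// Instead of syncing the Job immediately, the key is re-queued after the
// batching period, so several ready/unready transitions observed within
// that window result in a single status update.
func (c *jobController) enqueueJobAfterPodUpdate(jobKey string) {
	c.queue.AddAfter(jobKey, podUpdateBatchPeriod)
}
```

This mirrors the delayed-trigger pattern of the endpoints controller linked above, with the delay as a tunable constant.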
###### Are there any tests for feature enablement/disablement?

Yes, at unit and integration level.
I'm assuming "they will be added", right?
Yes, changed wording.
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

The 99th percentile of Job status updates below 1s, when the controller doesn't
Are you talking about API calls or about the processing logic in the controller? It's not clear to me.
The E2E latency of a sync. Reworded. Maybe it should be 2s, because the API call alone is 1s.
(force-pushed from aa9e6fe to f86af30)
### Risks and Mitigations

- An increase in Job status updates. To mitigate this, the job controller holds
  the Pod updates that happen in 100ms before syncing a Job.
100ms is negligible imho;
I would use at least 1s as a batching period
How about I leave it open until I have the integration tests to do some experiments?
1s starts to sound a bit too long considering that we have to hold any Pod updates. See updated KEP
(force-pushed from f86af30 to 24e1c4c)
@@ -0,0 +1,3 @@
kep-number: 2879
alpha:
beta
beta - yes, but we need to make sure that this is properly linked with #2307, which it expands.
make sure to add that ref in the kep.yaml, pls.
I forgot to update the description: I'm no longer proposing to start at beta.
I don't see this KEP as an expansion of #2307.
Yeah - I think this should be alpha.
- An increase in Job status updates. To mitigate this, the job controller holds
  the Pod updates that happen in X ms before syncing a Job. X will be determined
  from experiments on integration tests, but we expect it to be between 100ms
I think a 100ms delay isn't meaningful as it's way below the SLO for an API call.
I personally recommend not even considering anything lower than 0.5s.
TBH, in this particular case I would say that even a couple of seconds might be fine, but I can imagine counterarguments too, so let's maybe stick to a 500ms-1s interval for now.
WDYT?
Sounds good. Updated to 500ms-1s.
(force-pushed from 24e1c4c to 5ee7b73)
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: alculquicondor, soltysh, wojtek-t. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
## Implementation History

- 2021-08-19: Proposed KEP starting in beta status.
Shouldn't this be marked as starting at Alpha status? @alculquicondor
It should. I'll fix it in the next update.
Ref #2879
Proposed the addition of field `Job.status.ready`. Includes PRR.
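For reference, a rough sketch of what the proposal adds to the Job status, based on the KEP text rather than the final generated API types (existing fields are elided and the doc comment wording is illustrative):

```go
// JobStatus as extended by this KEP (sketch only).
type JobStatus struct {
	// ... existing fields such as Active, Succeeded and Failed ...

	// Ready is the number of Pods owned by the Job that have a Ready
	// condition. Populated only when the corresponding feature gate is
	// enabled.
	// +optional
	Ready *int32 `json:"ready,omitempty"`
}
```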