Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Graduate KEP-2879 PodsReady to stable #4237

Merged
merged 4 commits into from
Oct 3, 2023

Conversation

mimowo
Copy link
Contributor

@mimowo mimowo commented Sep 25, 2023

  • One-line PR description:
  • Other comments:

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Sep 25, 2023
@k8s-ci-robot k8s-ci-robot requested a review from kow3ns September 25, 2023 08:28
@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Sep 25, 2023
@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 25, 2023
@mimowo
Copy link
Contributor Author

mimowo commented Sep 25, 2023

/assign @alculquicondor

@kannon92
Copy link
Contributor

Is this still WIP?

@mimowo mimowo changed the title WIP: Graduate KEP-2879 PodsReady to stable Graduate KEP-2879 PodsReady to stable Sep 25, 2023
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 25, 2023
@mimowo
Copy link
Contributor Author

mimowo commented Sep 25, 2023

Is this still WIP?

Just removed the prefix, ready for review :)

`Job.status.running` with the advantage of it being configurable.

- We considered explore different batch periods for regular pod updates versus
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section reads a bit weird. I can't figure out the right way to phrase it though..

So you are saying:

We considered testing different batch periods but decided that since there was no bugs or requests raised, we are going forward with the defaults?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I suggest to keep it simple. Proving that some configuration of delays is better than another in general will be very hard, I think results will depend on various factors. I think configurable delays might be something to consider in the future as a dedicated KEP, if needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it seems to me that changing the delays would help the KEP in beta for another release to test how it works after the change. wdyt @alculquicondor @soltysh ?

@kannon92
Copy link
Contributor

Very small nits.
/lgtm

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Sep 25, 2023
Copy link
Member

@alculquicondor alculquicondor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm


#### Deprecation

N/A
In GA+1 release:
- Job controller ignores the feature gate
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be done in the GA version.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I just wasn't sure what is the policy, if any.

If we ignore, then we can skip using the LockToDefault. I've noticed though it is used by a number of feature gates in https://github.com/kubernetes/kubernetes/blob/master/pkg/features/kube_features.go.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's still useful to LockToDefault for anyone looking at the code.
Not sure if feature gates values are logged, but if they do, it would also be useful in that case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aldo is right, at GA we block the feature gate to always on, in GA+2 we drop the feature from the code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, IIUC, in GA we use LockToDefault, but keep the ifs in code. GA+1 we do nothing, then GA+2 we delete both the declaration and the ifs. @soltysh for confirmation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the practice, it ensures that if we need to revert GA graduation we have all in place.

Copy link
Contributor Author

@mimowo mimowo Sep 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so one more small change - I have moved Job controller ignores the feature gate from GA+1 to GA+2. PTAL

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 27, 2023
@alculquicondor
Copy link
Member

/assign @soltysh

Copy link
Contributor

@soltysh soltysh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nits, otherwise this is good as is

keps/sig-apps/2879-ready-pods-job-status/README.md Outdated Show resolved Hide resolved

#### Deprecation

N/A
In GA+1 release:
- Job controller ignores the feature gate
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aldo is right, at GA we block the feature gate to always on, in GA+2 we drop the feature from the code.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 29, 2023
Copy link
Contributor

@soltysh soltysh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve
from sig-apps pov

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 29, 2023
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 29, 2023
@jeremyrickard
Copy link
Contributor

Hello 👋 I'm giving this a PRR review. It looks like you mostly filled this out for beta and it isn't changed here, and it mostly looks good but I have a few questions:

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

This spells out a manual test that was planned. I assume this was manually tested during beta? Could you verify that this worked sufficiently before this becomes enabled by default? Were there any unusual issues or errors during this testing?

What are other known failure modes?
  • When the cluster has apiservers with skewed versions, the Job.status.ready
    might remain zero.

Is it possible for the controller to not update this status when a cluster isn't skewed? If so, how would an operator determine this is the case?

You mentioned a metric in the SLI section: 'job_sync_total, but it doesn't seem to be referenced anywhere else. Is that the same as job_sync_num` that is indicated in the rollback section (i..e when there is a reduction in that metric)? Is any reduction in that a key to rollback or is there a threshold that would be informative?

You're also missing one question from the scalability section that needs to be answered: https://github.com/kubernetes/enhancements/blob/master/keps/NNNN-kep-template/README.md#can-enabling--using-this-feature-result-in-resource-exhaustion-of-some-node-resources-pids-sockets-inodes-etc

approver: "@wojtek-t"
stable:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to comments by @jeremyrickard in:
#4237 (comment)

I have some more:

Are there any tests for feature enablement/disablement?

Can you link them?

  1. High-level question : the additional load on apiserver is potentially concerning [in GA we will no longer be able to disable the feature].
    Do you have any rate-limitting (e.g. we don't update a job more than once per second or sth like that)?

If so - can you please describe. If not, we definitely should add something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments @wojtek-t . There are unit and integration tests, now linked.

As for the rate-limiting we have a mechanism, batching all Job reconciliations. The syncJob invocations are batched by time of 1s, and deduplicated as we use the AddAfter to queue the syncs: https://github.com/kubernetes/kubernetes/blob/e8abe1af8dcb36f65ef7aa7135d4664b3db90e89/pkg/controller/job/job_controller.go#L560.

@wojtek-t wojtek-t self-assigned this Oct 2, 2023
@mimowo mimowo force-pushed the pods-ready-ga branch 2 times, most recently from 34d8b43 to 624d560 Compare October 2, 2023 08:43
@mimowo
Copy link
Contributor Author

mimowo commented Oct 2, 2023

@jeremyrickard thanks for reviewing! I have addressed the comments. See responses inlined.


###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

A manual test will be performed, as follows:
A manual test on Beta was performed, as follows:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This testing scenario is updated. One slight change compared to the scenario was point Verify that ready field in Job A is not lost, but also not updated.. In fact the Job's ready field is cleaned up. This is because after restart of the downgraded Job controller, all jobs are synced, due to the ADDED event.

@@ -259,7 +262,7 @@ the controller doesn't create new Pods or tracks finishing Pods.
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

- [x] Metrics
- Metric name: `job_sync_duration_seconds`, `job_sync_total`.
- Metric name: `job_sync_duration_seconds`, `job_syncs_total`.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -314,8 +321,7 @@ No change from existing behavior of the Job controller.

###### What are other known failure modes?

- When the cluster has apiservers with skewed versions, the `Job.status.ready`
might remain zero.
No.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not actually aware of any failure modes currently. I'm also not sure why the field would be zero if the API server version is skewed. I assume, for older API servers, which don't know the field we would have nil. @alculquicondor for insights.

@@ -210,7 +211,9 @@ The Job controller will start populating the field again.

###### Are there any tests for feature enablement/disablement?

Yes, there are tests at unit and [integration] level.
Yes, there are unit tests (see [link](https://github.com/kubernetes/kubernetes/blob/e8abe1af8dcb36f65ef7aa7135d4664b3db90e89/pkg/controller/job/job_controller_test.go#L236)).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are nice feature tests, but (please correct me if I'm wrong), these aren't enablement/disablement tests.

Enablement/disablement tests are tests that effectively change the FG in the middle of the test to see what happens if I enable or disable the feature.

Do you have any tests like these?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we have (also could not find). I've only done manual testing of reenabling as described here "Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?". FYI @alculquicondor

I can add it to requirements, but I'm not sure the test will work on GA as we plan to lock the feature gate.

So, the options I see:

  1. resort to manual testing
  2. add the test, but remove in the same release if it blocks locking of the feature gate (using LockToDefault)
  3. add the test, and delay graduation to GA if the test blocks locking of the feature gate
    WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's too late now, exactly because what you wrote (the tets won't work after locking FG).

Let's just update the text here to something:
We have feature tests [links to what you linked here] and due to omission we graduated to beta without enablement/disablements tests with just manual test run. With graduation to stable it's too late to add these tests so we're sticking with just manual tests (see below).

Copy link
Contributor Author

@mimowo mimowo Oct 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, added a note on this along the lines.

@@ -281,7 +284,7 @@ No.

Estimated throughput: at most one API call for each Job Pod reaching Ready
condition.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please describe the rate-limitting mechanism (that you mentioned in the comment) here - it's a very important mitigation that has to be mentioned here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added description, let me know if this is enough.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good - thanks.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 2, 2023
Copy link
Member

@wojtek-t wojtek-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last comment from me - other than that LGTM.

@@ -210,7 +211,9 @@ The Job controller will start populating the field again.

###### Are there any tests for feature enablement/disablement?

Yes, there are tests at unit and [integration] level.
Yes, there are unit tests (see [link](https://github.com/kubernetes/kubernetes/blob/e8abe1af8dcb36f65ef7aa7135d4664b3db90e89/pkg/controller/job/job_controller_test.go#L236)).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's too late now, exactly because what you wrote (the tets won't work after locking FG).

Let's just update the text here to something:
We have feature tests [links to what you linked here] and due to omission we graduated to beta without enablement/disablements tests with just manual test run. With graduation to stable it's too late to add these tests so we're sticking with just manual tests (see below).

@@ -281,7 +284,7 @@ No.

Estimated throughput: at most one API call for each Job Pod reaching Ready
condition.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good - thanks.

@wojtek-t
Copy link
Member

wojtek-t commented Oct 3, 2023

/lgtm
/approve PRR

Thanks!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 3, 2023
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mimowo, soltysh, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 3, 2023
@k8s-ci-robot k8s-ci-robot merged commit 7fabb16 into kubernetes:master Oct 3, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.29 milestone Oct 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/apps Categorizes an issue or PR as relevant to SIG Apps. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

7 participants