Graduate KEP-2879 PodsReady to stable #4237
Conversation
mimowo commented on Sep 25, 2023
- One-line PR description:
- Issue link: Track Ready Pods in Job status #2879
- Other comments:
/assign @alculquicondor
Is this still WIP?
Just removed the prefix, ready for review :)
`Job.status.running` with the advantage of it being configurable.

- We considered exploring different batch periods for regular pod updates versus
This section reads a bit weird. I can't figure out the right way to phrase it, though.
So you are saying:
We considered testing different batch periods but decided that, since there were no bugs or requests raised, we are going forward with the defaults?
Yes, I suggest keeping it simple. Proving that some configuration of delays is better than another in general will be very hard; I think the results will depend on various factors. Configurable delays might be something to consider in the future as a dedicated KEP, if needed.
Also, it seems to me that changing the delays would mean keeping the KEP in beta for another release to test how it works after the change. wdyt @alculquicondor @soltysh ?
Very small nits.
/lgtm
#### Deprecation

-N/A
+In GA+1 release:
+- Job controller ignores the feature gate
This can be done in the GA version.
Yeah, I just wasn't sure what the policy is, if any.
If we ignore it, then we can skip using `LockToDefault`. I've noticed, though, that it is used by a number of feature gates in https://github.com/kubernetes/kubernetes/blob/master/pkg/features/kube_features.go.
It's still useful to use `LockToDefault` for anyone looking at the code.
Not sure if feature gate values are logged, but if they are, it would also be useful in that case.
Aldo is right: at GA we lock the feature gate to always-on; in GA+2 we drop the feature from the code.
So, IIUC, in GA we use `LockToDefault` but keep the `if`s in the code. In GA+1 we do nothing, then in GA+2 we delete both the declaration and the `if`s. @soltysh for confirmation.
That's the practice; it ensures that if we need to revert the GA graduation we have everything in place.
Ok, so one more small change - I have moved "Job controller ignores the feature gate" from GA+1 to GA+2. PTAL
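For reference, a minimal Go sketch of the lifecycle agreed on above, in the style of kube_features.go; the `JobReadyPods` gate name is real, but the snippet is illustrative rather than the exact upstream code:

```go
package features

import "k8s.io/component-base/featuregate"

// JobReadyPods tracks ready pods in Job status (KEP-2879).
const JobReadyPods featuregate.Feature = "JobReadyPods"

// At GA the gate is locked to its default, but the declaration and the
// guarded ifs in the controller stay until GA+2, so the graduation can
// still be reverted if needed.
var defaultKubernetesFeatureGates = map[featuregate.Feature]featuregate.FeatureSpec{
	JobReadyPods: {Default: true, PreRelease: featuregate.GA, LockToDefault: true},
}
```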
/assign @soltysh
Minor nits, otherwise this is good as is.
/lgtm
/approve
from sig-apps pov
Hello 👋 I'm giving this a PRR review. It looks like you mostly filled this out for beta and it isn't changed here. It mostly looks good, but I have a few questions:
This spells out a manual test that was planned. I assume this was manually tested during beta? Could you verify that this worked sufficiently before this becomes enabled by default? Were there any unusual issues or errors during this testing?
Is it possible for the controller to not update this status when a cluster isn't skewed? If so, how would an operator determine this is the case? You mentioned a metric in the SLI section: `job_sync_total`.
You're also missing one question from the scalability section that needs to be answered: https://github.com/kubernetes/enhancements/blob/master/keps/NNNN-kep-template/README.md#can-enabling--using-this-feature-result-in-resource-exhaustion-of-some-node-resources-pids-sockets-inodes-etc
approver: "@wojtek-t"
stable:
In addition to the comments by @jeremyrickard in:
#4237 (comment)
I have some more:
- Are there any tests for feature enablement/disablement? Can you link them?
- High-level question: the additional load on the apiserver is potentially concerning [in GA we will no longer be able to disable the feature]. Do you have any rate-limiting (e.g. we don't update a job more than once per second, or something like that)? If so, can you please describe it. If not, we definitely should add something.
Thanks for the comments @wojtek-t. There are unit and integration tests, now linked.
As for the rate-limiting, we have a mechanism batching all Job reconciliations. The `syncJob` invocations are batched with a 1s delay and deduplicated, as we use `AddAfter` to queue the syncs: https://github.com/kubernetes/kubernetes/blob/e8abe1af8dcb36f65ef7aa7135d4664b3db90e89/pkg/controller/job/job_controller.go#L560.
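To make the batching concrete, here is a small runnable Go sketch of the dedup-by-delay pattern (the key and batch period are illustrative; the real logic lives in the job controller linked above):

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// A delaying queue holds at most one pending copy of a key, so repeated
	// AddAfter calls within the batch window collapse into a single sync.
	q := workqueue.NewDelayingQueue()
	defer q.ShutDown()

	const batchPeriod = time.Second // the job controller batches syncs by 1s

	// Simulate a burst of pod updates for the same Job.
	for i := 0; i < 3; i++ {
		q.AddAfter("default/my-job", batchPeriod)
	}

	time.Sleep(2 * batchPeriod)
	fmt.Println("pending syncs:", q.Len()) // 1: the three adds were deduplicated
}
```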
@jeremyrickard thanks for reviewing! I have addressed the comments. See responses inline.
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

-A manual test will be performed, as follows:
+A manual test on Beta was performed, as follows:
This testing scenario is updated. One slight change compared to the scenario was the point "Verify that the ready field in Job A is not lost, but also not updated." In fact, the Job's ready field is cleaned up. This is because, after the restart of the downgraded Job controller, all Jobs are synced due to the ADDED event.
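A toy Go model of why that happens (names illustrative; the real logic is in `syncJob`): status.ready is recomputed on every sync, so the downgraded controller writes nil on the full resync instead of preserving the old value.

```go
package main

import "fmt"

// computeReady models the gate check in the controller's sync path: with the
// gate off, the field is set to nil, which clears any previously stored value.
func computeReady(gateEnabled bool, readyPods int32) *int32 {
	if !gateEnabled {
		return nil // downgraded controller: field cleared on resync
	}
	return &readyPods
}

func main() {
	fmt.Println(computeReady(false, 3)) // <nil> after downgrade
	fmt.Println(*computeReady(true, 3)) // 3 after upgrade
}
```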
@@ -259,7 +262,7 @@ the controller doesn't create new Pods or tracks finishing Pods.

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

- [x] Metrics
-  - Metric name: `job_sync_duration_seconds`, `job_sync_total`.
+  - Metric name: `job_sync_duration_seconds`, `job_syncs_total`.
Adjusted the metric name to the actual value: https://github.com/kubernetes/kubernetes/blob/e8abe1af8dcb36f65ef7aa7135d4664b3db90e89/pkg/controller/job/metrics/metrics.go#L52
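For context, the approximate shape of that counter using the component-base metrics library (the labels and help text are assumptions; the linked metrics.go is authoritative):

```go
package metrics

import "k8s.io/component-base/metrics"

// jobSyncsTotal approximates the declaration behind the job_syncs_total
// metric; the exact options and labels may differ in the linked file.
var jobSyncsTotal = metrics.NewCounterVec(
	&metrics.CounterOpts{
		Subsystem:      "job_controller",
		Name:           "job_syncs_total",
		Help:           "The number of job syncs.",
		StabilityLevel: metrics.ALPHA,
	},
	[]string{"completion_mode", "result", "action"},
)
```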
@@ -314,8 +321,7 @@ No change from existing behavior of the Job controller.

###### What are other known failure modes?

-- When the cluster has apiservers with skewed versions, the `Job.status.ready`
-  might remain zero.
+No.
I'm not actually aware of any failure modes currently. I'm also not sure why the field would be zero if the API server version is skewed. I assume that, for older API servers which don't know the field, we would have `nil`. @alculquicondor for insights.
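For reference, a paraphrased excerpt of the batch/v1 type: Ready is a pointer, which is why an API server that doesn't know the field would leave it nil rather than zero.

```go
// JobStatus excerpt (paraphrased from k8s.io/api/batch/v1).
type JobStatus struct {
	// ... other fields elided ...

	// Ready is the number of pods which have a Ready condition.
	// +optional
	Ready *int32 `json:"ready,omitempty"`
}
```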
@@ -210,7 +211,9 @@ The Job controller will start populating the field again.

###### Are there any tests for feature enablement/disablement?

-Yes, there are tests at unit and [integration] level.
+Yes, there are unit tests (see [link](https://github.com/kubernetes/kubernetes/blob/e8abe1af8dcb36f65ef7aa7135d4664b3db90e89/pkg/controller/job/job_controller_test.go#L236)).
These are nice feature tests, but (please correct me if I'm wrong) these aren't enablement/disablement tests.
Enablement/disablement tests are tests that effectively change the FG in the middle of the test to see what happens if I enable or disable the feature.
Do you have any tests like these?
I don't think we have any (I also could not find them). I've only done manual testing of re-enabling, as described under "Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?". FYI @alculquicondor
I can add it to the requirements, but I'm not sure the test will work on GA as we plan to lock the feature gate.
So, the options I see:
- resort to manual testing
- add the test, but remove it in the same release if it blocks locking of the feature gate (using `LockToDefault`)
- add the test, and delay graduation to GA if the test blocks locking of the feature gate

WDYT?
It's too late now, exactly because of what you wrote (the tests won't work after locking the FG).
Let's just update the text here to something like:
We have feature tests [links to what you linked here] and, due to an omission, we graduated to beta without enablement/disablement tests, with just a manual test run. With graduation to stable it's too late to add these tests, so we're sticking with just manual tests (see below).
Thanks, added a note along these lines.
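For illustration, the shape of the enablement/disablement test being discussed, as a hedged Go sketch (the helper's signature varies across releases, and such a test would stop working once the gate is locked at GA):

```go
package job

import (
	"testing"

	utilfeature "k8s.io/apiserver/pkg/util/feature"
	featuregatetesting "k8s.io/component-base/featuregate/testing"
	"k8s.io/kubernetes/pkg/features"
)

// TestReadyFieldEnablementDisablement flips the feature gate mid-test and
// checks that status.ready is only populated while the gate is on. The
// assertion body is elided; this is a sketch, not upstream code.
func TestReadyFieldEnablementDisablement(t *testing.T) {
	for _, enabled := range []bool{true, false} {
		featuregatetesting.SetFeatureGateDuringTest(t, utilfeature.DefaultFeatureGate, features.JobReadyPods, enabled)
		// ... run syncJob against a Job with ready pods and assert that
		// job.Status.Ready != nil iff enabled ...
	}
}
```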
@@ -281,7 +284,7 @@ No.

Estimated throughput: at most one API call for each Job Pod reaching Ready
condition.
Please describe the rate-limiting mechanism (that you mentioned in the comment) here; it's a very important mitigation that has to be mentioned.
Added description, let me know if this is enough.
Looks good - thanks.
One last comment from me - other than that LGTM.
/lgtm Thanks!
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: mimowo, soltysh, wojtek-t