Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make the defaults for PodsReadyTimeout backoff more practical #2025

Merged

Conversation

mimowo
Copy link
Contributor

@mimowo mimowo commented Apr 19, 2024

What type of PR is this?

/kind bug
/kind documentation

What this PR does / why we need it:

Which issue(s) this PR fixes:

Part of #2009

Special notes for your reviewer:

WIP because still testing, and I need to update the estimations from KEP and API comments.
Early feedback is welcome.

Does this PR introduce a user-facing change?

Make the defaults for PodsReadyTimeout backoff more practical, as for the original values
the couple of first requeues made the impression as immediate on users (below 10s, which 
is negligible to the wait time spent waiting for PodsReady). 

The defaults values for the formula to determine the exponential back are changed as follows:
- base `1s -> 10s`
- exponent: `1.41284738 -> 2`
So, now the consecutive times to requeue a workload are: 10s, 20s, 40s, ...

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. kind/documentation Categorizes issue or PR as related to documentation. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 19, 2024
@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Apr 19, 2024
Copy link

netlify bot commented Apr 19, 2024

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit 3aa079e
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/66267a1aab2a4f00085df88e

@mimowo mimowo force-pushed the pods-ready-timeout-defaults branch from d17b5de to 9eac5c4 Compare April 22, 2024 09:25
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 22, 2024
@mimowo mimowo force-pushed the pods-ready-timeout-defaults branch from 9eac5c4 to f31ab70 Compare April 22, 2024 10:05
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 22, 2024
@mimowo mimowo changed the title WIP: Make the defaults for PodsReadyTimeout backoff more practical Make the defaults for PodsReadyTimeout backoff more practical Apr 22, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 22, 2024
@mimowo
Copy link
Contributor Author

mimowo commented Apr 22, 2024

/assign @tenzen-y

@mimowo
Copy link
Contributor Author

mimowo commented Apr 22, 2024

/cc @alculquicondor

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for creating this PR!
Basically, lgtm.

Could you update API documentation (site)? Also, I'm not sure the reason why our CI didn't detect outdated API documentation...

Comment on lines 233 to 234
For comparison, considering `.waitForPodsReady.timeout=300s` (default),
the workload will spend `50min` total waiting for pods ready.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you mention backoffLimitCount the same as before since the backoffLimitCount never defaults any value, and the requeueing will occur forever?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, but I set it to 10 now PTAL

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant the backoffLimitCount in Config API. In this PR, we introduced the fixed value to Duration in backoff calculation, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// When it is null, the workloads will repeatedly and endless re-queueing. Isn't that enough?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant the backoffLimitCount in Config API. In this PR, we introduced the fixed value to Duration in backoff calculation, right?

I discussed this with @mimowo offline. It seemed that our understanding was missing each other, but we could sync our opinions.

// When it is null, the workloads will repeatedly and endless re-queueing. Isn't that enough?

I'm ok with removing this sentence since it seems that this example seems to lose importance.

pkg/controller/core/core.go Outdated Show resolved Hide resolved
@mimowo mimowo force-pushed the pods-ready-timeout-defaults branch from 83d64ed to 3aa079e Compare April 22, 2024 14:54

type ControllerOption func(*ControllerOptions)

func WithControllerRequeuingBaseDelaySeconds(value int32) ControllerOption {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems unnecessary for the cherry-pick

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I use it to pass different base in prod and integration tests. When I use 10s in integration tests they fail, as the Timeout is 5s only. I considered the following approaches:

  1. Expose the configuration via API (deferred to a follow up since this shouldn't be cherry-picked)
  2. Bump the timeout for the subset of integration tests for PodsReady - this would work, but seems wasteful
  3. Expose configuration which allows me to pass different values to SetupControllers (chosen)

Let me know if there is another approach.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh ok, let's keep this approach.

Copy link
Contributor

@alculquicondor alculquicondor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 22, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 2fa23b305d3603f6365f0b44615630e3d0a1362b

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, mimowo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 22, 2024
@mimowo
Copy link
Contributor Author

mimowo commented Apr 22, 2024

Could you update API documentation (site)? Also, I'm not sure the reason why our CI didn't detect outdated API documentation...

Thanks for spotting the issue, I opened: #2032

@k8s-ci-robot k8s-ci-robot merged commit 63f46ac into kubernetes-sigs:main Apr 22, 2024
15 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.7 milestone Apr 22, 2024
@tenzen-y
Copy link
Member

/lgtm

/cherry-pick release-0.6

@k8s-infra-cherrypick-robot
Copy link
Contributor

@tenzen-y: #2025 failed to apply on top of branch "release-0.6":

Applying: Make the defaults for PodsReady backoff more practical
Using index info to reconstruct a base tree...
M	apis/config/v1beta1/configuration_types.go
M	pkg/controller/core/workload_controller.go
M	pkg/controller/core/workload_controller_test.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/controller/core/workload_controller_test.go
CONFLICT (content): Merge conflict in pkg/controller/core/workload_controller_test.go
Auto-merging pkg/controller/core/workload_controller.go
Auto-merging apis/config/v1beta1/configuration_types.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Make the defaults for PodsReady backoff more practical
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/lgtm

/cherry-pick release-0.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tenzen-y
Copy link
Member

@tenzen-y: #2025 failed to apply on top of branch "release-0.6":

Applying: Make the defaults for PodsReady backoff more practical
Using index info to reconstruct a base tree...
M	apis/config/v1beta1/configuration_types.go
M	pkg/controller/core/workload_controller.go
M	pkg/controller/core/workload_controller_test.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/controller/core/workload_controller_test.go
CONFLICT (content): Merge conflict in pkg/controller/core/workload_controller_test.go
Auto-merging pkg/controller/core/workload_controller.go
Auto-merging apis/config/v1beta1/configuration_types.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Make the defaults for PodsReady backoff more practical
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

@mimowo Could you submit a cherry-pick PR?

alculquicondor pushed a commit to alculquicondor/kueue that referenced this pull request Apr 22, 2024
k8s-ci-robot pushed a commit that referenced this pull request Apr 22, 2024
…ff more practical (#2033)

* Make the defaults for PodsReady backoff more practical

* Fix API reference for PodsReady config
@alculquicondor
Copy link
Contributor

/remove-kind documentation

@k8s-ci-robot k8s-ci-robot removed the kind/documentation Categorizes issue or PR as related to documentation. label May 8, 2024
@mimowo mimowo deleted the pods-ready-timeout-defaults branch May 29, 2024 14:53
kannon92 pushed a commit to openshift-kannon92/kubernetes-sigs-kueue that referenced this pull request Nov 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants