Give k8s-infra-prow-build cluster time to scale-up #18637
Conversation
We have k8s-infra-prow-build set up to autoscale if it doesn't have enough capacity. Anecdotally I saw it take 5s to decide that it needed to scale, but another ~2min for a node to come online. Let's assume it could take longer, so wait up to 5min before calling an unscheduled pod safe to delete.
/cc @BenTheElder @hasheddan
/cc @chases2
Nice! Thanks @spiffxp!
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: hasheddan, spiffxp. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/wg k8s-infra
@spiffxp: Updated the
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
The job I did this for hasn't failed due to pod scheduling timeout since this landed: https://testgrid.k8s.io/sig-testing-canaries#bazel-test&width=20
I think this has taken care of #18507
We have k8s-infra-prow-build set up to autoscale if it doesn't have enough capacity. Anecdotally I saw it take 5s to decide that it needed to scale, but another ~2min for a node to come online. Unfortunately plank deleted the pod after 1min, so we never got to benefit from the added capacity.
Let's assume it could take longer to scale up, so wait up to 5min before calling an unscheduled pod safe to delete.
This is based on investigation I did for period-kubernetes-bazel-test-canary refusing to schedule, ref: #18607 (comment)