Give k8s-infra-prow-build cluster time to scale-up #18637

spiffxp · 2020-08-04T00:41:18Z

We have k8s-infra-prow-build set up to autoscale if it doesn't have enough capacity. Anecdotally, I saw it take 5 seconds to decide that it needed to scale, but another ~2 minutes for a node to come online. Unfortunately, plank deleted the pod after 1 minute, so we never got to benefit from the added capacity.

Let's assume scale-up could take even longer, so wait up to 5 minutes before calling an unscheduled pod safe to delete.

This is based on investigation I did for period-kubernetes-bazel-test-canary refusing to schedule, ref: #18607 (comment)


spiffxp · 2020-08-04T00:42:37Z

/cc @BenTheElder @hasheddan
I think this may improve some Pod scheduling timeout woes in general

/cc @chases2
FYI as test-infra oncall

hasheddan

Nice! Thanks @spiffxp!

/lgtm

k8s-ci-robot · 2020-08-04T00:49:25Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hasheddan, spiffxp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~config/OWNERS~~ [spiffxp]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

spiffxp · 2020-08-04T01:02:41Z

/wg k8s-infra

spiffxp · 2020-08-04T01:04:33Z

This should help with migrating release-blocking jobs to k8s-infra-prow-build (#18549), and especially with the merge-blocking jobs (#18550).

k8s-ci-robot · 2020-08-04T01:06:24Z

@spiffxp: Updated the config configmap in namespace default at cluster default using the following files:

key config.yaml using file config/prow/config.yaml

In response to this:

We have k8s-infra-prow-build set up to autoscale if it doesn't have enough capacity. Anecdotally, I saw it take 5 seconds to decide that it needed to scale, but another ~2 minutes for a node to come online. Unfortunately, plank deleted the pod after 1 minute, so we never got to benefit from the added capacity.

Let's assume scale-up could take even longer, so wait up to 5 minutes before calling an unscheduled pod safe to delete.

This is based on investigation I did for period-kubernetes-bazel-test-canary refusing to schedule, ref: #18607 (comment)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

spiffxp · 2020-08-04T16:04:02Z

The job I did this for hasn't failed due to a pod scheduling timeout since this landed: https://testgrid.k8s.io/sig-testing-canaries#bazel-test&width=20

spiffxp · 2020-08-04T16:09:02Z

I think this has taken care of #18507

k8s-ci-robot added cncf-cla: yes, size/XS, and approved labels Aug 4, 2020

k8s-ci-robot requested review from cblecker, chases2 and wojtek-t August 4, 2020 00:41

k8s-ci-robot added area/config and sig/testing labels Aug 4, 2020

k8s-ci-robot requested review from BenTheElder and hasheddan August 4, 2020 00:42

spiffxp mentioned this pull request Aug 4, 2020

Alter bazel-test-canary to run on k8s-infa-prow-build #18607

Merged

hasheddan approved these changes Aug 4, 2020


k8s-ci-robot assigned hasheddan Aug 4, 2020

k8s-ci-robot added the lgtm label Aug 4, 2020

k8s-ci-robot added the wg/k8s-infra label Aug 4, 2020

k8s-ci-robot merged commit bdabd60 into kubernetes:master Aug 4, 2020

k8s-ci-robot added this to the v1.19 milestone Aug 4, 2020

spiffxp deleted the good-nodes-come-to-those-who-wait branch August 4, 2020 01:32

spiffxp mentioned this pull request Aug 4, 2020

ci-kubernetes-conformance-kind-ga-only is experiencing pod scheduling timeout #18507

Closed
