Alter bazel-test-canary to run on k8s-infra-prow-build #18607
Conversation
Need to make some changes to do so:
- stop using RBE
- take a guess at resource limits (will likely need tuning since the guess is based on runs that used RBE)
/cc @hasheddan @liggitt @BenTheElder
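To make the shape of the change concrete, here is a minimal sketch of a Prow periodic moved off RBE and pinned to a specific build cluster; the job name, image, interval, and resource numbers are illustrative assumptions, not the values merged in this PR:

```yaml
# Hypothetical sketch only: name, image, interval, and resource values are
# placeholders, not copied from this PR.
periodics:
- name: ci-test-infra-bazel-canary        # assumed name for illustration
  interval: 1h
  cluster: k8s-infra-prow-build           # run on the community build cluster instead of the default
  decorate: true
  spec:
    containers:
    - image: gcr.io/example/bazel:latest  # placeholder image
      command: ["bazel"]
      args: ["test", "//..."]             # no --config=ci / --config=remote, i.e. no RBE
      resources:
        requests:                         # guesses; without RBE the work runs in-pod, so expect tuning
          cpu: "4"
          memory: "16Gi"
        limits:
          cpu: "4"
          memory: "16Gi"
```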
  annotations:
    testgrid-dashboards: sig-testing-canaries
    testgrid-tab-name: bazel-test
-   description: run kubernetes-bazel-test with latest image
+   description: run kubernetes-bazel-test without RBE on k8s-infra-prow-build
is "without RBE" necessary? only test-infra is using it currently, nominally
yes, there are kubernetes/kubernetes jobs that use RBE either via --config=ci or --config=remote (ref: https://github.com/kubernetes/kubernetes/blob/master/build/root/.bazelrc#L52-L87), e.g.:
- periodic-bazel-test-master above
- periodic-bazel-build-master
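For contrast with the canary change above, a purely illustrative sketch of a job that does opt into RBE via bazel's config flags (this is not the actual periodic-bazel-* definition, and the name and image are assumptions):

```yaml
# Hypothetical illustration only; the real periodic-bazel-* jobs are defined
# elsewhere in test-infra and are not reproduced here.
periodics:
- name: example-bazel-test-rbe            # assumed name
  interval: 1h
  decorate: true
  spec:
    containers:
    - image: gcr.io/example/bazel:latest  # placeholder image
      command: ["bazel"]
      args: ["test", "--config=remote", "//..."]  # --config=ci similarly pulls in the RBE settings
```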
/lgtm
/hold if you want to adjust description
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: hasheddan, spiffxp. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/hold cancel
@spiffxp: Updated the
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Jobs that failed due to pod scheduling timeout since this merged:
None of them state what resource disqualified them from scheduling; the next step is to search the k8s-infra-prow-build logs for references to the pods these jobs tried to schedule.
@spiffxp I'm going to see if I can get that info presented in spyglass. However, I think it will require significant changes because it isn't surfaced on the
Yeah, I'm not clear on why the info isn't surfacing.
Picking the most recent:
questions this raises:
https://console.cloud.google.com/monitoring/dashboards/custom/10925237040785467832?project=k8s-infra-prow-build&startTime=20200720T162400-07:00&endTime=20200803T162416-07:00 - well, we're definitely scaling up/down, which is a sign that the resource requests/limits are helping
Logs from k8s-prow (google.com) project
So prow's the one doing the deleting. Who do I have to tell to wait longer?
"marked job for stale" comes from test-infra/prow/plank/controller.go Lines 390 to 405 in 86ee5af
|
And maxPodUnscheduled comes from test-infra/config/prow/config.yaml, lines 7 to 8 at 86ee5af.
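Those lines correspond to plank's unscheduled-pod timeout; roughly, they amount to the following (the field name and the 1m value come from this thread, the surrounding structure is assumed rather than quoted from config.yaml at 86ee5af):

```yaml
# Sketch only; structure assumed, not copied from the commit referenced above.
plank:
  pod_unscheduled_timeout: 1m   # pods not scheduled within 1m are marked stale and deleted
```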
OK, so I think 1m is too aggressive if we need to see scale-up succeed; the question is how long is reasonable?
How did we decide on 1m? #16089 says "I expect the pod_unscheduled_timeout change to have no effect: pods are always scheduled in under one second, or never scheduled at all." That was before we scheduled to clusters that autoscale.
k8s-infra-prow-build (Kubernetes Cluster) logs, searching for text "gke-prow-build-pool1-2020043022092218-c3efd08d":
- start looking for text "gke-prow-build-pool1-2020043022092218-c3efd08d-plvr"
- logs settle to a bunch of "kube-node-lease" updates by 13:44
- looked for a tighter bound by looking for tl;dr
Opened #18637 to propose a longer timeout
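The shape of that change is just raising the same setting; the value below is illustrative only, not necessarily what #18637 lands on:

```yaml
# Hypothetical value: long enough for a node-pool scale-up (typically a few
# minutes on GKE) to complete before plank gives up on the pod.
plank:
  pod_unscheduled_timeout: 5m   # illustrative; see #18637 for the actual proposal
```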