
Alter bazel-test-canary to run on k8s-infra-prow-build #18607

Merged

Conversation

@spiffxp (Member) commented Aug 1, 2020

Need to make some changes to do so:

  • stop using RBE
  • take a guess at resource limits (will likely need tuning since the
    guess is based on runs that used RBE); a rough sketch of the kind of change is below

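For concreteness, here is a hypothetical sketch of the kind of periodic job entry this change describes. The job name, image, interval, and resource values are illustrative guesses, not the actual contents of bazel-build-test.yaml:

periodics:
- name: ci-kubernetes-bazel-test-canary        # hypothetical name for the canary job
  cluster: k8s-infra-prow-build                # schedule onto the community-owned build cluster
  interval: 1h
  annotations:
    testgrid-dashboards: sig-testing-canaries
    testgrid-tab-name: bazel-test
    description: run kubernetes-bazel-test without RBE on k8s-infra-prow-build
  spec:
    containers:
    - image: gcr.io/k8s-testimages/kubekins-e2e:latest-master   # illustrative image
      command:
      - runner.sh
      args:
      - make
      - bazel-test                             # local bazel, no RBE --config flags
      resources:
        requests:                              # guessed sizes, expect tuning
          cpu: "6"
          memory: "24Gi"
        limits:
          cpu: "6"
          memory: "24Gi"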
@k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. area/config Issues or PRs related to code in /config area/jobs sig/testing Categorizes an issue or PR as relevant to SIG Testing. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Aug 1, 2020
@spiffxp (Member, Author) commented Aug 2, 2020

/cc @hasheddan @liggitt @BenTheElder
prep for #18582

  annotations:
    testgrid-dashboards: sig-testing-canaries
    testgrid-tab-name: bazel-test
-   description: run kubernetes-bazel-test with latest image
+   description: run kubernetes-bazel-test without RBE on k8s-infra-prow-build
A reviewer (Member) commented:
is "without RBE" necessary? only test-infra is using it currently, nominally

@spiffxp (Member, Author) commented Aug 3, 2020

yes, there are kubernetes/kubernetes jobs that use RBE either via --config=ci or --config=remote (ref: https://github.com/kubernetes/kubernetes/blob/master/build/root/.bazelrc#L52-L87)

e.g.

  • periodic-bazel-test-master above
  • periodic-bazel-build-master

@hasheddan (Contributor) left a comment

/lgtm
/hold if you want to adjust description

@k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 2, 2020
@k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 2, 2020
@k8s-ci-robot (Contributor) commented:
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hasheddan, spiffxp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@spiffxp (Member, Author) commented Aug 3, 2020

/hold cancel

@k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 3, 2020
@k8s-ci-robot merged commit 83d572b into kubernetes:master Aug 3, 2020
@k8s-ci-robot added this to the v1.19 milestone Aug 3, 2020
@k8s-ci-robot (Contributor) commented:
@spiffxp: Updated the job-config configmap in namespace default at cluster default using the following files:

  • key bazel-build-test.yaml using file config/jobs/kubernetes/sig-testing/bazel-build-test.yaml

In response to this (the PR description quoted above):

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@spiffxp deleted the k8s-infra-bazel-test-canary branch August 3, 2020 16:15
@spiffxp (Member, Author) commented Aug 3, 2020

@hasheddan (Contributor) commented:
@spiffxp I'm going to see if I can get that info presented in spyglass. However, I think it will require significant changes, because from what I can tell it isn't surfaced in prowjob.json.

@spiffxp (Member, Author) commented Aug 3, 2020

Yeah I'm not clear on why the info isn't surfacing

@spiffxp (Member, Author) commented Aug 3, 2020

Picking the most recent:

questions this raises:

  • when did the scale-up finish?
  • was the pod deleted before the scale-up finished?
  • why was the pod deleted?

@spiffxp (Member, Author) commented Aug 3, 2020

https://console.cloud.google.com/monitoring/dashboards/custom/10925237040785467832?project=k8s-infra-prow-build&startTime=20200720T162400-07:00&endTime=20200803T162416-07:00 - well, we're definitely scaling up/down, which is a sign that the resource requests/limits are helping

@spiffxp (Member, Author) commented Aug 4, 2020

Logs from k8s-prow (google.com) project

2020-08-03 13:40:54.000 PDT - horologium - Triggering new run of interval periodic.
2020-08-03 13:41:03.000 PDT - plank - Create Pod.
...
2020-08-03 13:41:15.000 PDT - crier - Successfully updated prowjob (jobStatus: pending)
2020-08-03 13:42:03.000 PDT - plank - Marked job for stale unscheduled pod as errored
2020-08-03 13:42:03.000 PDT - plank - Delete stale running pod
2020-08-03 13:42:04.000 PDT - crier - Successfully updated prowjob (jobStatus: error)
...
2020-08-03 13:55:07.000 PDT - sinker - Deleted prowjob.

So prow's the one doing the deleting. Who do I have to tell to wait longer?

@spiffxp (Member, Author) commented Aug 4, 2020

"marked job for stale" comes from

case corev1.PodPending:
    maxPodPending := c.config().Plank.PodPendingTimeout.Duration
    maxPodUnscheduled := c.config().Plank.PodUnscheduledTimeout.Duration
    if pod.Status.StartTime.IsZero() {
        if time.Since(pod.CreationTimestamp.Time) >= maxPodUnscheduled {
            // Pod is stuck in unscheduled state longer than maxPodUnscheduled;
            // abort the job
            pj.SetComplete()
            pj.Status.State = prowapi.ErrorState
            pj.Status.Description = "Pod scheduling timeout."
            c.log.WithFields(pjutil.ProwJobFields(&pj)).Info("Marked job for stale unscheduled pod as errored.")
            if err := c.deletePod(&pj); err != nil {
                return fmt.Errorf("failed to delete pod %s/%s in cluster %s: %w", pod.Namespace, pod.Name, pj.ClusterAlias(), err)
            }
            break
        }

@spiffxp (Member, Author) commented Aug 4, 2020

And maxPodUnscheduled comes from...

pod_pending_timeout: 15m
pod_unscheduled_timeout: 1m

OK, so I think 1m is too aggressive if we need to wait for scale-up to succeed; the question is how long is reasonable?
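For context, a minimal sketch of where these two knobs live in prow's config.yaml, assuming the standard plank section (the values shown are the current ones quoted above, not a proposal):

plank:
  pod_pending_timeout: 15m       # how long a scheduled pod may sit in Pending before erroring
  pod_unscheduled_timeout: 1m    # how long an unscheduled pod may wait before the job errors and the pod is deleted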

@spiffxp (Member, Author) commented Aug 4, 2020

How did we decide on 1m? #16089 says "I expect the pod_unscheduled_timeout change to have no effect: pods are always scheduled in under one second, or never scheduled at all."

That was before we scheduled to clusters that autoscale.

@spiffxp (Member, Author) commented Aug 4, 2020

Logs from the k8s-infra-prow-build project (Kubernetes Cluster resource), searching for text "gke-prow-build-pool1-2020043022092218-c3efd08d":

2020-08-03 13:41:06.000 PDT - Scale-up: setting group https://content.googleapis.com/compute/v1/projects/k8s-infra-prow-build/zones/us-central1-f/instanceGroups/gke-prow-build-pool1-2020043022092218-c3efd08d-grp size to 7
2020-08-03 13:41:08.000 PDT - Scale-up: group https://content.googleapis.com/compute/v1/projects/k8s-infra-prow-build/zones/us-central1-f/instanceGroups/gke-prow-build-pool1-2020043022092218-c3efd08d-grp size set to 7
2020-08-03 13:42:40.805 PDT - protoPayload.methodName: "io.k8s.core.v1.nodes.create", 

Then started looking for text "gke-prow-build-pool1-2020043022092218-c3efd08d-plvr" (presumably the newly created node).

The logs settle into a bunch of "kube-node-lease" updates by 13:44.

I looked for a tighter bound by searching for protoPayload.response.status.conditions.status="True"; the last entry was by 13:42:49.

tl;dr

  • took 2-3 min from scale-up to node ready
  • I'd like to suggest 5 min to give us room (see the sketch below)
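A minimal sketch of what that suggestion would look like against the plank section quoted earlier (5m is the proposed value, not something merged in this PR):

plank:
  pod_pending_timeout: 15m
  pod_unscheduled_timeout: 5m    # was 1m; leave room for the cluster autoscaler to bring up a node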

@spiffxp (Member, Author) commented Aug 4, 2020

Opened #18637 to propose a longer timeout
