
Alter bazel-test-canary to run on k8s-infra-prow-build #18607

Merged

Conversation

@spiffxp (Member) commented Aug 1, 2020

Need to make some changes to do so:

  • stop using RBE
  • take a guess at resource limits (will likely need tuning since the
    guess is based on runs that used RBE); a rough sketch of the kind of change is below

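For concreteness, here is a hypothetical sketch of the kind of periodic job entry this change describes. The job name, image, interval, and resource values are illustrative guesses, not the actual contents of bazel-build-test.yaml:

periodics:
- name: ci-kubernetes-bazel-test-canary        # hypothetical name for the canary job
  cluster: k8s-infra-prow-build                # schedule onto the community-owned build cluster
  interval: 1h
  annotations:
    testgrid-dashboards: sig-testing-canaries
    testgrid-tab-name: bazel-test
    description: run kubernetes-bazel-test without RBE on k8s-infra-prow-build
  spec:
    containers:
    - image: gcr.io/k8s-testimages/kubekins-e2e:latest-master   # illustrative image
      command:
      - runner.sh
      args:
      - make
      - bazel-test                             # local bazel, no RBE --config flags
      resources:
        requests:                              # guessed sizes, expect tuning
          cpu: "6"
          memory: "24Gi"
        limits:
          cpu: "6"
          memory: "24Gi"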
@k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. area/config Issues or PRs related to code in /config area/jobs sig/testing Categorizes an issue or PR as relevant to SIG Testing. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Aug 1, 2020
@spiffxp (Member, Author) commented Aug 2, 2020

/cc @hasheddan @liggitt @BenTheElder
prep for #18582

  annotations:
    testgrid-dashboards: sig-testing-canaries
    testgrid-tab-name: bazel-test
-   description: run kubernetes-bazel-test with latest image
+   description: run kubernetes-bazel-test without RBE on k8s-infra-prow-build
A reviewer (Member) commented:
is "without RBE" necessary? only test-infra is using it currently, nominally

@spiffxp (Member, Author) commented Aug 3, 2020

yes, there are kubernetes/kubernetes jobs that use RBE either via --config=ci or --config=remote (ref: https://github.com/kubernetes/kubernetes/blob/master/build/root/.bazelrc#L52-L87)

e.g.

  • periodic-bazel-test-master above
  • periodic-bazel-build-master

@hasheddan (Contributor) left a comment

/lgtm
/hold if you want to adjust description

@k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 2, 2020
@k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 2, 2020
@k8s-ci-robot (Contributor) commented:
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hasheddan, spiffxp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@spiffxp (Member, Author) commented Aug 3, 2020

/hold cancel

@k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 3, 2020
@k8s-ci-robot merged commit 83d572b into kubernetes:master Aug 3, 2020
@k8s-ci-robot added this to the v1.19 milestone Aug 3, 2020
@k8s-ci-robot (Contributor) commented:
@spiffxp: Updated the job-config configmap in namespace default at cluster default using the following files:

  • key bazel-build-test.yaml using file config/jobs/kubernetes/sig-testing/bazel-build-test.yaml

In response to this (the PR description quoted above):

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@spiffxp deleted the k8s-infra-bazel-test-canary branch August 3, 2020 16:15
@spiffxp (Member, Author) commented Aug 3, 2020

@hasheddan (Contributor) commented:
@spiffxp I'm going to see if I can get that info presented in spyglass. However, I think it will require significant changes, because from what I can tell it isn't surfaced in prowjob.json.

@spiffxp (Member, Author) commented Aug 3, 2020

Yeah I'm not clear on why the info isn't surfacing

@spiffxp (Member, Author) commented Aug 3, 2020

Picking the most recent:

questions this raises:

  • when did the scale-up finish?
  • was the pod deleted before the scale-up finished?
  • why was the pod deleted?

@spiffxp (Member, Author) commented Aug 3, 2020

https://console.cloud.google.com/monitoring/dashboards/custom/10925237040785467832?project=k8s-infra-prow-build&startTime=20200720T162400-07:00&endTime=20200803T162416-07:00 - well, we're definitely scaling up/down, which is a sign that the resource requests/limits are helping

@spiffxp (Member, Author) commented Aug 4, 2020

Logs from k8s-prow (google.com) project

2020-08-03 13:40:54.000 PDT - horologium - Triggering new run of interval periodic.
2020-08-03 13:41:03.000 PDT - plank - Create Pod.
...
2020-08-03 13:41:15.000 PDT - crier - Successfully updated prowjob (jobStatus: pending)
2020-08-03 13:42:03.000 PDT - plank - Marked job for stale unscheduled pod as errored
2020-08-03 13:42:03.000 PDT - plank - Delete stale running pod
2020-08-03 13:42:04.000 PDT - crier - Successfully updated prowjob (jobStatus: error)
...
2020-08-03 13:55:07.000 PDT - sinker - Deleted prowjob.

So prow's the one doing the deleting. Who do I have to tell to wait longer?

@spiffxp (Member, Author) commented Aug 4, 2020

"marked job for stale" comes from

case corev1.PodPending:
    maxPodPending := c.config().Plank.PodPendingTimeout.Duration
    maxPodUnscheduled := c.config().Plank.PodUnscheduledTimeout.Duration
    if pod.Status.StartTime.IsZero() {
        if time.Since(pod.CreationTimestamp.Time) >= maxPodUnscheduled {
            // Pod is stuck in unscheduled state longer than maxPodUnscheduled;
            // abort the job
            pj.SetComplete()
            pj.Status.State = prowapi.ErrorState
            pj.Status.Description = "Pod scheduling timeout."
            c.log.WithFields(pjutil.ProwJobFields(&pj)).Info("Marked job for stale unscheduled pod as errored.")
            if err := c.deletePod(&pj); err != nil {
                return fmt.Errorf("failed to delete pod %s/%s in cluster %s: %w", pod.Namespace, pod.Name, pj.ClusterAlias(), err)
            }
            break
        }

@spiffxp (Member, Author) commented Aug 4, 2020

And maxPodUnscheduled comes from...

pod_pending_timeout: 15m
pod_unscheduled_timeout: 1m

OK, so I think 1m is too aggressive if we need to wait for scale-up to succeed; the question is how long is reasonable?
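For context, a minimal sketch of where these two knobs live in prow's config.yaml, assuming the standard plank section (the values shown are the current ones quoted above, not a proposal):

plank:
  pod_pending_timeout: 15m       # how long a scheduled pod may sit in Pending before erroring
  pod_unscheduled_timeout: 1m    # how long an unscheduled pod may wait before the job errors and the pod is deleted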

@spiffxp (Member, Author) commented Aug 4, 2020

How did we decide on 1m? #16089 says "I expect the pod_unscheduled_timeout change to have no effect: pods are always scheduled in under one second, or never scheduled at all."

That was before we scheduled to clusters that autoscale.

@spiffxp (Member, Author) commented Aug 4, 2020

Logs from the k8s-infra-prow-build project (Kubernetes Cluster resource), searching for text "gke-prow-build-pool1-2020043022092218-c3efd08d":

2020-08-03 13:41:06.000 PDT - Scale-up: setting group https://content.googleapis.com/compute/v1/projects/k8s-infra-prow-build/zones/us-central1-f/instanceGroups/gke-prow-build-pool1-2020043022092218-c3efd08d-grp size to 7
2020-08-03 13:41:08.000 PDT - Scale-up: group https://content.googleapis.com/compute/v1/projects/k8s-infra-prow-build/zones/us-central1-f/instanceGroups/gke-prow-build-pool1-2020043022092218-c3efd08d-grp size set to 7
2020-08-03 13:42:40.805 PDT - protoPayload.methodName: "io.k8s.core.v1.nodes.create", 

Then started looking for text "gke-prow-build-pool1-2020043022092218-c3efd08d-plvr" (presumably the newly created node).

The logs settle into a bunch of "kube-node-lease" updates by 13:44.

I looked for a tighter bound by searching for protoPayload.response.status.conditions.status="True"; the last entry was by 13:42:49.

tl;dr

  • took 2-3 min from scale-up to node ready
  • I'd like to suggest 5 min to give us room (see the sketch below)
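A minimal sketch of what that suggestion would look like against the plank section quoted earlier (5m is the proposed value, not something merged in this PR):

plank:
  pod_pending_timeout: 15m
  pod_unscheduled_timeout: 5m    # was 1m; leave room for the cluster autoscaler to bring up a node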

@spiffxp (Member, Author) commented Aug 4, 2020

Opened #18637 to propose a longer timeout
