
bump kind presubmit jobs resources #20966

Merged
merged 1 commit into kubernetes:master on Feb 26, 2021

Conversation

@aojea (Member) commented Feb 23, 2021

Since these jobs are flaking, mainly because pods time out waiting to become Running, one guess is that this is due to resource constraints.

With no hard evidence, trying something and observing the effect seems like a reasonable approach.

xref: #18825 (comment)
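
Roughly, the change bumps the resource requests/limits on the kind presubmit definitions in config/jobs/kubernetes/sig-testing/kubernetes-kind-presubmits.yaml. A minimal sketch of the shape of such a bump (the values and surrounding fields here are illustrative, not the PR's actual diff):

```yaml
# Sketch only; see the real definition in
# config/jobs/kubernetes/sig-testing/kubernetes-kind-presubmits.yaml.
- name: pull-kubernetes-e2e-kind
  # ... decoration, labels, branches, etc. unchanged ...
  spec:
    containers:
    - # ... image, command, args unchanged ...
      resources:
        requests:
          cpu: "7"      # raise the CPU request so the job is less exposed to noisy neighbors
          memory: 9Gi   # illustrative value
        limits:
          cpu: "7"
          memory: 9Gi
```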

@k8s-ci-robot added the cncf-cla: yes label (Indicates the PR's author has signed the CNCF CLA) on Feb 23, 2021
@aojea (Member, Author) commented Feb 23, 2021

/assign @spiffxp @BenTheElder

@k8s-ci-robot added the size/S (Denotes a PR that changes 10-29 lines, ignoring generated files), area/config (Issues or PRs related to code in /config), area/jobs, and sig/testing (Categorizes an issue or PR as relevant to SIG Testing) labels on Feb 23, 2021
@spiffxp (Member) left a comment

I personally would rather see this done for one job first, whichever is highest traffic, to confirm whether this works

@spiffxp (Member) commented Feb 23, 2021

https://storage.googleapis.com/k8s-gubernator/triage/index.html?job=kubernetes.*kind
[screenshot: Screen Shot 2021-02-23 at 12 17 37 PM]

There does seem to be a rise in this failure since ~2/19. The k8s-infra-prow-build cluster was updated to use a node-local DNS cache add-on and upgraded to v1.17 at that time (ref: kubernetes/k8s.io#1541 (comment) and kubernetes/k8s.io#1541 (comment)).

It's worth noting that this is happening for CI jobs too, which are already at 7 CPU, e.g. https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-kind-e2e-parallel-1-20/1364237532232945664/prowjob.json

I would try looking at logs on the cluster, e.g. `resource.type=~"k8s_(cluster|node|pod)" resource.labels.project_id="k8s-infra-prow-build"`, and filtering down to see if you can identify a change in behavior over the last 5 days.
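
For example, a Logs Explorer query along these lines; the time-range and severity clauses are illustrative additions, not part of the filter quoted above:

```
resource.type=~"k8s_(cluster|node|pod)"
resource.labels.project_id="k8s-infra-prow-build"
timestamp>="2021-02-18T00:00:00Z"
severity>=WARNING
```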

@BenTheElder (Member) commented

> I personally would rather see this done for one job first, whichever is highest traffic, to confirm whether this works

+1. Also, we should be more aggressive for the one job: schedule up ~the whole node (7.X) so we can say either "yes, it works without noisy I/O neighbors" or "no, we still see throttling and pod startup flakes". If it works, we can follow up by tuning another job less aggressively. If it doesn't, this is only going to make the job harder to schedule, and there's some evidence that this may not be sufficient.

We also need to look into kubernetes/k8s.io#1187

@aojea (Member, Author) commented Feb 23, 2021

/hold
this sounds like a plan

@k8s-ci-robot added the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command) on Feb 23, 2021
@spiffxp (Member) commented Feb 23, 2021

> ~the whole node (7.X)

This has gotten us into trouble almost every time someone has done it. Stick to 7 CPU for the test container; the additional overhead of the pod utils puts you at 7.4. That's more than enough to prevent other jobs from trying to sneak in.

@BenTheElder (Member) commented Feb 23, 2021

> This has gotten us into trouble almost every time someone has done it. Stick to 7 CPU for the test container; the additional overhead of the pod utils puts you at 7.4. That's more than enough to prevent other jobs from trying to sneak in.

7.4 out of ~8 sounds sufficient; I forgot about the pod utils requests.

@BenTheElder (Member) commented

that particular error string increased ~6 days ago?

kubernetes/kubernetes#99147 (comment) 🙃

@aojea (Member, Author) commented Feb 24, 2021

> that particular error string increased ~6 days ago?
>
> kubernetes/kubernetes#99147 (comment)

Hahah, at least we know the reason: pods are not able to become Running and Ready.

@spiffxp (Member) commented Feb 25, 2021

#20966 (comment) - as a reminder, this graph is for CI jobs only, which already have 7 cpu.

kubernetes/k8s.io#1703 raised quota; if you want to modify this to update just pull-kubernetes-e2e-kind, I'm willing to merge.

@spiffxp (Member) commented Feb 25, 2021

(Images are from a local Grafana dashboard; someday I'll make this public, or set up a Data Studio dashboard.)

tl;dr: I would suggest trying to figure out what changed between 2021-02-04 and 2021-02-07.

Is this due to an increase in traffic? Probably not. Looking back 6 months, the failure rate wasn't this bad before the holiday lull.
[screenshot: Screen Shot 2021-02-25 at 12 58 48 PM]

Looking at the last 90 days, for just the kind jobs, it's pretty clear that pull-kubernetes-e2e-kind jumped sometime at the beginning of February.
[screenshot: Screen Shot 2021-02-25 at 1 00 47 PM]

Looks like something changed between 2021-02-04 and 2021-02-07.
[screenshot: Screen Shot 2021-02-25 at 1 03 22 PM]

Same story for the CI jobs
[screenshot: Screen Shot 2021-02-25 at 1 05 13 PM]

@spiffxp (Member) commented Feb 25, 2021

What I don't understand is why triage doesn't tell the same story:
https://storage.googleapis.com/k8s-gubernator/triage/index.html?date=2020-02-11&job=kind
[screenshot: Screen Shot 2021-02-25 at 1 09 20 PM]

@aojea (Member, Author) commented Feb 26, 2021

> What I don't understand is why triage doesn't tell the same story:
> https://storage.googleapis.com/k8s-gubernator/triage/index.html?date=2020-02-11&job=kind
> [screenshot: Screen Shot 2021-02-25 at 1 09 20 PM]

It is actually the opposite: the failure rate decreased after that.

> tl;dr: I would suggest trying to figure out what changed between 2021-02-04 and 2021-02-07.

@k8s-ci-robot added the size/XS label (Denotes a PR that changes 0-9 lines, ignoring generated files) and removed the size/S label (Denotes a PR that changes 10-29 lines, ignoring generated files) on Feb 26, 2021
@aojea (Member, Author) commented Feb 26, 2021

/hold cancel
Now it updates only the kind presubmit.

@k8s-ci-robot removed the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command) on Feb 26, 2021
@spiffxp (Member) left a comment

/approve
/lgtm

@k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged) on Feb 26, 2021
@k8s-ci-robot (Contributor) commented

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, spiffxp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

  • Approvers can indicate their approval by writing /approve in a comment
  • Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files) on Feb 26, 2021
@spiffxp (Member) commented Feb 26, 2021

Triage filters out failures that don't fit into clusters of a certain size, keyed by failure text. So it's still the case that the jobs are failing more, just not due to easily identified common error messages.

@k8s-ci-robot merged commit 421e300 into kubernetes:master on Feb 26, 2021
@k8s-ci-robot added this to the v1.21 milestone on Feb 26, 2021
@k8s-ci-robot (Contributor) commented

@aojea: Updated the job-config configmap in namespace default at cluster test-infra-trusted using the following files:

  • key kubernetes-kind-presubmits.yaml using file config/jobs/kubernetes/sig-testing/kubernetes-kind-presubmits.yaml

In response to this:

> Since these jobs are flaking, mainly because pods time out waiting to become Running, one guess is that this is due to resource constraints.
>
> With no hard evidence, trying something and observing the effect seems like a reasonable approach.
>
> xref: #18825 (comment)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
  • approved (Indicates a PR has been approved by an approver from all required OWNERS files)
  • area/config (Issues or PRs related to code in /config)
  • area/jobs
  • cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA)
  • lgtm ("Looks good to me", indicates that a PR is ready to be merged)
  • sig/testing (Categorizes an issue or PR as relevant to SIG Testing)
  • size/XS (Denotes a PR that changes 0-9 lines, ignoring generated files)