
bump kind presubmit jobs resources #20966

Merged
merged 1 commit into kubernetes:master on Feb 26, 2021

Conversation

@aojea (Member) commented Feb 23, 2021

Since these jobs are flaking, mainly because pods time out waiting to become Running, one guess is that this is due to resource constraints.

With no hard evidence, trying something and observing the effect seems like a reasonable approach.

xref: #18825 (comment)
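
Roughly, the change bumps the resource requests/limits on the kind presubmit definitions in config/jobs/kubernetes/sig-testing/kubernetes-kind-presubmits.yaml. A minimal sketch of the shape of such a bump (the values and surrounding fields here are illustrative, not the PR's actual diff):

```yaml
# Sketch only; see the real definition in
# config/jobs/kubernetes/sig-testing/kubernetes-kind-presubmits.yaml.
- name: pull-kubernetes-e2e-kind
  # ... decoration, labels, branches, etc. unchanged ...
  spec:
    containers:
    - # ... image, command, args unchanged ...
      resources:
        requests:
          cpu: "7"      # raise the CPU request so the job is less exposed to noisy neighbors
          memory: 9Gi   # illustrative value
        limits:
          cpu: "7"
          memory: 9Gi
```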

@k8s-ci-robot added the cncf-cla: yes label (Indicates the PR's author has signed the CNCF CLA) on Feb 23, 2021
@aojea (Member, Author) commented Feb 23, 2021

/assign @spiffxp @BenTheElder

@k8s-ci-robot added the size/S (Denotes a PR that changes 10-29 lines, ignoring generated files), area/config (Issues or PRs related to code in /config), area/jobs, and sig/testing (Categorizes an issue or PR as relevant to SIG Testing) labels on Feb 23, 2021
@spiffxp (Member) left a comment

I personally would rather see this done for one job first, whichever is highest traffic, to confirm whether this works

@spiffxp (Member) commented Feb 23, 2021

https://storage.googleapis.com/k8s-gubernator/triage/index.html?job=kubernetes.*kind
[screenshot: Screen Shot 2021-02-23 at 12 17 37 PM]

There does seem to be a rise in this failure since ~2/19. The k8s-infra-prow-build cluster was updated to use a node-local DNS cache add-on and upgraded to v1.17 at that time (ref: kubernetes/k8s.io#1541 (comment) and kubernetes/k8s.io#1541 (comment)).

It's worth noting that this is happening for CI jobs too, which are already at 7 CPU, e.g. https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-kind-e2e-parallel-1-20/1364237532232945664/prowjob.json

I would try looking at logs on the cluster, e.g. `resource.type=~"k8s_(cluster|node|pod)" resource.labels.project_id="k8s-infra-prow-build"`, and filtering down to see if you can identify a change in behavior over the last 5 days.
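
For example, a Logs Explorer query along these lines; the time-range and severity clauses are illustrative additions, not part of the filter quoted above:

```
resource.type=~"k8s_(cluster|node|pod)"
resource.labels.project_id="k8s-infra-prow-build"
timestamp>="2021-02-18T00:00:00Z"
severity>=WARNING
```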

@BenTheElder (Member) commented

> I personally would rather see this done for one job first, whichever is highest traffic, to confirm whether this works

+1. Also, we should be more aggressive for the one job: schedule up ~the whole node (7.X) so we can say either "yes, it works without noisy I/O neighbors" or "no, we still see throttling and pod startup flakes". If it works, we can follow up by tuning another job less aggressively. If it doesn't, this is only going to make the job harder to schedule, and there's some evidence that this may not be sufficient.

We also need to look into kubernetes/k8s.io#1187

@aojea (Member, Author) commented Feb 23, 2021

/hold
this sounds like a plan

@k8s-ci-robot added the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command) on Feb 23, 2021
@spiffxp (Member) commented Feb 23, 2021

> ~the whole node (7.X)

This has gotten us into trouble almost every time someone has done it. Stick to 7 CPU for the test container; the additional overhead of the pod utils puts you at 7.4. That's more than enough to prevent other jobs from trying to sneak in.

@BenTheElder (Member) commented Feb 23, 2021

> This has gotten us into trouble almost every time someone has done it. Stick to 7 CPU for the test container; the additional overhead of the pod utils puts you at 7.4. That's more than enough to prevent other jobs from trying to sneak in.

7.4 out of ~8 sounds sufficient; I forgot about the pod utils requests.

@BenTheElder (Member) commented

that particular error string increased ~6 days ago?

kubernetes/kubernetes#99147 (comment) 🙃

@aojea (Member, Author) commented Feb 24, 2021

> that particular error string increased ~6 days ago?
>
> kubernetes/kubernetes#99147 (comment)

Hahah, at least we know the reason: pods are not able to become Running and Ready.

@spiffxp (Member) commented Feb 25, 2021

#20966 (comment) - as a reminder, this graph is for CI jobs only, which already have 7 cpu.

kubernetes/k8s.io#1703 raised quota; if you want to modify this to update just pull-kubernetes-e2e-kind, I'm willing to merge.

@spiffxp (Member) commented Feb 25, 2021

(Images are from a local Grafana dashboard; someday I'll make this public, or set up a Data Studio dashboard.)

tl;dr: I would suggest trying to figure out what changed between 2021-02-04 and 2021-02-07.

Is this due to an increase in traffic? Probably not. Looking back 6 months, the failure rate wasn't this bad before the holiday lull.
[screenshot: Screen Shot 2021-02-25 at 12 58 48 PM]

Looking at the last 90 days, for just the kind jobs, it's pretty clear that pull-kubernetes-e2e-kind jumped sometime at the beginning of February.
[screenshot: Screen Shot 2021-02-25 at 1 00 47 PM]

Looks like something changed between 2021-02-04 and 2021-02-07.
[screenshot: Screen Shot 2021-02-25 at 1 03 22 PM]

Same story for the CI jobs
[screenshot: Screen Shot 2021-02-25 at 1 05 13 PM]

@spiffxp (Member) commented Feb 25, 2021

What I don't understand is why triage doesn't tell the same story:
https://storage.googleapis.com/k8s-gubernator/triage/index.html?date=2020-02-11&job=kind
[screenshot: Screen Shot 2021-02-25 at 1 09 20 PM]

@aojea (Member, Author) commented Feb 26, 2021

> What I don't understand is why triage doesn't tell the same story:
> https://storage.googleapis.com/k8s-gubernator/triage/index.html?date=2020-02-11&job=kind
> [screenshot: Screen Shot 2021-02-25 at 1 09 20 PM]

It is actually the opposite: the failure rate decreased after that.

> tl;dr: I would suggest trying to figure out what changed between 2021-02-04 and 2021-02-07.

@k8s-ci-robot added the size/XS label (Denotes a PR that changes 0-9 lines, ignoring generated files) and removed the size/S label (Denotes a PR that changes 10-29 lines, ignoring generated files) on Feb 26, 2021
@aojea (Member, Author) commented Feb 26, 2021

/hold cancel
Now it updates only the kind presubmit.

@k8s-ci-robot removed the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command) on Feb 26, 2021
@spiffxp (Member) left a comment

/approve
/lgtm

@k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged) on Feb 26, 2021
@k8s-ci-robot (Contributor) commented

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, spiffxp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

  • Approvers can indicate their approval by writing /approve in a comment
  • Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files) on Feb 26, 2021
@spiffxp (Member) commented Feb 26, 2021

Triage filters out failures that don't fit into clusters of a certain size, keyed by failure text. So it's still the case that the jobs are failing more, just not due to easily identified common error messages.

@k8s-ci-robot merged commit 421e300 into kubernetes:master on Feb 26, 2021
@k8s-ci-robot added this to the v1.21 milestone on Feb 26, 2021
@k8s-ci-robot (Contributor) commented

@aojea: Updated the job-config configmap in namespace default at cluster test-infra-trusted using the following files:

  • key kubernetes-kind-presubmits.yaml using file config/jobs/kubernetes/sig-testing/kubernetes-kind-presubmits.yaml

In response to this:

> Since these jobs are flaking, mainly because pods time out waiting to become Running, one guess is that this is due to resource constraints.
>
> With no hard evidence, trying something and observing the effect seems like a reasonable approach.
>
> xref: #18825 (comment)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
  • approved (Indicates a PR has been approved by an approver from all required OWNERS files)
  • area/config (Issues or PRs related to code in /config)
  • area/jobs
  • cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA)
  • lgtm ("Looks good to me", indicates that a PR is ready to be merged)
  • sig/testing (Categorizes an issue or PR as relevant to SIG Testing)
  • size/XS (Denotes a PR that changes 0-9 lines, ignoring generated files)