bump kind presubmit jobs resources #20966
Conversation
/assign @spiffxp @BenTheElder
I personally would rather see this done for one job first, whichever is highest traffic, to confirm whether this works.
https://prow.k8s.io/?type=presubmit&job=*kind*&cluster=k8s-infra-prow-build shows the behavior over the last 12h.
https://storage.googleapis.com/k8s-gubernator/triage/index.html?job=kubernetes.*kind
There does seem to be a rise in this failure since ~2/19. The k8s-infra-prow-build cluster was updated to use a node-local dns cache add-on and upgraded to v1.17 at that time (ref: kubernetes/k8s.io#1541 (comment) and kubernetes/k8s.io#1541 (comment)).
It's worth noting that this is happening for CI jobs too, which are already at 7 cpu, e.g. https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-kind-e2e-parallel-1-20/1364237532232945664/prowjob.json
I would try looking into logs on the cluster, e.g.
+1. Also, we should be more aggressive for the one job: we should schedule up ~the whole node (7.x) so we can say "yes, it works without noisy IO neighbors" or "no, we still see throttling + pod startup flakes". If it works, then we can follow up by tuning another job less aggressively. If it doesn't, this is only going to make it harder to schedule, and there's some evidence that this may not be sufficient. We also need to look into kubernetes/k8s.io#1187
/hold
This has gotten us into trouble almost every time someone does it. Stick to 7 cpu for the test container; the additional overhead of the pod utils puts you at 7.4. That's more than enough to prevent other jobs from trying to sneak in.
7.4 / ~8 sounds sufficient; I forgot about the pod utils requests.
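For context, here is a minimal sketch of what a 7 cpu request on the test container could look like in a prow presubmit definition. The job name, image, entrypoint, and memory values below are illustrative placeholders, not the actual diff in this PR:

presubmits:
  kubernetes/kubernetes:
  - name: pull-kubernetes-e2e-kind-example   # placeholder job name, not the real job
    always_run: false
    decorate: true   # pod utils (clonerefs, initupload, entrypoint, sidecar) add their own small requests
    spec:
      containers:
      - image: gcr.io/k8s-staging-test-infra/krte:latest   # placeholder image/tag
        command:
        - wrapper.sh   # placeholder entrypoint
        args:
        - bash
        - -c
        - ./hack/example-e2e.sh   # placeholder test script
        resources:
          requests:
            cpu: "7"        # test container only; with pod utils the pod lands around 7.4 of ~8 allocatable
            memory: "9Gi"   # illustrative value
          limits:
            cpu: "7"
            memory: "9Gi"

Requesting 7 for the test container rather than the full node leaves room for the decorated pod's utility containers while still effectively keeping other jobs off an 8-cpu node.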
That particular error string increased ~6 days ago?
hahah, at least we know the reason: pods are not able to become running and ready
#20966 (comment) - as a reminder, this graph is for CI jobs only, which already have 7 cpu. kubernetes/k8s.io#1703 raised quota; if you want to modify this to update just
What I don't understand is why triage doesn't tell the same story:
It is actually the opposite: the failure rate decreased after
(force-pushed from 12686b8 to 291f595)
/hold cancel
/approve
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: aojea, spiffxp. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Triage filters out failures that don't fit into clusters of a certain size by failure text, so it's still the case that the jobs are failing more, just not due to easily identified common error messages.
@aojea: Updated the
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Since these jobs are flaking, mainly because pods time out waiting to become running, one guess is that this is due to resource constraints.
With no hard evidence, trying something and observing how it affects the jobs seems like a reasonable approach.
xref: #18825 (comment)