go.k8s.io/triage serving stale data, triage update job timing out at 2h #17625
Comments
Analyzing one occurrence for reference; however, I'm not familiar with these jobs at all, so disregard this if it doesn't make sense:
I0513 05:23:15.999] Finished clustering 2805 unique tests (187670 failures) into 4470 clusters in 106m57s <<<< is this normal?
@aojea the green jobs also had a similar number of tests, but yes, the red jobs had a lot more failures (for example, 2877 tests and 208690 failures in https://storage.googleapis.com/kubernetes-jenkins/logs/ci-test-infra-triage/1262429570640908288/build-log.txt). Let's bump the timeout for now.
The quantity of tests and failures doesn't seem wildly out of tolerance. The duration is. I have a sneaking suspicion that integration tests are dumping a ton of output that doesn't cluster well, and are repeatedly hitting the worst-case scenario of computing edit distance against every single cluster of failures.
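For illustration (not triage's actual code; the names and thresholds below are hypothetical), a minimal Go sketch of that worst case: a greedy clusterer that compares each failure against every existing cluster representative by Levenshtein edit distance. When failure texts are long and dissimilar, every failure gets compared against every representative, and each comparison costs O(len(a)·len(b)):

```go
package main

import "fmt"

// editDistance computes the Levenshtein distance between a and b.
// Cost is O(len(a) * len(b)), so very long failure output is expensive.
func editDistance(a, b string) int {
	prev := make([]int, len(b)+1)
	curr := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(a); i++ {
		curr[0] = i
		for j := 1; j <= len(b); j++ {
			cost := 1
			if a[i-1] == b[j-1] {
				cost = 0
			}
			curr[j] = minInt(prev[j]+1, minInt(curr[j-1]+1, prev[j-1]+cost))
		}
		prev, curr = curr, prev
	}
	return prev[len(b)]
}

func minInt(x, y int) int {
	if x < y {
		return x
	}
	return y
}

// clusterFailures greedily puts each failure into the first cluster whose
// representative is within maxDist edits; otherwise it starts a new cluster.
// If nothing clusters well, each failure is compared against every existing
// representative -- the worst case suspected above.
func clusterFailures(failures []string, maxDist int) map[string][]string {
	clusters := map[string][]string{}
	var reps []string
	for _, f := range failures {
		placed := false
		for _, rep := range reps {
			if editDistance(f, rep) <= maxDist {
				clusters[rep] = append(clusters[rep], f)
				placed = true
				break
			}
		}
		if !placed {
			reps = append(reps, f)
			clusters[f] = []string{f}
		}
	}
	return clusters
}

func main() {
	failures := []string{
		"timed out waiting for pod to be ready",
		"timed out waiting for pod to be running",
		"panic: runtime error: invalid memory address",
	}
	fmt.Printf("%d clusters\n", len(clusterFailures(failures, 10)))
}
```

Under a scheme like this, output that doesn't cluster produces many representatives, so runtime grows roughly with (number of failures) × (number of clusters) × (text length)², which matches the suspicion above.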
This should get us under the timeout, but it's still not as fast as it used to be (ref: #17643).
In case it's useful, I looked back 800 builds when I was trying to troubleshoot what happened. It's not that we're receiving a lot more test failures than we used to, but that what we're receiving isn't clustering well. https://docs.google.com/spreadsheets/d/1LkzEBhUJJH_6RCV6Q3kFTMMO64iDkCC9x0t5tGNlmX4/edit
Nice graphs :)
indeed, who can argue against that ;)
The triage job's duration has trended back down to ~20m over the last two weeks, so I bumped the frequency to 30m (#17833). Leaving this open to possibly root-cause why triage started having issues. If it was a "garbage in, garbage out" scenario, could we sanitize or defend against such data? If we made triage more performant, would it just naturally handle this within a reasonable time?
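One way to "sanitize or defend against such data", sketched here purely as an assumption (this is not triage's actual normalization; the patterns and the 4096-byte cap are made up for illustration): collapse volatile tokens and cap the length of failure text before clustering, so noisy-but-similar output still clusters cheaply.

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical sanitization pass, run before clustering.
var (
	hexID  = regexp.MustCompile(`\b[0-9a-f]{8,}\b`) // object UIDs, hashes
	digits = regexp.MustCompile(`\d+`)              // timestamps, ports, counts
)

// maxFailureLen is an assumed cap to defend against tests that dump
// enormous amounts of output into a single failure message.
const maxFailureLen = 4096

func normalize(failureText string) string {
	s := failureText
	if len(s) > maxFailureLen {
		s = s[:maxFailureLen]
	}
	s = hexID.ReplaceAllString(s, "UID")
	s = digits.ReplaceAllString(s, "N")
	return s
}

func main() {
	fmt.Println(normalize("pod nginx-6799fc88d8-x7k2p timed out after 300s at 05:23:15"))
	// Output: pod nginx-UID-xNkNp timed out after Ns at N:N:N
}
```

If the regression really was "garbage in", bounding and normalizing the text that reaches the clusterer would also bound the worst-case edit-distance cost, independent of making the clusterer itself faster.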
I'm not entirely sure what's going on with kettle, but it appears we're lagging on updates to the build tables: https://testgrid.k8s.io/sig-testing-misc#metrics-kettle&width=5
Also, a wild rewrite in Go has appeared! (ref: #18726) https://testgrid.k8s.io/sig-testing-misc#triage-go is looking close to 30min. We'll still need time to vet the results and see what other optimizations are possible before flipping go.k8s.io/triage to use this.
/remove-help
So, I think the triage rewrite has addressed the "job times out at 2h" part of this issue. I reopened #13432 because I suspect that the data source triage uses is out of date from ~1am PT to 12pm PT, which means go.k8s.io/triage is stale during CET working hours.
https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-test-infra-triage/1298417743338409985 (the highest-duration failure from the graph above) has this in its log:
That's an 80-minute pause... spent doing what? GC? Dealing with swap? Fighting a noisy neighbor? I glanced at 2 of the latest ~100min runs and they don't have anything like that.
Not swap (there's no swap). Maybe a noisy neighbor, but for that long it seems implausible. This sounds more like pathological clustering.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What happened:
From: https://storage.googleapis.com/k8s-gubernator/triage/index.html (emphasis added)
https://k8s-testgrid.appspot.com/sig-testing-misc#triage&width=5&graph-metrics=test-duration-minutes
The job's duration steadily increased from ~45m to 2h over the past 2 weeks, hitting its 2h threshold around 2020-05-13.
What you expected to happen:
This job used to take 20m to complete.
Please provide links to example occurrences, if any:
First timeout: https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-test-infra-triage/1260410739387011072
Latest timeout: https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-test-infra-triage/1262367914212724738
Anything else we need to know?:
I wonder if the top failure cluster that started occurring around 2020-05-04 has anything to do with this?
Raising the job's timeout could ensure that fresh data lands.
/area triage
/sig testing
/help
I don't have time to troubleshoot this today.