go.k8s.io/triage serving stale data, triage update job timing out at 2h #17625

Closed
spiffxp opened this issue May 18, 2020 · 21 comments
Labels
area/triage, kind/bug, lifecycle/rotten, sig/testing

Comments

spiffxp commented May 18, 2020

What happened:
From: https://storage.googleapis.com/k8s-gubernator/triage/index.html (emphasis added)

1416 clusters of 116102 failures (7395 in last day) out of 145044 builds from 4/30/2020, 5:00:11 PM to 5/14/2020, 11:27:04 PM.

https://k8s-testgrid.appspot.com/sig-testing-misc#triage&width=5&graph-metrics=test-duration-minutes

The job's duration has steadily increased from ~45m to 2h over the past two weeks, hitting its 2h timeout around 2020-05-13.

What you expected to happen:
This job used to take 20m to complete.

Please provide links to example occurrences, if any:
First timeout: https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-test-infra-triage/1260410739387011072
Latest timeout: https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-test-infra-triage/1262367914212724738

Anything else we need to know?:
I wonder if the top failure cluster that started occurring around 2020-05-04 has anything to do with this?

Raising the job's timeout could ensure that fresh data lands

/area triage
/sig testing
/help
I don't have time to troubleshoot this today

spiffxp added the kind/bug label May 18, 2020
k8s-ci-robot added the area/triage, help wanted, and sig/testing labels May 18, 2020
aojea commented May 18, 2020

Analyzing one occurrence for reference; I'm not familiar with these jobs at all, though, so disregard this if it doesn't make sense:

+ pypy3 summarize.py triage_builds.json triage_tests/shard_000000000000.json triage_tests/shard_000000000001.json --output failure_data.json --output_slices slices/failure_data_PREFIX.json
I0513 03:29:11.685] done loading failed tests
I0513 03:29:11.686] Clustering failures for 2805 unique tests...
I0513 03:29:11.693]    1/2805, 16608 failures, BeforeSuite

...

I0513 03:35:52.052] Finished locally clustering 2805 unique tests (187794 failures) in 6m40s
I0513 03:36:18.969] done clustering inside each test
I0513 03:36:18.970] Combining clustered failures for 2805 unique tests...
I0513 03:36:18.989]    1/2805, 37 clusters, BeforeSuite

...

I0513 05:23:15.995] 2805/2805, 1 clusters, [GuestPolicies] [Package update] [centos-8]
I0513 05:23:15.999] Finished clustering 2805 unique tests (187670 failures) into 4470 clusters in 106m57s

I0513 05:23:52.498] done clustering across tests
I0513 05:23:52.499] Rendering results...
I0513 05:25:37.552] Finished rendering results in 1m45s

... 

- [46/256 files][  8.4 MiB/220.5 MiB]   3% Done 600.7 KiB/s ETA 00:06:02        
Copying file://slices/failure_data_46.json [Content-Type=application/json]...
- [47/256 files][  8.6 MiB/220.5 MiB]   3% Done 608.1 KiB/s ETA 00:05:57        
{"component":"entrypoint","file":"prow/entrypoint/run.go:245","func":"k8s.io/test-infra/prow/entrypoint.gracefullyTerminate","level":"error","msg":"Process did not exit before 15s grace period","time":"2020-05-13T05:26:18Z"}

I0513 05:23:15.999] Finished clustering 2805 unique tests (187670 failures) into 4470 clusters in 106m57s <<<< is this normal?

dims commented May 18, 2020

@aojea the green jobs also had a similar number of tests, but yes, the red jobs had a lot more failures (for example, 2877 tests and 208690 failures in https://storage.googleapis.com/kubernetes-jenkins/logs/ci-test-infra-triage/1262429570640908288/build-log.txt)

Let's bump the timeout for now.

spiffxp commented May 19, 2020

The quantity of tests and failures doesn't seem wildly out of tolerance. The duration is.

I have a sneaking suspicion that integration tests are dumping a ton of output that doesn't cluster well, and are repeatedly hitting the worst-case scenario of computing edit distance against every single cluster of failures.
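
For illustration, here's a minimal sketch of the kind of greedy clustering that can hit that worst case. This is not the actual summarize.py code; the normalize() helper, the 0.9 threshold, and the use of difflib as an edit-distance stand-in are all assumptions.

# Illustrative sketch only -- not the real summarize.py logic.
# Greedy clustering compares each failure against every existing cluster
# representative, so output that never matches anything degrades to
# O(failures x clusters) similarity computations.
import difflib
import re

def normalize(text):
    # Hypothetical normalization: collapse digits and cap length so huge,
    # noisy dumps don't dominate comparison cost.
    return re.sub(r"\d+", "N", text)[:1000]

def cluster_failures(failure_texts, threshold=0.9):
    clusters = []  # list of (representative_text, member_texts)
    for text in failure_texts:
        norm = normalize(text)
        for rep, members in clusters:
            # SequenceMatcher stands in for the real edit-distance check.
            if difflib.SequenceMatcher(None, rep, norm).ratio() >= threshold:
                members.append(text)
                break
        else:
            # Nothing matched: we just paid a comparison against every cluster.
            clusters.append((norm, [text]))
    return clusters

If most failure text is unique noise, nearly every failure falls through to the else branch, which is the quadratic behavior that would explain runtime blowing up without a big jump in failure counts.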

spiffxp commented May 19, 2020

This should get us under the timeout, though still not as fast as it used to be. ref: #17643

spiffxp commented May 20, 2020

In case it's useful, I looked back 800 builds when I was trying to troubleshoot what happened. It's not that we're receiving a lot more test failures than we used to, but that what we're receiving isn't clustering well.

https://docs.google.com/spreadsheets/d/1LkzEBhUJJH_6RCV6Q3kFTMMO64iDkCC9x0t5tGNlmX4/edit
[screenshot: Screen Shot 2020-05-18 at 5 00 34 PM]

dims commented May 21, 2020

Nice graphs :)

aojea commented May 21, 2020

indeed, who can argue against that ;)

spiffxp commented Jun 4, 2020

The triage job has trended back down to ~20m over the last two weeks, so I bumped its frequency to every 30m: #17833

Leaving this open so we can possibly root-cause why triage started having issues. If it was a "garbage in, garbage out" scenario, could we sanitize or defend against such data? If we made triage more performant, would it just naturally handle this within a reasonable time?
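
In case it helps frame the "defend against such data" idea, here's a hedged sketch of what pre-clustering sanitization could look like; the caps, regexes, and function names below are purely illustrative assumptions, not anything triage actually does today.

# Illustrative sketch of sanitizing failure text before clustering.
# The limits and patterns here are assumptions, not triage's actual rules.
import re

MAX_FAILURE_TEXT = 4096       # assumed cap on text kept per failure
MAX_FAILURES_PER_TEST = 5000  # assumed cap to bound per-test clustering cost

def sanitize(text):
    text = text[:MAX_FAILURE_TEXT]                 # truncate huge dumps
    text = re.sub(r"0x[0-9a-fA-F]+", "0xX", text)  # pointers
    text = re.sub(r"[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}", "UUID", text)
    text = re.sub(r"\d+", "N", text)               # counts, timestamps
    return text

def defend(failures):
    # Sample pathological tests rather than letting them dominate runtime.
    if len(failures) > MAX_FAILURES_PER_TEST:
        failures = failures[:MAX_FAILURES_PER_TEST]
    return [sanitize(f) for f in failures]

The trade-off is that aggressive normalization can merge genuinely different failures, so any real version of this would need thresholds tuned against data like the spreadsheet above.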

spiffxp commented Jun 4, 2020

https://testgrid.k8s.io/sig-testing-misc#triage&width=5&graph-metrics=test-duration-minutes
[screenshot: Screen Shot 2020-06-04 at 10 15 57 AM]

spiffxp commented Jul 20, 2020

It's creeping back up to the 2h (or more) range
[screenshot: Screen Shot 2020-07-20 at 3 59 16 PM]

spiffxp commented Jul 21, 2020

1675 clusters of 162198 failures (5057 in last day) out of 139753 builds from 7/5/2020, 5:00:36 PM to 7/20/2020, 12:59:29 AM.

I'm not entirely sure what's going on with kettle, but it appears we're lagging on updates to the build tables

https://testgrid.k8s.io/sig-testing-misc#metrics-kettle&width=5

spiffxp commented Aug 7, 2020

Near 60min
[screenshot: Screen Shot 2020-08-07 at 9 41 31 AM]

Also, a wild rewrite in Go has appeared! (ref: #18726)

https://testgrid.k8s.io/sig-testing-misc#triage-go is looking close to 30min. Will still need time to vet the results and see what other optimizations are possible before flipping go.k8s.io/triage to use this.

spiffxp commented Aug 22, 2020

/remove-help
I believe this has been addressed by #18726

k8s-ci-robot removed the help wanted label Aug 22, 2020
spiffxp commented Aug 22, 2020

So, I think the triage rewrite has addressed the "job times out at 2h" part of this issue.

I reopened #13432 because I suspect that the data source triage uses is out of date from ~1am PT to 12pm PT, which means go.k8s.io/triage is stale during CET working hours.

spiffxp commented Aug 26, 2020

D'oh, and we're back to 2+ hour runs. Unfortunately, triage was failing for flag-parsing reasons, so it's not clear precisely when or how this happened. It makes me wonder what happened with test results during that window. Did we suddenly start getting a lot more?
[screenshot: Screen Shot 2020-08-26 at 1 30 20 PM]

spiffxp commented Aug 26, 2020

https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-test-infra-triage/1298417743338409985 - the highest duration failure from the above graph

Has this in its log

I0826 01:17:28.164179     141 cluster.go:145] Finished locally clustering 2992 unique tests (272394 failures) in 5m2.521312174s
I0826 02:41:08.214465     141 files.go:141] Done clustering inside each test

That's an ~80-minute pause... spent doing what? GC? Dealing with swap? Fighting a noisy neighbor?

Glanced at 2 of the latest ~100min runs and they don't have anything like that.
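
For the record, the gap between the two timestamps quoted above works out to about 84 minutes, which is easy to sanity-check (the format string below just mirrors the klog-style prefix in the log):

# Sanity-check the gap between the two log lines quoted above.
from datetime import datetime

fmt = "%m%d %H:%M:%S.%f"
start = datetime.strptime("0826 01:17:28.164179", fmt)
end = datetime.strptime("0826 02:41:08.214465", fmt)
print(end - start)  # 1:23:40.050286, i.e. roughly 84 minutes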

BenTheElder commented Aug 28, 2020

Not swap (there's no swap). Maybe a noisy neighbor, but for that long it seems implausible.

This sounds more like pathological clustering.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Nov 26, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Dec 26, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
