go.k8s.io/triage serving stale data, triage update job timing out at 2h #17625

Closed
spiffxp opened this issue May 18, 2020 · 21 comments
Labels
area/triage, kind/bug, lifecycle/rotten, sig/testing

Comments

spiffxp commented May 18, 2020

What happened:
From: https://storage.googleapis.com/k8s-gubernator/triage/index.html (emphasis added)

1416 clusters of 116102 failures (7395 in last day) out of 145044 builds from 4/30/2020, 5:00:11 PM to 5/14/2020, 11:27:04 PM.

https://k8s-testgrid.appspot.com/sig-testing-misc#triage&width=5&graph-metrics=test-duration-minutes

The job's duration has steadily increased from ~45m to 2h over the past two weeks, hitting its 2h timeout around 2020-05-13.

What you expected to happen:
This job used to take 20m to complete.

Please provide links to example occurrences, if any:
First timeout: https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-test-infra-triage/1260410739387011072
Latest timeout: https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-test-infra-triage/1262367914212724738

Anything else we need to know?:
I wonder if the top failure cluster that started occurring around 2020-05-04 has anything to do with this?

Raising the job's timeout could ensure that fresh data lands

/area triage
/sig testing
/help
I don't have time to troubleshoot this today

spiffxp added the kind/bug label May 18, 2020
k8s-ci-robot added the area/triage, help wanted, and sig/testing labels May 18, 2020
aojea commented May 18, 2020

Analyzing one occurrence for reference; I'm not familiar with these jobs at all, though, so disregard this if it doesn't make sense:

+ pypy3 summarize.py triage_builds.json triage_tests/shard_000000000000.json triage_tests/shard_000000000001.json --output failure_data.json --output_slices slices/failure_data_PREFIX.json
I0513 03:29:11.685] done loading failed tests
I0513 03:29:11.686] Clustering failures for 2805 unique tests...
I0513 03:29:11.693]    1/2805, 16608 failures, BeforeSuite

...

I0513 03:35:52.052] Finished locally clustering 2805 unique tests (187794 failures) in 6m40s
I0513 03:36:18.969] done clustering inside each test
I0513 03:36:18.970] Combining clustered failures for 2805 unique tests...
I0513 03:36:18.989]    1/2805, 37 clusters, BeforeSuite

...

I0513 05:23:15.995] 2805/2805, 1 clusters, [GuestPolicies] [Package update] [centos-8]
I0513 05:23:15.999] Finished clustering 2805 unique tests (187670 failures) into 4470 clusters in 106m57s

I0513 05:23:52.498] done clustering across tests
I0513 05:23:52.499] Rendering results...
I0513 05:25:37.552] Finished rendering results in 1m45s

... 

- [46/256 files][  8.4 MiB/220.5 MiB]   3% Done 600.7 KiB/s ETA 00:06:02        
Copying file://slices/failure_data_46.json [Content-Type=application/json]...
- [47/256 files][  8.6 MiB/220.5 MiB]   3% Done 608.1 KiB/s ETA 00:05:57        
{"component":"entrypoint","file":"prow/entrypoint/run.go:245","func":"k8s.io/test-infra/prow/entrypoint.gracefullyTerminate","level":"error","msg":"Process did not exit before 15s grace period","time":"2020-05-13T05:26:18Z"}

I0513 05:23:15.999] Finished clustering 2805 unique tests (187670 failures) into 4470 clusters in 106m57s <<<< is this normal?

dims commented May 18, 2020

@aojea the green jobs also had a similar number of tests, but yes, the red jobs had a lot more failures (for example, 2877 tests and 208690 failures in https://storage.googleapis.com/kubernetes-jenkins/logs/ci-test-infra-triage/1262429570640908288/build-log.txt)

Let's bump the timeout for now.

spiffxp commented May 19, 2020

The quantity of tests and failures doesn't seem wildly out of tolerance. The duration is.

I have a sneaking suspicion that integration tests are dumping a ton of output that doesn't cluster well, and are repeatedly hitting the worst-case scenario of computing edit distance against every single cluster of failures.
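
For illustration, here's a minimal sketch of the kind of greedy clustering that can hit that worst case. This is not the actual summarize.py code; the normalize() helper, the 0.9 threshold, and the use of difflib as an edit-distance stand-in are all assumptions.

# Illustrative sketch only -- not the real summarize.py logic.
# Greedy clustering compares each failure against every existing cluster
# representative, so output that never matches anything degrades to
# O(failures x clusters) similarity computations.
import difflib
import re

def normalize(text):
    # Hypothetical normalization: collapse digits and cap length so huge,
    # noisy dumps don't dominate comparison cost.
    return re.sub(r"\d+", "N", text)[:1000]

def cluster_failures(failure_texts, threshold=0.9):
    clusters = []  # list of (representative_text, member_texts)
    for text in failure_texts:
        norm = normalize(text)
        for rep, members in clusters:
            # SequenceMatcher stands in for the real edit-distance check.
            if difflib.SequenceMatcher(None, rep, norm).ratio() >= threshold:
                members.append(text)
                break
        else:
            # Nothing matched: we just paid a comparison against every cluster.
            clusters.append((norm, [text]))
    return clusters

If most failure text is unique noise, nearly every failure falls through to the else branch, which is the quadratic behavior that would explain runtime blowing up without a big jump in failure counts.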

spiffxp commented May 19, 2020

This should get us under the timeout, though still not as fast as it used to be. ref: #17643

spiffxp commented May 20, 2020

In case it's useful, I looked back 800 builds when I was trying to troubleshoot what happened. It's not that we're receiving a lot more test failures than we used to, but that what we're receiving isn't clustering well.

https://docs.google.com/spreadsheets/d/1LkzEBhUJJH_6RCV6Q3kFTMMO64iDkCC9x0t5tGNlmX4/edit
[screenshot: Screen Shot 2020-05-18 at 5 00 34 PM]

dims commented May 21, 2020

Nice graphs :)

aojea commented May 21, 2020

indeed, who can argue against that ;)

spiffxp commented Jun 4, 2020

The triage job has trended back down to ~20m over the last two weeks, so I bumped its frequency to every 30m: #17833

Leaving this open so we can possibly root-cause why triage started having issues. If it was a "garbage in, garbage out" scenario, could we sanitize or defend against such data? If we made triage more performant, would it just naturally handle this within a reasonable time?
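
In case it helps frame the "defend against such data" idea, here's a hedged sketch of what pre-clustering sanitization could look like; the caps, regexes, and function names below are purely illustrative assumptions, not anything triage actually does today.

# Illustrative sketch of sanitizing failure text before clustering.
# The limits and patterns here are assumptions, not triage's actual rules.
import re

MAX_FAILURE_TEXT = 4096       # assumed cap on text kept per failure
MAX_FAILURES_PER_TEST = 5000  # assumed cap to bound per-test clustering cost

def sanitize(text):
    text = text[:MAX_FAILURE_TEXT]                 # truncate huge dumps
    text = re.sub(r"0x[0-9a-fA-F]+", "0xX", text)  # pointers
    text = re.sub(r"[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}", "UUID", text)
    text = re.sub(r"\d+", "N", text)               # counts, timestamps
    return text

def defend(failures):
    # Sample pathological tests rather than letting them dominate runtime.
    if len(failures) > MAX_FAILURES_PER_TEST:
        failures = failures[:MAX_FAILURES_PER_TEST]
    return [sanitize(f) for f in failures]

The trade-off is that aggressive normalization can merge genuinely different failures, so any real version of this would need thresholds tuned against data like the spreadsheet above.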

spiffxp commented Jun 4, 2020

https://testgrid.k8s.io/sig-testing-misc#triage&width=5&graph-metrics=test-duration-minutes
[screenshot: Screen Shot 2020-06-04 at 10 15 57 AM]

spiffxp commented Jul 20, 2020

It's creeping back up to the 2h (or more) range
[screenshot: Screen Shot 2020-07-20 at 3 59 16 PM]

spiffxp commented Jul 21, 2020

1675 clusters of 162198 failures (5057 in last day) out of 139753 builds from 7/5/2020, 5:00:36 PM to 7/20/2020, 12:59:29 AM.

I'm not entirely sure what's going on with kettle, but it appears we're lagging on updates to the build tables

https://testgrid.k8s.io/sig-testing-misc#metrics-kettle&width=5

spiffxp commented Aug 7, 2020

Near 60min
[screenshot: Screen Shot 2020-08-07 at 9 41 31 AM]

Also, a wild rewrite in Go has appeared! (ref: #18726)

https://testgrid.k8s.io/sig-testing-misc#triage-go is looking close to 30min. Will still need time to vet the results and see what other optimizations are possible before flipping go.k8s.io/triage to use this.

spiffxp commented Aug 22, 2020

/remove-help
I believe this has been addressed by #18726

k8s-ci-robot removed the help wanted label Aug 22, 2020
spiffxp commented Aug 22, 2020

So, I think the triage rewrite has addressed the "job times out at 2h" part of this issue.

I reopened #13432 because I suspect that the data source triage uses is out of date from ~1am PT to 12pm PT, which means go.k8s.io/triage is stale during CET working hours.

spiffxp commented Aug 26, 2020

D'oh, and we're back to 2+ hour runs. Unfortunately, triage was failing for flag-parsing reasons, so it's not clear precisely when or how this happened. It makes me wonder what happened with test results during that window. Did we suddenly start getting a lot more?
[screenshot: Screen Shot 2020-08-26 at 1 30 20 PM]

spiffxp commented Aug 26, 2020

https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-test-infra-triage/1298417743338409985 - the highest duration failure from the above graph

Has this in its log

I0826 01:17:28.164179     141 cluster.go:145] Finished locally clustering 2992 unique tests (272394 failures) in 5m2.521312174s
I0826 02:41:08.214465     141 files.go:141] Done clustering inside each test

That's an ~80-minute pause... spent doing what? GC? Dealing with swap? Fighting a noisy neighbor?

Glanced at 2 of the latest ~100min runs and they don't have anything like that.
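
For the record, the gap between the two timestamps quoted above works out to about 84 minutes, which is easy to sanity-check (the format string below just mirrors the klog-style prefix in the log):

# Sanity-check the gap between the two log lines quoted above.
from datetime import datetime

fmt = "%m%d %H:%M:%S.%f"
start = datetime.strptime("0826 01:17:28.164179", fmt)
end = datetime.strptime("0826 02:41:08.214465", fmt)
print(end - start)  # 1:23:40.050286, i.e. roughly 84 minutes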

BenTheElder commented Aug 28, 2020

Not swap (there's no swap). Maybe a noisy neighbor, but for that long it seems implausible.

This sounds more like pathological clustering.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Nov 26, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Dec 26, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
