Test result ingestion into BigQuery via Kettle is hiccuping increasingly frequently #13432

spiffxp · 2019-07-12T20:19:19Z

What happened:
http://velodrome.k8s.io/dashboard/db/bigquery-metrics?panelId=12&fullscreen&orgId=1&from=now-6M&to=now

Alerts are firing frequently. We've only had one or two weeks in Q2 where an alert hasn't fired.

What you expected to happen:

No alerts to fire.

How to reproduce it (as minimally and precisely as possible):

Please provide links to example occurrences, if any:

Anything else we need to know?:

I suspect this isn't due to Kettle being written in Python, but rather to the pattern that Kettle follows. It may require a redesign or rewrite, but I'm not sure we need to switch languages for performance. I suspect I/O is our problem here (due to unbounded growth of a SQLite database).
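
One way to check the database-growth suspicion would be to track the size of the SQLite file over time. The following is only a minimal sketch: the function name `sqlite_size_bytes` and the idea of logging its result are assumptions for illustration, not part of Kettle. It relies only on the standard `page_size`, `page_count`, and `freelist_count` pragmas.

```python
import sqlite3


def sqlite_size_bytes(path):
    """Return (allocated_bytes, free_bytes) for the SQLite database at `path`.

    Hypothetical helper: allocated_bytes is page_size * page_count, and
    free_bytes is the space held by pages on the freelist (reclaimable
    with VACUUM).
    """
    conn = sqlite3.connect(path)
    try:
        page_size = conn.execute("PRAGMA page_size").fetchone()[0]
        page_count = conn.execute("PRAGMA page_count").fetchone()[0]
        free_pages = conn.execute("PRAGMA freelist_count").fetchone()[0]
    finally:
        conn.close()
    return page_size * page_count, page_size * free_pages
```

Recording these two numbers on every update run would show whether the database really grows without bound, and how much of it is dead space.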

/area kettle
/area metrics

spiffxp · 2019-07-12T20:28:47Z

/sig testing

spiffxp · 2019-08-14T20:06:38Z

/priority important-soon
I am concerned this is going to tip over at some point

fejta-bot · 2019-11-12T21:02:49Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot · 2019-12-12T21:46:19Z

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

spiffxp · 2019-12-12T23:04:03Z

/remove-lifecycle rotten
It's still hiccuping, but alerts are no longer firing as often

fejta-bot · 2020-03-11T23:58:10Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

spiffxp · 2020-03-12T01:34:29Z

/remove-lifecycle stale
/close

k8s-ci-robot · 2020-03-12T01:34:30Z

@spiffxp: Closing this issue.

In response to this:

/remove-lifecycle stale
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

spiffxp · 2020-08-22T01:27:34Z

/reopen

We no longer have velodrome. But we do have a job that fails if k8s-gubernator:build.all is stale: https://testgrid.k8s.io/sig-testing-misc#metrics-kettle&width=5

Most recent example

at 7:33am PT k8s-gubernator:build.all is >6h old
at 11:37am PT k8s-gubernator:build.all is still >6h old
by 12:38pm PT k8s-gubernator:build.all is fresh again

This suggests Kettle is pretty consistently not updating BigQuery from 1am to 12pm PT, which means go.k8s.io/triage is serving stale data during CET working hours.
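
For reference, the freshness check described above can be sketched roughly as follows. This is not the actual metrics job: the function names are hypothetical, the 6-hour threshold is taken from the ">6h old" observations, and it assumes the table's last-modified time is available as milliseconds since the epoch (as in the `lastModifiedTime` field of BigQuery table metadata).

```python
from datetime import datetime, timedelta, timezone

# Threshold assumed from the ">6h old" observations in this comment.
MAX_AGE = timedelta(hours=6)


def table_age(last_modified_ms, now=None):
    """Return how long ago the table was modified, as a timedelta.

    `last_modified_ms` is milliseconds since the epoch (int or numeric string).
    """
    if now is None:
        now = datetime.now(timezone.utc)
    modified = datetime.fromtimestamp(int(last_modified_ms) / 1000, tz=timezone.utc)
    return now - modified


def is_stale(last_modified_ms, now=None, max_age=MAX_AGE):
    """Return True if the table is older than `max_age`."""
    return table_age(last_modified_ms, now) > max_age
```

Passing `now` explicitly keeps the check easy to test against recorded timestamps.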

k8s-ci-robot · 2020-08-22T01:27:45Z

@spiffxp: Reopened this issue.

In response to this:

/reopen

We no longer have velodrome. But we do have a job that fails if k8s-gubernator:build.all is stale: https://testgrid.k8s.io/sig-testing-misc#metrics-kettle&width=5

Most recent example

at 7:33am PT k8s-gubernator:build.all is >6h old

at 11:37am PT k8s-gubernator:build.all is still >6h old

by 12:38pm PT k8s-gubernator:build.all is fresh again

This suggests kettle is pretty consistently not updating bigquery from 1am - 12pm PT, which means go.k8s.io/triage is serving stale data during CET working hours

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

spiffxp · 2020-08-22T01:30:06Z

FYI @MushuEE

MushuEE · 2020-08-24T04:46:53Z

@spiffxp thanks, will start looking into it.
/assign

MushuEE · 2020-09-11T20:15:47Z

The deltas appear to be increasing in span; I'm not sure whether Kettle is taking longer to process all the data or not: https://k8s-testgrid.appspot.com/sig-testing-misc#metrics-kettle

It might be worth publishing how long each step within a command takes to run. I will add logging shortly.
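
A rough sketch of what per-step timing could look like, using only the standard library. The `timed` helper and the `kettle.timing` logger name are hypothetical, not what was actually added to Kettle.

```python
import logging
import time
from contextlib import contextmanager


@contextmanager
def timed(step, log=None):
    """Log how long the wrapped block took, even if it raises."""
    if log is None:
        log = logging.getLogger("kettle.timing")  # hypothetical logger name
    start = time.monotonic()
    try:
        yield
    finally:
        log.info("%s took %.1fs", step, time.monotonic() - start)
```

Usage would be along the lines of `with timed("bq load build.week"): ...` around each step, so slow steps show up in the logs without changing the step itself.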

MushuEE · 2020-09-11T21:05:44Z

 resource: {…}  
 severity: "ERROR"  
 textPayload: "Traceback (most recent call last):
  File "/kettle/update.py", line 59, in <module>
    main()
  File "/kettle/update.py", line 47, in main
    call(bq_cmd + bq_ext + ' k8s-gubernator:build.week build_week.json.gz schema.json')
  File "/kettle/update.py", line 25, in call
    raise OSError('invocation failed')
OSError: invocation failed
"

Seeing fairly frequent failures on build.week calls, though there is not much more context.
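
To get more context out of failures like the one above, the command wrapper could include the exit code and the tail of stderr in the exception. This is only a sketch under assumptions: it is not the real `call` in `/kettle/update.py`, and the 2000-character tail is an arbitrary choice.

```python
import subprocess


def call(cmd):
    """Run a shell command; on failure, raise OSError with exit code and stderr tail."""
    result = subprocess.run(cmd, shell=True, stderr=subprocess.PIPE, text=True)
    if result.returncode != 0:
        tail = result.stderr[-2000:]  # arbitrary cap to keep log entries bounded
        raise OSError(
            "invocation failed (exit %d): %s\n%s" % (result.returncode, cmd, tail)
        )
```

With this, the log entry would carry whatever the failing command printed to stderr instead of only "invocation failed".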

fejta-bot · 2020-12-10T21:48:58Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

MushuEE · 2020-12-14T06:21:37Z

/remove-lifecycle stale

spiffxp · 2021-02-25T02:53:15Z

Still seeing hiccups, but the duration has slowed

MushuEE · 2021-02-25T17:33:15Z

Thanks, I will check on the instances

MushuEE · 2021-02-25T17:36:43Z

It does appear we had another lockup.

fejta-bot · 2021-05-26T18:06:08Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

fejta-bot · 2021-06-25T18:39:57Z

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

k8s-triage-robot · 2021-07-26T22:17:38Z

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

k8s-ci-robot · 2021-07-26T22:17:44Z

@k8s-triage-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

spiffxp added the kind/bug Categorizes issue or PR as related to a bug. label Jul 12, 2019

k8s-ci-robot added area/kettle area/metrics labels Jul 12, 2019

k8s-ci-robot added the sig/testing Categorizes an issue or PR as relevant to SIG Testing. label Jul 12, 2019

k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Aug 14, 2019

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 12, 2019

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 12, 2019

k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Dec 12, 2019

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 11, 2020

k8s-ci-robot closed this as completed Mar 12, 2020

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 12, 2020

spiffxp mentioned this issue Jul 15, 2020

go.k8s.io/triage is serving stale data #17069

Closed

k8s-ci-robot reopened this Aug 22, 2020

spiffxp mentioned this issue Aug 22, 2020

go.k8s.io/triage serving stale data, triage update job timing out at 2h #17625

Closed

k8s-ci-robot assigned MushuEE Aug 24, 2020

MushuEE mentioned this issue Sep 11, 2020

Add call timing and error dumping to kettle #19195

Merged

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 10, 2020

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 14, 2020

MushuEE mentioned this issue Mar 4, 2021

k8s-gubernator:build dataset is stale #20599

Closed

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 26, 2021

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 25, 2021

k8s-ci-robot closed this as completed Jul 26, 2021
