Test result ingestion into BigQuery via Kettle is hiccuping increasingly frequently #13432

Closed
spiffxp opened this issue Jul 12, 2019 · 23 comments
Labels
area/kettle
area/metrics
kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
sig/testing: Categorizes an issue or PR as relevant to SIG Testing.

Comments

@spiffxp
Member

spiffxp commented Jul 12, 2019

What happened:
http://velodrome.k8s.io/dashboard/db/bigquery-metrics?panelId=12&fullscreen&orgId=1&from=now-6M&to=now

Alerts are firing frequently. We've only had one or two weeks in Q2 where an alert hasn't fired.

What you expected to happen:

No alerts to fire.

How to reproduce it (as minimally and precisely as possible):

Please provide links to example occurrences, if any:

Anything else we need to know?:

I suspect this isn't due to Kettle being written in Python, but rather to the pattern that Kettle follows. It may require a redesign or rewrite, but I'm not convinced we have to switch languages for performance. I suspect I/O is our problem here (due to unbounded growth of a SQLite database).

/area kettle
/area metrics

spiffxp added the kind/bug label Jul 12, 2019
@spiffxp
Member Author

spiffxp commented Jul 12, 2019

/sig testing

k8s-ci-robot added the sig/testing label Jul 12, 2019
@spiffxp
Member Author

spiffxp commented Aug 14, 2019

/priority important-soon
I am concerned this is going to tip over at some point.

[Screenshot: Screen Shot 2019-08-14 at 1 05 16 PM]

k8s-ci-robot added the priority/important-soon label Aug 14, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Nov 12, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

k8s-ci-robot added lifecycle/rotten and removed lifecycle/stale labels Dec 12, 2019
@spiffxp
Member Author

spiffxp commented Dec 12, 2019

/remove-lifecycle rotten
It's still hiccuping, but alerts are no longer firing as often
[Screenshot: Screen Shot 2019-12-12 at 3 03 38 PM]

k8s-ci-robot removed the lifecycle/rotten label Dec 12, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Mar 11, 2020
@spiffxp
Member Author

spiffxp commented Mar 12, 2020

/remove-lifecycle stale
/close

@k8s-ci-robot
Contributor

@spiffxp: Closing this issue.

k8s-ci-robot removed the lifecycle/stale label Mar 12, 2020
@spiffxp
Member Author

spiffxp commented Aug 22, 2020

/reopen

We no longer have Velodrome, but we do have a job that fails if k8s-gubernator:build.all is stale: https://testgrid.k8s.io/sig-testing-misc#metrics-kettle&width=5

[Screenshot: Screen Shot 2020-08-21 at 6 17 25 PM]

Most recent example

This suggests Kettle is pretty consistently not updating BigQuery from 1am to 12pm PT, which means go.k8s.io/triage is serving stale data during CET working hours.
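
For illustration only, here is a minimal sketch of the kind of staleness check described above. It is not the actual metrics-kettle job; the 6-hour threshold and the function names are assumptions, and the table's last-update timestamps are passed in rather than fetched from BigQuery.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold: the real job's limit is not stated in this thread.
MAX_AGE = timedelta(hours=6)


def is_stale(last_modified, now, max_age=MAX_AGE):
    """Return True when the last update is older than max_age."""
    return now - last_modified > max_age


def staleness_gaps(update_times, max_age=MAX_AGE):
    """Return (start, end) pairs for gaps between updates longer than max_age."""
    ordered = sorted(update_times)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_age]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # An update 11 hours ago (roughly the 1am to 12pm window) counts as stale.
    print(is_stale(now - timedelta(hours=11), now))
```

Listing the gaps rather than only the latest state would make it easier to see whether the stale window really recurs at the same hours each day.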

@k8s-ci-robot
Contributor

@spiffxp: Reopened this issue.

@spiffxp
Member Author

spiffxp commented Aug 22, 2020

FYI @MushuEE

@MushuEE
Contributor

MushuEE commented Aug 24, 2020

@spiffxp thanks, will start looking into it.
/assign

@MushuEE
Contributor

MushuEE commented Sep 11, 2020

The span of the deltas is increasing; not sure whether Kettle is taking longer to process all the data: https://k8s-testgrid.appspot.com/sig-testing-misc#metrics-kettle

It might be worth publishing how long each step takes to run within the commands. Will add logging shortly.
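
One way the per-step timing could look, as a rough sketch rather than what Kettle actually does: a context manager that logs the elapsed time of a named step. The logger name, the `timed` helper, and the step label in the usage example are all hypothetical.

```python
import logging
import time
from contextlib import contextmanager

# Hypothetical logger name, chosen for this sketch only.
log = logging.getLogger("kettle.timing")


@contextmanager
def timed(step, sink=None):
    """Log how long the wrapped block took, even if it raises."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        log.info("step=%s seconds=%.3f", step, elapsed)
        if sink is not None:
            # Optionally keep the duration so it can be published later.
            sink[step] = elapsed


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    durations = {}
    with timed("example step", durations):
        time.sleep(0.1)
    print(durations)
```

Because the duration is recorded in a `finally` clause, a step that fails still reports how long it ran before failing, which helps separate slow steps from hung ones.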

@MushuEE
Contributor

MushuEE commented Sep 11, 2020

 resource: {…}  
 severity: "ERROR"  
 textPayload: "Traceback (most recent call last):
  File "/kettle/update.py", line 59, in <module>
    main()
  File "/kettle/update.py", line 47, in main
    call(bq_cmd + bq_ext + ' k8s-gubernator:build.week build_week.json.gz schema.json')
  File "/kettle/update.py", line 25, in call
    raise OSError('invocation failed')
OSError: invocation failed
" 

Seeing fairly frequent failures on build.week calls, though there is not much more context...
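
The traceback only shows that `call` raised `OSError('invocation failed')`; the body of `/kettle/update.py` is not shown in this thread. As a sketch of how more context could be surfaced, assuming `call` runs a shell command string, a variant might capture the exit code and the end of stderr:

```python
import subprocess


def call(cmd):
    """Run a shell command; on failure, raise with the exit code and stderr tail.

    Assumption: cmd is a single shell string, as the concatenation in the
    traceback suggests. This is not the real update.py implementation.
    """
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    if result.returncode != 0:
        # Keep only the end of stderr so the log entry stays bounded.
        tail = result.stderr.strip()[-2000:]
        raise OSError(
            f"invocation failed (exit {result.returncode}): {cmd}\n{tail}"
        )
    return result.stdout


if __name__ == "__main__":
    print(call("echo ok"))
```

With the exit code and stderr in the exception message, the same log entry would show why the load failed instead of only that it failed.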

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Dec 10, 2020
@MushuEE
Contributor

MushuEE commented Dec 14, 2020

/remove-lifecycle stale

k8s-ci-robot removed the lifecycle/stale label Dec 14, 2020
@spiffxp
Member Author

spiffxp commented Feb 25, 2021

Still seeing hiccups, but the duration has slowed

@MushuEE
Contributor

MushuEE commented Feb 25, 2021

Thanks, I will check on the instances

@MushuEE
Contributor

MushuEE commented Feb 25, 2021

It does appear we had another lockup.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label May 26, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

k8s-ci-robot added lifecycle/rotten and removed lifecycle/stale labels Jun 25, 2021
@k8s-triage-robot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

5 participants