
kettle appears to be down #23135

Closed
BenTheElder opened this issue Aug 4, 2021 · 41 comments · Fixed by #23323
Labels
kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/active: Indicates that an issue or PR is actively being worked on by a contributor.
sig/testing: Categorizes an issue or PR as relevant to SIG Testing.

Comments

@BenTheElder (Member)

What happened:

What you expected to happen:

kettle should be up

How to reproduce it (as minimally and precisely as possible):

....

Please provide links to example occurrences, if any:

see above

Anything else we need to know?:

We should probably have alerting for the kettle-metrics job.
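
One possible shape for that alerting (a sketch, not anything that exists in test-infra today): periodically check how stale the k8s-gubernator:build.all table is, mirroring the 6-hour threshold the metrics-kettle job reports later in this thread. It assumes the google-cloud-bigquery client library and application-default credentials with read access to the table:

```python
# Hedged sketch of a staleness alert for the table kettle feeds.
# Assumptions: google-cloud-bigquery is installed, ADC can read the table,
# and something (a cron, a Prow periodic, etc.) runs this and acts on a
# non-zero exit code. Not part of the existing kettle-metrics job.
import sys
from datetime import datetime, timezone

from google.cloud import bigquery

TABLE = "k8s-gubernator.build.all"  # same table the metrics-kettle job checks
MAX_AGE_HOURS = 6                   # same threshold as the job's error message


def main() -> int:
    client = bigquery.Client()
    table = client.get_table(TABLE)  # table metadata, including last-modified time
    age_hours = (datetime.now(timezone.utc) - table.modified).total_seconds() / 3600
    if age_hours > MAX_AGE_HOURS:
        print(f"ALERT: {TABLE} is {age_hours:.1f} hours old (max allowed: {MAX_AGE_HOURS})")
        return 1
    print(f"OK: {TABLE} is {age_hours:.1f} hours old")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```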

@BenTheElder (Member Author)

for kettle staging:

2021-08-04 15:06:42.135 PDT
Traceback (most recent call last):
  File "make_db.py", line 376, in <module>
    OPTIONS.buildlimit,
  File "make_db.py", line 330, in main
    threads, client_class, build_limit)
  File "make_db.py", line 269, in get_all_builds
    db.insert_build(build_dir, started, finished)
  File "/kettle/model.py", line 88, in insert_build
    self.db.execute('insert into build_junit_missing values(?)', (rowid,))
sqlite3.OperationalError: database or disk is full
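
For anyone digging into the staging disk-full error, a quick way to see how big the sqlite database has grown and how much of it is dead space; a minimal sketch assuming shell access to the pod, with a hypothetical DB path (I don't know exactly where make_db.py writes it):

```python
# Hypothetical diagnostic for the "database or disk is full" error above.
# DB_PATH is a guess; point it at wherever kettle's make_db.py keeps its sqlite DB.
import os
import sqlite3

DB_PATH = "/data/build.db"  # hypothetical location

conn = sqlite3.connect(DB_PATH)
page_size = conn.execute("PRAGMA page_size").fetchone()[0]
page_count = conn.execute("PRAGMA page_count").fetchone()[0]
free_pages = conn.execute("PRAGMA freelist_count").fetchone()[0]
conn.close()

print(f"file size on disk:        {os.path.getsize(DB_PATH) / 1e9:.2f} GB")
print(f"logical size:             {page_size * page_count / 1e9:.2f} GB")
print(f"reclaimable (free pages): {page_size * free_pages / 1e9:.2f} GB")
# A large freelist means VACUUM would shrink the file, but VACUUM itself needs
# roughly the database's size in temporary space, which a full disk won't have;
# growing the disk is the simpler fix.
```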

@BenTheElder (Member Author)

from kettle:

2021-07-31 16:45:21.465 PDT b'Error while reading data, error message: JSON parsing error in row starting at position 1103117119: Parser terminated before end of string'

@BenTheElder (Member Author)

Based on the logs, also noting that we are processing builds from buckets we don't control / that are not part of the project, like gs://istio-circleci and gs://k8s-conformance-gardener. Not sure we should be doing that.

BenTheElder added the sig/testing label on Aug 4, 2021
@jdnurme commented Aug 4, 2021

Is it possible that since test size has increased, we are no longer clearing out old data efficiently enough to make space for incoming test runs? Is there a manual step we can take to free up disk space and gain access to recent testing data?

@BenTheElder (Member Author) commented Aug 4, 2021

The staging instance should not be affecting the main instance, they have their own disks.
The main instance has no errors about disk space and is on a disk an order of magnitude larger.

I don't see logs after 2021-07-31 16:45:21.465 PDT however, which seems odd.

@BenTheElder (Member Author)

The pod was created on the 31st, FWIW.

Revision Name Status Restarts Created on
68 kettle-67d95d4546-9l64g Running 0 Jul 31, 2021, 11:40:44 AM

@BenTheElder (Member Author)

I deleted the pod since it seemed to be stuck inexplicably; there is a new pod running. Will check back later.

@jdnurme commented Aug 5, 2021

> I deleted the pod since it seemed to be stuck inexplicably; there is a new pod running. Will check back later.

Still seeing data locked at 7/28

@BenTheElder (Member Author) commented Aug 5, 2021

It still appears to be inserting builds, per the pod logs.

2021-08-05 14:36:30.078 PDT inserting build: gs://istio-circleci/e2e-pilot-auth-v1alpha3-v2/444884
2021-08-05 14:36:30.078 PDT inserting build: gs://istio-circleci/e2e-pilot-auth-v1alpha3-v2/444883
2021-08-05 14:36:30.078 PDT inserting build: gs://istio-circleci/e2e-pilot-auth-v1alpha3-v2/444882

@BenTheElder (Member Author)

I think it's going to take some time to catch up with the backlog.

@BenTheElder (Member Author)

per @spiffxp this can take 8+ hours on restart 🔥 #17069

I think this tooling pretty obviously needs investment, but the best I can offer at this moment is collecting that evidence and attempting to at least get it running again for now.

Even once it is ready again, it will take at least another hour or two for triage (which runs as a periodic CI job) to process the new data.

@BenTheElder (Member Author)

We've hit the same problem and hung again; I think this is the issue:

2021-08-05 16:45:24.551 PDT Warnings encountered during job execution:
2021-08-05 16:45:24.551 PDT
2021-08-05 16:45:24.551 PDT b'Error while reading data, error message: JSON parsing error in row starting at position 393911182: Row size is larger than: 104857600.'
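
104857600 bytes is BigQuery's 100 MB limit on a single row in a JSON load, so some build row is carrying an enormous payload (almost certainly failure text, per the discussion below). A hedged way to find the offenders in the upload file, assuming it is (or has been converted to) newline-delimited JSON; the filename is the placeholder used elsewhere in this thread:

```python
# Rough diagnostic, not part of kettle: scan the gzipped newline-delimited
# JSON that gets handed to `bq load` and flag rows that exceed the 100 MB
# row limit or fail to parse (as in the earlier "Parser terminated before
# end of string" error).
import gzip
import json
import sys

ROW_LIMIT = 104857600  # the limit quoted in the bq error, i.e. 100 MiB


def scan(path: str) -> None:
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            size = len(line.encode("utf-8"))
            if size > ROW_LIMIT:
                print(f"row {lineno}: {size} bytes, over the 100 MiB limit")
            try:
                json.loads(line)
            except json.JSONDecodeError as err:
                print(f"row {lineno}: does not parse: {err}")


if __name__ == "__main__":
    # "build_<table>.json.gz" is the placeholder name used in this thread.
    scan(sys.argv[1] if len(sys.argv) > 1 else "build_<table>.json.gz")
```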

BenTheElder self-assigned this on Aug 6, 2021
@liggitt (Member) commented Aug 10, 2021

/subscribe

@BenTheElder (Member Author) commented Aug 11, 2021

My team has an internal "summit" for the next ~two days and I've needed to prepare for that, so I'm not sure I'll make any progress on this before then (I previously took this on as our oncall, but we don't formally support this, and E_TOO_MANY_THINGS ...).

We're probably going to need to patch the kettle source to drop/truncate excessively large failure messages (and ideally log which ones, so we can look into what is uploading > 100 MB failure logs), get that deployed, then wait a day or so for kettle to catch up, and then another hour or two for triage to catch up from there ...
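
To make that concrete, here is a minimal sketch of the kind of patch meant, not the actual kettle change: cap oversized failure text before the row is serialized and log which build it came from. The field names ("test", "failure_text", "path", "name") are assumptions about the per-build row shape, not verified against kettle's schema:

```python
# Sketch only: cap failure text before a build row is serialized for upload,
# and log the offending test/build. The dict layout here (row["test"],
# test["failure_text"], row["path"]) is assumed, not taken from kettle's code.
import logging

MAX_FAILURE_BYTES = 256 * 1024  # arbitrary cap, far below BigQuery's 100 MB row limit


def truncate_failures(row: dict) -> dict:
    for test in row.get("test", []):
        failure = test.get("failure_text")
        if not failure:
            continue
        raw = failure.encode("utf-8")
        if len(raw) > MAX_FAILURE_BYTES:
            logging.warning(
                "truncating failure_text for %s in %s (%d bytes)",
                test.get("name"), row.get("path"), len(raw),
            )
            test["failure_text"] = (
                raw[:MAX_FAILURE_BYTES].decode("utf-8", "ignore")
                + "\n[truncated by kettle]"
            )
    return row
```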

@aojea (Member) commented Aug 17, 2021

/cc

@matthyx (Contributor) commented Aug 18, 2021

@BenTheElder do you need help on this?

@BenTheElder (Member Author)

Sorry, I have not been able to work on this or much else
/unassign

@matthyx (Contributor) commented Aug 20, 2021

/assign

@BenTheElder (Member Author)

null__logs__2021-08-23T05-56.json.txt

@matthyx (Contributor) commented Aug 24, 2021

So this is the log of bq load? Is it possible to download the JSON file from gcloud?

Sorry, I meant the build_<table>.json.gz that has the "invalid" format... I need to understand what's causing it (I assume it's big).

@matthyx (Contributor) commented Aug 26, 2021

@BenTheElder I think the issue comes from the JSON format (it needs to be newline-delimited): https://stackoverflow.com/questions/51300674/converting-json-into-newline-delimited-json-in-python/

Could we try to add jq to the command:

pypy3 make_json.py | jq -c '.[]' | pv | gzip > build_<table>.json.gz
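
For context, jq -c '.[]' unrolls a single top-level JSON array into one compact object per line (newline-delimited JSON), which is the format bq load expects. The same thing could be done in the Python step itself; a minimal sketch, not the real make_json.py (`rows` stands in for whatever iterable of build dicts it produces):

```python
# Sketch of emitting newline-delimited JSON directly, as an alternative to
# piping a JSON array through jq. `rows` is a stand-in for make_json.py's output.
import json
import sys


def write_ndjson(rows, out=sys.stdout):
    for row in rows:
        # one compact JSON object per line, no enclosing array brackets
        out.write(json.dumps(row, separators=(",", ":")) + "\n")


if __name__ == "__main__":
    write_ndjson([{"path": "gs://bucket/job/123", "passed": True}])  # toy example
```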

@matthyx (Contributor) commented Sep 6, 2021

... wait, looks like someone has fixed kettle?

Last modified Sep 4, 2021, 3:43:02 AM UTC+2

@jdnurme commented Sep 7, 2021

> ... wait, looks like someone has fixed kettle?
> Last modified Sep 4, 2021, 3:43:02 AM UTC+2

I reached out to @MushuEE and he expanded capacity on prod and implemented a CL I had staged with your recommended changes, @matthyx. We're currently seeing authorization errors, but the dashboard has populated up to 9/3, so hopefully the fix holds.

@MushuEE (Contributor) commented Sep 7, 2021

Just linked #23460. Because kettle was dead, I built this change from a dirty tree and deployed it to staging. It looked like there was no further breakage, so I pushed to prod. It seemed to make it past the GCS crawl fine but hit a permissions error in the PubSub step, likely due to IAM changes. I will create an issue for that soon.

@matthyx (Contributor) commented Sep 7, 2021

Awesome stuff, congrats to all!

Could you send me the logs to identify which lines are skipped because they are too long?
I will work on a follow-up PR to truncate some parts and still have the results indexed... but I will need some real-world inputs for my tests.

@BenTheElder (Member Author)

sigh: https://testgrid.k8s.io/sig-testing-misc#metrics-kettle&width=5

I'm still not going to be able to look further at this; I'm doing a quick issue sweep, but then I have meetings and some other work at $employer first.

It seems pretty clear that it's down again after a day of uptime.

I0910 16:23:43.852] ERROR: table k8s-gubernator:build.all is 158.7 hours old. Max allowed: 6 hours.

@matthyx (Contributor) commented Sep 10, 2021

Yes I think @MushuEE is looking at some permission problems.

@BenTheElder (Member Author)

xref: #23678

@cjwagner (Member)

I don't have full context on the issues with Kettle, but I'm switching the credentials to use Workload Identity rather than service account keys:

  • The kettle staging and prod deployments have been switched and seem to be working.
  • I've set up a KSA for the metrics-kettle job to use and confirmed that the KSA properly authenticates to the GSA that the job has been using. This PR switches the job to use WI: "Switch metrics-kettle job to use Workload Identity" (#23698). A quick way to verify the binding is sketched below.
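
A small sanity check for that kind of switch, assuming GKE Workload Identity and the standard metadata endpoint: from inside the pod, ask the metadata server which Google service account the pod's KSA is bound to. This is a generic sketch, not something taken from the kettle deployment:

```python
# Hedged verification sketch: under GKE Workload Identity, the metadata
# server reports the Google service account the pod actually authenticates as.
# Run from inside the kettle or metrics-kettle pod.
import urllib.request

METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/email"
)

req = urllib.request.Request(METADATA_URL, headers={"Metadata-Flavor": "Google"})
with urllib.request.urlopen(req, timeout=5) as resp:
    print("pod authenticates as:", resp.read().decode().strip())
```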

@cjwagner (Member)

Removing the service account env var broke the job because there were args for bootstrap.py specified in two places...
I migrated the job to pod utils and it's working again, now using WI: #23748

@BenTheElder (Member Author)

Thank you! I see https://go.k8s.io/triage is loading again with recent results and the kettle jobs in https://testgrid.k8s.io/sig-testing-misc#Summary&width=5 mostly look good.

BenTheElder added the lifecycle/active label on Sep 25, 2021
@matthyx (Contributor) commented Sep 25, 2021

Good work folks :-)
Lemme know if I can help somehow...

@BenTheElder (Member Author)

I think we can close this out now with nearly a week of uptime. Clearly Kettle still needs work to avoid day-long bring-ups, but it is back up and running 🎉

Thanks all!

@ehashman (Member)

thank you!!!
