
kettle appears to be down #23135

Closed
BenTheElder opened this issue Aug 4, 2021 · 41 comments · Fixed by #23323
Labels
kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/active: Indicates that an issue or PR is actively being worked on by a contributor.
sig/testing: Categorizes an issue or PR as relevant to SIG Testing.

Comments

@BenTheElder (Member)

What happened:

What you expected to happen:

kettle should be up

How to reproduce it (as minimally and precisely as possible):

....

Please provide links to example occurrences, if any:

see above

Anything else we need to know?:

We should probably have alerting for the kettle-metrics job.
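
One possible shape for that alerting (a sketch, not anything that exists in test-infra today): periodically check how stale the k8s-gubernator:build.all table is, mirroring the 6-hour threshold the metrics-kettle job reports later in this thread. It assumes the google-cloud-bigquery client library and application-default credentials with read access to the table:

```python
# Hedged sketch of a staleness alert for the table kettle feeds.
# Assumptions: google-cloud-bigquery is installed, ADC can read the table,
# and something (a cron, a Prow periodic, etc.) runs this and acts on a
# non-zero exit code. Not part of the existing kettle-metrics job.
import sys
from datetime import datetime, timezone

from google.cloud import bigquery

TABLE = "k8s-gubernator.build.all"  # same table the metrics-kettle job checks
MAX_AGE_HOURS = 6                   # same threshold as the job's error message


def main() -> int:
    client = bigquery.Client()
    table = client.get_table(TABLE)  # table metadata, including last-modified time
    age_hours = (datetime.now(timezone.utc) - table.modified).total_seconds() / 3600
    if age_hours > MAX_AGE_HOURS:
        print(f"ALERT: {TABLE} is {age_hours:.1f} hours old (max allowed: {MAX_AGE_HOURS})")
        return 1
    print(f"OK: {TABLE} is {age_hours:.1f} hours old")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```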

@BenTheElder (Member Author)

for kettle staging:

2021-08-04 15:06:42.135 PDT
Traceback (most recent call last):
  File "make_db.py", line 376, in <module>
    OPTIONS.buildlimit,
  File "make_db.py", line 330, in main
    threads, client_class, build_limit)
  File "make_db.py", line 269, in get_all_builds
    db.insert_build(build_dir, started, finished)
  File "/kettle/model.py", line 88, in insert_build
    self.db.execute('insert into build_junit_missing values(?)', (rowid,))
sqlite3.OperationalError: database or disk is full
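
For anyone digging into the staging disk-full error, a quick way to see how big the sqlite database has grown and how much of it is dead space; a minimal sketch assuming shell access to the pod, with a hypothetical DB path (I don't know exactly where make_db.py writes it):

```python
# Hypothetical diagnostic for the "database or disk is full" error above.
# DB_PATH is a guess; point it at wherever kettle's make_db.py keeps its sqlite DB.
import os
import sqlite3

DB_PATH = "/data/build.db"  # hypothetical location

conn = sqlite3.connect(DB_PATH)
page_size = conn.execute("PRAGMA page_size").fetchone()[0]
page_count = conn.execute("PRAGMA page_count").fetchone()[0]
free_pages = conn.execute("PRAGMA freelist_count").fetchone()[0]
conn.close()

print(f"file size on disk:        {os.path.getsize(DB_PATH) / 1e9:.2f} GB")
print(f"logical size:             {page_size * page_count / 1e9:.2f} GB")
print(f"reclaimable (free pages): {page_size * free_pages / 1e9:.2f} GB")
# A large freelist means VACUUM would shrink the file, but VACUUM itself needs
# roughly the database's size in temporary space, which a full disk won't have;
# growing the disk is the simpler fix.
```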

@BenTheElder (Member Author)

from kettle:

2021-07-31 16:45:21.465 PDT b'Error while reading data, error message: JSON parsing error in row starting at position 1103117119: Parser terminated before end of string'

@BenTheElder (Member Author)

Based on the logs, also noting that we are processing builds from buckets we don't control / that are not part of the project, like gs://istio-circleci and gs://k8s-conformance-gardener. Not sure we should be doing that.

BenTheElder added the sig/testing label on Aug 4, 2021
@jdnurme commented Aug 4, 2021

Is it possible that since test size has increased, we are no longer clearing out old data efficiently enough to make space for incoming test runs? Is there a manual step we can take to free up disk space and gain access to recent testing data?

@BenTheElder (Member Author) commented Aug 4, 2021

The staging instance should not be affecting the main instance, they have their own disks.
The main instance has no errors about disk space and is on a disk an order of magnitude larger.

I don't see logs after 2021-07-31 16:45:21.465 PDT however, which seems odd.

@BenTheElder (Member Author)

The pod was created on the 31st, FWIW.

Revision Name Status Restarts Created on
68 kettle-67d95d4546-9l64g Running 0 Jul 31, 2021, 11:40:44 AM

@BenTheElder (Member Author)

I deleted the pod since it seemed to be stuck inexplicably; there is a new pod running. Will check back later.

@jdnurme commented Aug 5, 2021

> I deleted the pod since it seemed to be stuck inexplicably; there is a new pod running. Will check back later.

Still seeing data locked at 7/28

@BenTheElder (Member Author) commented Aug 5, 2021

It still appears to be inserting builds, per the pod logs.

2021-08-05 14:36:30.078 PDT inserting build: gs://istio-circleci/e2e-pilot-auth-v1alpha3-v2/444884
2021-08-05 14:36:30.078 PDT inserting build: gs://istio-circleci/e2e-pilot-auth-v1alpha3-v2/444883
2021-08-05 14:36:30.078 PDT inserting build: gs://istio-circleci/e2e-pilot-auth-v1alpha3-v2/444882

@BenTheElder (Member Author)

I think it's going to take some time to catch up with the backlog.

@BenTheElder (Member Author)

per @spiffxp this can take 8+ hours on restart 🔥 #17069

I think this tooling pretty obviously needs investment, but the best I can offer at this moment is collecting that evidence and attempting to at least get it running again for now.

Even once it is ready again, it will take at least another hour or two for triage (which runs as a periodic CI job) to process the new data.

@BenTheElder (Member Author)

We've hit the same problem and hung again; I think this is the issue:

2021-08-05 16:45:24.551 PDT Warnings encountered during job execution:
2021-08-05 16:45:24.551 PDT
2021-08-05 16:45:24.551 PDT b'Error while reading data, error message: JSON parsing error in row starting at position 393911182: Row size is larger than: 104857600.'
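
104857600 bytes is BigQuery's 100 MB limit on a single row in a JSON load, so some build row is carrying an enormous payload (almost certainly failure text, per the discussion below). A hedged way to find the offenders in the upload file, assuming it is (or has been converted to) newline-delimited JSON; the filename is the placeholder used elsewhere in this thread:

```python
# Rough diagnostic, not part of kettle: scan the gzipped newline-delimited
# JSON that gets handed to `bq load` and flag rows that exceed the 100 MB
# row limit or fail to parse (as in the earlier "Parser terminated before
# end of string" error).
import gzip
import json
import sys

ROW_LIMIT = 104857600  # the limit quoted in the bq error, i.e. 100 MiB


def scan(path: str) -> None:
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            size = len(line.encode("utf-8"))
            if size > ROW_LIMIT:
                print(f"row {lineno}: {size} bytes, over the 100 MiB limit")
            try:
                json.loads(line)
            except json.JSONDecodeError as err:
                print(f"row {lineno}: does not parse: {err}")


if __name__ == "__main__":
    # "build_<table>.json.gz" is the placeholder name used in this thread.
    scan(sys.argv[1] if len(sys.argv) > 1 else "build_<table>.json.gz")
```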

BenTheElder self-assigned this on Aug 6, 2021
@liggitt (Member) commented Aug 10, 2021

/subscribe

@BenTheElder (Member Author) commented Aug 11, 2021

My team has an internal "summit" for the next ~two days and I've needed to prepare for that, so I'm not sure I'll make any progress on this before then (I previously took this on as our oncall, but we don't formally support this, and E_TOO_MANY_THINGS ...).

We're probably going to need to patch the kettle source to drop/truncate excessively large failure messages (and ideally log which ones, so we can look into what is uploading > 100 MB failure logs), get that deployed, then wait a day or so for kettle to catch up, and then another hour or two for triage to catch up from there ...
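
To make that concrete, here is a minimal sketch of the kind of patch meant, not the actual kettle change: cap oversized failure text before the row is serialized and log which build it came from. The field names ("test", "failure_text", "path", "name") are assumptions about the per-build row shape, not verified against kettle's schema:

```python
# Sketch only: cap failure text before a build row is serialized for upload,
# and log the offending test/build. The dict layout here (row["test"],
# test["failure_text"], row["path"]) is assumed, not taken from kettle's code.
import logging

MAX_FAILURE_BYTES = 256 * 1024  # arbitrary cap, far below BigQuery's 100 MB row limit


def truncate_failures(row: dict) -> dict:
    for test in row.get("test", []):
        failure = test.get("failure_text")
        if not failure:
            continue
        raw = failure.encode("utf-8")
        if len(raw) > MAX_FAILURE_BYTES:
            logging.warning(
                "truncating failure_text for %s in %s (%d bytes)",
                test.get("name"), row.get("path"), len(raw),
            )
            test["failure_text"] = (
                raw[:MAX_FAILURE_BYTES].decode("utf-8", "ignore")
                + "\n[truncated by kettle]"
            )
    return row
```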

@aojea (Member) commented Aug 17, 2021

/cc

@matthyx (Contributor) commented Aug 18, 2021

@BenTheElder do you need help on this?

@BenTheElder (Member Author)

Sorry, I have not been able to work on this or much else
/unassign

@matthyx (Contributor) commented Aug 20, 2021

/assign

@BenTheElder (Member Author)

null__logs__2021-08-23T05-56.json.txt

@matthyx (Contributor) commented Aug 24, 2021

So this is the log of bq load? Is it possible to download the JSON file from gcloud?

Sorry, I meant the build_<table>.json.gz that has the "invalid" format... I need to understand what's causing it (I assume it's big).

@matthyx (Contributor) commented Aug 26, 2021

@BenTheElder I think the issue comes from the JSON format (it needs to be newline-delimited): https://stackoverflow.com/questions/51300674/converting-json-into-newline-delimited-json-in-python/

Could we try to add jq to the command:

pypy3 make_json.py | jq -c '.[]' | pv | gzip > build_<table>.json.gz
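
For context, jq -c '.[]' unrolls a single top-level JSON array into one compact object per line (newline-delimited JSON), which is the format bq load expects. The same thing could be done in the Python step itself; a minimal sketch, not the real make_json.py (`rows` stands in for whatever iterable of build dicts it produces):

```python
# Sketch of emitting newline-delimited JSON directly, as an alternative to
# piping a JSON array through jq. `rows` is a stand-in for make_json.py's output.
import json
import sys


def write_ndjson(rows, out=sys.stdout):
    for row in rows:
        # one compact JSON object per line, no enclosing array brackets
        out.write(json.dumps(row, separators=(",", ":")) + "\n")


if __name__ == "__main__":
    write_ndjson([{"path": "gs://bucket/job/123", "passed": True}])  # toy example
```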

@matthyx (Contributor) commented Sep 6, 2021

... wait, looks like someone has fixed kettle?

Last modified Sep 4, 2021, 3:43:02 AM UTC+2

@jdnurme commented Sep 7, 2021

> ... wait, looks like someone has fixed kettle?
> Last modified Sep 4, 2021, 3:43:02 AM UTC+2

I reached out to @MushuEE and he expanded capacity on prod and implemented a CL I had staged with your recommended changes, @matthyx. We're currently seeing authorization errors, but the dashboard has populated up to 9/3, so hopefully the fix holds.

@MushuEE (Contributor) commented Sep 7, 2021

Just linked #23460. Because kettle was dead, I built this change from a dirty tree and deployed it to staging. It looked like there was no further breakage, so I pushed to prod. It seemed to make it past the GCS crawl fine but hit a permissions error in the PubSub step, likely due to IAM changes. I will create an issue for that soon.

@matthyx (Contributor) commented Sep 7, 2021

Awesome stuff, congrats to all!

Could you send me the logs to identify which lines are skipped because they are too long?
I will work on a follow-up PR to truncate some parts and still have the results indexed... but I will need some real-world inputs for my tests.

@BenTheElder (Member Author)

sigh: https://testgrid.k8s.io/sig-testing-misc#metrics-kettle&width=5

I'm still not going to be able to look further at this; I'm doing a quick issue sweep, but then I have meetings and some other work at $employer first.

It seems pretty clear that it's down again after a day of uptime.

I0910 16:23:43.852] ERROR: table k8s-gubernator:build.all is 158.7 hours old. Max allowed: 6 hours.

@matthyx (Contributor) commented Sep 10, 2021

Yes I think @MushuEE is looking at some permission problems.

@BenTheElder (Member Author)

xref: #23678

@cjwagner (Member)

I don't have full context on the issues with Kettle, but I'm switching the credentials to use Workload Identity rather than service account keys:

  • The kettle staging and prod deployments have been switched and seem to be working.
  • I've set up a KSA for the metrics-kettle job to use and confirmed that the KSA properly authenticates to the GSA that the job has been using. This PR switches the job to use WI: "Switch metrics-kettle job to use Workload Identity" (#23698). A quick way to verify the binding is sketched below.
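
A small sanity check for that kind of switch, assuming GKE Workload Identity and the standard metadata endpoint: from inside the pod, ask the metadata server which Google service account the pod's KSA is bound to. This is a generic sketch, not something taken from the kettle deployment:

```python
# Hedged verification sketch: under GKE Workload Identity, the metadata
# server reports the Google service account the pod actually authenticates as.
# Run from inside the kettle or metrics-kettle pod.
import urllib.request

METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/email"
)

req = urllib.request.Request(METADATA_URL, headers={"Metadata-Flavor": "Google"})
with urllib.request.urlopen(req, timeout=5) as resp:
    print("pod authenticates as:", resp.read().decode().strip())
```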

@cjwagner (Member)

Removing the service account env var broke the job because there were args for bootstrap.py specified in two places...
I migrated the job to pod utils and it's working again, now using WI: #23748

@BenTheElder (Member Author)

Thank you! I see https://go.k8s.io/triage is loading again with recent results and the kettle jobs in https://testgrid.k8s.io/sig-testing-misc#Summary&width=5 mostly look good.

BenTheElder added the lifecycle/active label on Sep 25, 2021
@matthyx (Contributor) commented Sep 25, 2021

Good work folks :-)
Lemme know if I can help somehow...

@BenTheElder (Member Author)

I think we can close this out now with nearly a week of uptime. Clearly Kettle still needs work to avoid day-long bring-ups, but it is back up and running 🎉

Thanks all!

@ehashman (Member)

thank you!!!
