kettle appears to be down #23135
for kettle staging:

from kettle:
Based on the logs, I'm also noting that we are processing builds from buckets we don't control / that are not part of the project, like
Is it possible that since test size has increased, we are no longer clearing out old data efficiently enough to make space for incoming test runs? Is there a manual step we can take to free up disk space and gain access to recent testing data?
The staging instance should not be affecting the main instance; they have their own disks. I don't see logs after
The pod was created on the 31st, FWIW.
I deleted the pod since it seemed to be stuck inexplicably; there is a new pod running. Will check back later.
Still seeing data locked at 7/28
It still appears to be inserting builds, per the pod logs.
I think it's going to take some time to catch up with the backlog.
Per @spiffxp this can take 8+ hours on restart 🔥 (#17069). I think this tooling pretty obviously needs investment, but the best I can offer just this moment is collecting that evidence and attempting to at least get it running again for now. Even once it is ready again, it will take at least another hour or two for triage (which runs as a periodic CI job) to process the new data.
We've hit the same problem and hung; I think this is the issue:
/subscribe
My team has an internal "summit" the next ~two days and I've needed to prepare for that, so I'm not sure if I'll make any progress on this before then at least (I previously took this as our oncall, but we don't formally support this and E_TOO_MANY_THINGS ...). We're probably going to need to patch the kettle source to drop/truncate excessively large failure messages (and ideally log which ones, so we can look into what is uploading > 100 MB failure logs for us to analyze ...), get that deployed, then wait a day or so for kettle to catch up, then another hour or two for triage to catch up from there ...
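For reference, a minimal sketch of the kind of truncation described above (the row layout with a `failures` list and `failure_text` field, and the 1 MiB cap, are assumptions for illustration, not kettle's actual schema):

```python
# Sketch: drop/truncate oversized failure messages before inserting a build row,
# and log which builds carried them. Field names and the size cap are assumptions.
MAX_FAILURE_TEXT = 1 << 20  # 1 MiB

def truncate_failures(row, max_len=MAX_FAILURE_TEXT):
    for failure in row.get('failures', []):
        text = failure.get('failure_text', '')
        if len(text) > max_len:
            print('truncating %d-byte failure text in %s'
                  % (len(text), row.get('path', '<unknown build>')))
            failure['failure_text'] = text[:max_len] + '...[truncated]'
    return row
```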
/cc
@BenTheElder do you need help on this?
Sorry, I have not been able to work on this or much else.
/assign
Sorry, I meant the
@BenTheElder I think the issue comes from the JSON format: https://stackoverflow.com/questions/51300674/converting-json-into-newline-delimited-json-in-python/ Could we try to add
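The exact snippet suggested there isn't captured above, but as a rough sketch of the newline-delimited JSON conversion the linked answer describes (the function and file names here are hypothetical, not kettle code):

```python
import json

# Sketch: BigQuery load jobs expect newline-delimited JSON (one object per line)
# rather than a single JSON array. Names here are illustrative only.
def to_ndjson(rows, path):
    with open(path, 'w') as out:
        for row in rows:
            out.write(json.dumps(row) + '\n')

# Hypothetical usage:
# to_ndjson([{'build': 1, 'result': 'SUCCESS'}], 'builds.ndjson')
```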
... wait, looks like someone has fixed kettle?
I reached out to @MushuEE; he expanded capacity on prod and implemented a CL I staged with your recommended changes, @matthyx. We're currently experiencing authorization errors, but the dashboard has populated up to 9/3, so hopefully the fix stays good.
Just linked #23460. Because kettle was dead, I built this change from a dirty tree and deployed it to staging. It looked like there was no further breakage, so I pushed to prod. It seemed to make it past the GCS crawl fine but hit a permissions error in the PubSub step, likely due to IAM changes. I will create an issue for that soon.
Awesome stuff, congrats to all! Could you send me the logs to identify which lines are skipped because of being too long?
sigh: https://testgrid.k8s.io/sig-testing-misc#metrics-kettle&width=5 I'm still not going to be able to look further at this; I'm doing a quick issue sweep, but then I have meetings and some other work at $employer first. It seems pretty clear that it's down again after a day of uptime.
Yes, I think @MushuEE is looking at some permission problems.
xref: #23678
I don't have full context on the issues with Kettle, but I'm switching the credentials to use Workload Identity rather than service account keys:
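As a quick sanity check for that switch, something like the following (a sketch assuming the pod uses google-auth's Application Default Credentials; this is not part of kettle itself) should obtain credentials from the metadata server without any key file:

```python
import os

import google.auth

# Sketch: with Workload Identity, Application Default Credentials resolve from
# the pod's metadata server, so no GOOGLE_APPLICATION_CREDENTIALS key file is
# needed. Illustrative check only, not kettle code.
def check_credentials():
    if os.environ.get('GOOGLE_APPLICATION_CREDENTIALS'):
        print('warning: still pointing at a service account key file')
    credentials, project = google.auth.default(
        scopes=['https://www.googleapis.com/auth/cloud-platform'])
    print('got credentials of type %s for project %s'
          % (type(credentials).__name__, project))

if __name__ == '__main__':
    check_credentials()
```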
Removing the service account env var broke the job because there were args for
Thank you! I see https://go.k8s.io/triage is loading again with recent results and the kettle jobs in https://testgrid.k8s.io/sig-testing-misc#Summary&width=5 mostly look good.
Good work folks :-)
I think we can close this out now with nearly a week of uptime. Clearly Kettle still needs work to avoid day-long bring-ups, but it is back up and running 🎉 Thanks all!
thank you!!!
What happened:
What you expected to happen:
kettle should be up
How to reproduce it (as minimally and precisely as possible):
....
Please provide links to example occurrences, if any:
see above
Anything else we need to know?:
Probably we should have alerting for the kettle-metrics job.