Raise k8s-infra-prow-build cluster nodepool max size #1231
Given that the build cluster we are supposed to be replacing is 160 nodes (plus whatever capacity RBE was offering), and we still have some critical kubernetes/kubernetes jobs to move over, I think we should raise the nodepool max size from 90 (3x30) to 150 (3x50).
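For reference, a minimal sketch of how such a bump could be applied with gcloud, assuming the pool is updated in place (the cluster and pool names below, and whether this change actually goes through gcloud or through this repo's config tooling, are assumptions):

```sh
# Raise the autoscaler ceiling on the build cluster's node pool.
# On a regional cluster, --max-nodes is per zone, so 50 per zone
# across 3 zones gives the 150-node total discussed above.
# Cluster and pool names here are hypothetical.
gcloud container clusters update prow-build \
  --project=k8s-infra-prow-build \
  --region=us-central1 \
  --node-pool=pool-default \
  --enable-autoscaling \
  --min-nodes=1 \
  --max-nodes=50
```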
/assign

Part of kubernetes/test-infra#18550; we need more capacity to feel confident we've got room for the rest of the jobs being migrated over.
/area prow
Other quotas may need to be bumped to accommodate this.

Per https://console.cloud.google.com/iam-admin/quotas?project=k8s-infra-prow-build, quotas for us-central1 are now at: (screenshot of current quota values)

The IPs could stand to be raised. The others we may want to raise if we want to try more CPU, or more SSD for increased IOPS.
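As a sketch, the same numbers the console page shows can also be pulled from the CLI; the metric names below are the standard Compute Engine regional quota metrics:

```sh
# Show usage vs. limit for the quotas most relevant to this bump:
# CPUs, in-use external IPs, and SSD persistent disk capacity.
gcloud compute regions describe us-central1 \
  --project=k8s-infra-prow-build --format=json |
  jq -r '.quotas[]
    | select(.metric == "CPUS"
          or .metric == "IN_USE_ADDRESSES"
          or .metric == "SSD_TOTAL_GB")
    | "\(.metric): \(.usage) / \(.limit)"'
```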
Quotas for us-central1 are now: (screenshot of updated quota values)
/close
@spiffxp: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Well, it took about two weeks from the last "let's wait and see" comment (ref: #1132 (comment))
This is a regional cluster spread across 3 zones, with a nodepool that will autoscale up to 30 nodes per zone. So, 90 nodes total. We've hit that limit a few times now.
(Metrics explorer screenshot: VM instance uptime (count) over the last month, showing the node count repeatedly hitting the 90-node ceiling)

Let's zoom in on some of those peaks:

(zoomed screenshots of the same metric around those peaks)
I'm not sure if there is an alert or a log line I could search for to show exactly how often this occurs, but it's happening. I think this results in more jobs hitting "error" state, since their pods can't find anywhere to schedule.
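One after-the-fact way to look (a sketch, assuming kubectl access to the build cluster) is to search for the events the scheduler and cluster autoscaler emit when pods can't be placed; note that events age out quickly, which is part of why a proper alert would be nicer:

```sh
# Pods the scheduler could not place anywhere...
kubectl get events --all-namespaces \
  --field-selector reason=FailedScheduling

# ...and scale-ups the cluster autoscaler declined (e.g. at max size).
kubectl get events --all-namespaces \
  --field-selector reason=NotTriggerScaleUp
```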
The plank graph for the past two weeks shows a few increases in jobs hitting error state (noon Thursday is pretty prominent), but nothing catastrophic. Again, though, this is based on CRs, not discrete events in time, so it's unclear to me how many ProwJob CRs are being added/removed at any given time.
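For a point-in-time view to complement the graph, counting ProwJob CRs by state directly is one option (a sketch, run against whichever cluster holds the ProwJob CRs):

```sh
# Tally ProwJob custom resources by their current state
# (triggered, pending, success, failure, error, aborted).
kubectl get prowjobs --all-namespaces -o json |
  jq -r '.items[].status.state' | sort | uniq -c
```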