Raise k8s-infra-prow-build cluster nodepool max size #1231

Closed
spiffxp opened this issue Sep 10, 2020 · 7 comments
Assignees
Labels
area/prow Setting up or working with prow in general, prow.k8s.io, prow build clusters sig/testing Categorizes an issue or PR as relevant to SIG Testing.

Comments

@spiffxp
Member

spiffxp commented Sep 10, 2020

Well, it took about two weeks from the last "let's wait and see" comment (ref: #1132 (comment))

This is a regional cluster spread across 3 zones, with a nodepool that will autoscale up to 30 nodes per zone. So, 90 nodes total. We've hit that limit a few times now.

Metrics explorer: VM instance uptime (count)

Over the last month
Screen Shot 2020-09-09 at 6 31 11 PM

Let's zoom in on some of those peaks
Screen Shot 2020-09-09 at 6 33 29 PM

Enhance
Screen Shot 2020-09-09 at 6 34 32 PM

I'm not sure whether there is an alert or a log line I could search for to show exactly how often this occurs, but it is happening. I think this results in more jobs hitting the "error" state, since they can't find a node to schedule on.

The plank graph for the past two weeks shows a few increases in jobs hitting the error state (noon Thursday is pretty prominent), but nothing catastrophic. Again though, this is based on CRs, not discrete events in time, so it's unclear to me how many prowjob CRs are being added or removed at any given time.
Screen Shot 2020-09-09 at 6 43 24 PM

@spiffxp
Member Author

spiffxp commented Sep 10, 2020

Given that the build cluster we are supposed to be replacing is 160 nodes (plus whatever capacity RBE was offering), and we still have some critical kubernetes/kubernetes jobs to move over, I think we should raise the max nodepool size from 90 (3x30) to 150 (3x50).
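To make the sizing explicit, here is a minimal sketch of the arithmetic behind the proposal. The function and variable names are illustrative only and do not come from any k8s-infra tooling; the numbers are the ones quoted in this thread.

```python
ZONES = 3  # the regional cluster is spread across 3 zones


def pool_max_nodes(max_per_zone: int, zones: int = ZONES) -> int:
    """Total autoscaling ceiling for a regional nodepool (hypothetical helper)."""
    return zones * max_per_zone


current = pool_max_nodes(30)   # today's per-zone limit
proposed = pool_max_nodes(50)  # proposed per-zone limit
old_cluster = 160              # node count of the cluster being replaced

print(f"current max:  {current} nodes")
print(f"proposed max: {proposed} nodes")
print(f"still short of the old cluster by {old_cluster - proposed} nodes")
```

Even at the proposed ceiling the new pool stays slightly below the old cluster's node count, and that comparison ignores whatever capacity RBE was providing.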

@spiffxp
Member Author

spiffxp commented Sep 10, 2020

/assign

Part of kubernetes/test-infra#18550. We need more capacity to feel confident we have room for the rest of the jobs being migrated over.

@spiffxp
Member Author

spiffxp commented Sep 10, 2020

/area prow
/sig testing
/wg k8s-infra

@k8s-ci-robot k8s-ci-robot added area/prow Setting up or working with prow in general, prow.k8s.io, prow build clusters sig/testing Categorizes an issue or PR as relevant to SIG Testing. wg/k8s-infra labels Sep 10, 2020
@spiffxp
Member Author

spiffxp commented Sep 10, 2020

Other quotas may need to be bumped to accommodate this:

  • 3 * 50 * 8 = 1200 CPUs
  • 3 * 50 * 500 = 75000 Gi SSD capacity
  • 3 * 50 * 1 = 150 in-use IP addresses

Per https://console.cloud.google.com/iam-admin/quotas?project=k8s-infra-prow-build quotas for us-central1 are now at:

  • 1440 CPUs
  • 81920 GB SSD
  • 150 in-use IP addresses
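The comparison between what a 150-node pool needs and what the quotas allowed at that point can be sketched as follows. This is only an illustration: the per-node figures (8 CPUs, 500 of SSD, 1 in-use IP) are read off the multiplications above, the SSD figure is treated as GB to match the quota page (an assumption), and the dictionary names are made up for this example.

```python
ZONES, NODES_PER_ZONE = 3, 50

# Per-node resource usage implied by the arithmetic in this comment.
PER_NODE = {"cpus": 8, "ssd_gb": 500, "in_use_ips": 1}

# us-central1 quotas quoted from the console at the time of this comment.
QUOTA = {"cpus": 1440, "ssd_gb": 81920, "in_use_ips": 150}

nodes = ZONES * NODES_PER_ZONE
required = {name: nodes * amount for name, amount in PER_NODE.items()}
headroom = {name: QUOTA[name] - required[name] for name in QUOTA}

for name in QUOTA:
    print(f"{name:>10}: need {required[name]:>6}, quota {QUOTA[name]:>6}, "
          f"headroom {headroom[name]:>5}")
```

CPU and SSD fit with some room to spare, while in-use IP addresses have no headroom at all once the pool is fully scaled up.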

The IP quota could stand to be raised, since 150 nodes would consume all 150 in-use addresses. The others we may want to raise if we want to try more CPU or more SSD for increased IOPS:

  • 3 * 50 * 16 = 2400 CPUs for n1-highmem-16's
  • 3 * 50 * 834 = 125100 GB SSD
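As a rough what-if, the larger node shape can be checked against the quotas quoted earlier in this comment (1440 CPUs, 81920 GB SSD). The names below are illustrative, and the 834 GB per-node disk size is taken from the bullet above rather than from any cluster config.

```python
NODES = 3 * 50  # 3 zones x 50 nodes per zone

# Hypothetical larger node shape: n1-highmem-16 (16 vCPUs) with an 834 GB SSD.
BIGGER_NODE = {"cpus": 16, "ssd_gb": 834}

# Quotas as quoted before any increase was requested.
QUOTA_BEFORE = {"cpus": 1440, "ssd_gb": 81920}

required = {name: NODES * amount for name, amount in BIGGER_NODE.items()}
shortfall = {name: max(0, required[name] - QUOTA_BEFORE[name]) for name in required}

for name in required:
    print(f"{name}: need {required[name]}, short by {shortfall[name]}")
```

Both resources would exceed the earlier quotas, which is why trying the bigger shape depends on a quota increase.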

@spiffxp
Member Author

spiffxp commented Sep 10, 2020

Quotas for us-central1 are now:

  • 2,500 CPUs
  • 130,000 GB SSD
  • 160 IPs
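One way to read these numbers is to ask how many nodes each quota would allow for a given node shape. The sketch below does that for the two shapes discussed in this thread; `max_nodes` and the shape dictionaries are hypothetical names, and the SSD sizes are assumed to be in GB.

```python
# Raised us-central1 quotas from this comment.
QUOTA = {"cpus": 2500, "ssd_gb": 130000, "in_use_ips": 160}

# Node shapes discussed in the thread (per-node usage).
SHAPES = {
    "current (8 CPU, 500 GB)": {"cpus": 8, "ssd_gb": 500, "in_use_ips": 1},
    "n1-highmem-16 (16 CPU, 834 GB)": {"cpus": 16, "ssd_gb": 834, "in_use_ips": 1},
}


def max_nodes(shape: dict, quota: dict = QUOTA) -> int:
    """Largest node count that fits every quota at once (hypothetical helper)."""
    return min(quota[name] // amount for name, amount in shape.items())


limits = {label: max_nodes(shape) for label, shape in SHAPES.items()}
for label, limit in limits.items():
    print(f"{label}: quota allows up to {limit} nodes")
```

Under these assumptions both shapes fit a 150-node pool, with IP addresses as the binding limit for the current shape and SSD for the larger one.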

@spiffxp
Member Author

spiffxp commented Sep 10, 2020

/close
Calling this done

@k8s-ci-robot
Contributor

@spiffxp: Closing this issue.

