Raise k8s-infra-prow-build quotas in anticipation of handling merge-blocking jobs #1132
Submitted requests for:
Well, the 1024 CPU request went through just fine. The 100 in-use IPs...
So, I'll hold this open and see what comes back in two days. Quota is 69 in-use IP addresses until then.
Part of kubernetes/test-infra#18550
/priority critical-urgent
I have repeatedly tried to file for 100 in-use IPs and been rejected every time. We bumped into IP quota yesterday when autoscaling to handle PR traffic. I'm escalating because, in the grand scheme of things, our PR load looks pretty low, and I anticipate we will bump into this more once we see real traffic (opening up for v1.20). There are some things we can do to work around or address this:
It would be really nice to be able to just raise our quota
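As a point of reference, current usage against the quotas discussed here can be checked from the command line. A minimal sketch, assuming gcloud access to the k8s-infra-prow-build project and jq installed:

```sh
# Show usage vs. limit for the CPU, SSD, and in-use address quotas
# in us-central1 for the build cluster's project.
gcloud compute regions describe us-central1 \
  --project=k8s-infra-prow-build \
  --format=json \
  | jq '.quotas[] | select(.metric == "CPUS"
                        or .metric == "SSD_TOTAL_GB"
                        or .metric == "IN_USE_ADDRESSES")'
```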
I will see what else I can learn internally, but as to the mitigations, I think some should be considered:
- 16 core is a good sweet spot; I think we should try it.
- We should try this (slowly) - we don't want to be wasteful.
- I don't see why we would not do this anyway, just for sanity in case of failure.
- How would this affect the quota?
- I think this is the real solution. I don't think we really need each node to have an IP anyway? (see the sketch after this list)
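For the "nodes don't each need an IP" option, the usual GKE pattern is a private cluster plus Cloud NAT: nodes get no external IPs, and only the NAT's small address pool counts against the in-use address quota. A rough sketch, not the actual plan; the cluster, router, and NAT names and the master CIDR below are made up for illustration, and the network is assumed to be `default`:

```sh
# Private nodes: no per-node external IPs, so nodes stop consuming
# the IN_USE_ADDRESSES quota.
gcloud container clusters create prow-build-private \
  --project=k8s-infra-prow-build \
  --region=us-central1 \
  --enable-ip-alias \
  --enable-private-nodes \
  --master-ipv4-cidr=172.16.0.32/28 \
  --machine-type=n1-highmem-8 \
  --num-nodes=6

# Outbound internet access for the private nodes goes through Cloud NAT,
# which draws from a small shared pool of external IPs.
gcloud compute routers create prow-nat-router \
  --project=k8s-infra-prow-build \
  --region=us-central1 \
  --network=default
gcloud compute routers nats create prow-nat \
  --project=k8s-infra-prow-build \
  --region=us-central1 \
  --router=prow-nat-router \
  --auto-allocate-nat-external-ips \
  --nat-all-subnet-ip-ranges
```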
Submitted request for 40960GB SSD in us-central1 (quota claims we were hitting our 20480GB limit), which was approved.
I'll see if I can set up a pool2 node pool on the existing cluster and shift things over during some quiet time.
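Roughly what adding a pool2 and shifting work over looks like; this is a sketch only, assuming the cluster is named `prow-build`, the existing pool is `pool1`, and using the 16-core machine type discussed above (the autoscaling bounds and disk settings are illustrative, not the real values):

```sh
# Add a second node pool with larger machines alongside the existing pool1.
gcloud container node-pools create pool2 \
  --project=k8s-infra-prow-build \
  --cluster=prow-build \
  --region=us-central1 \
  --machine-type=n1-highmem-16 \
  --enable-autoscaling --min-nodes=3 --max-nodes=15 \
  --disk-type=pd-ssd --disk-size=250

# During a quiet period, cordon and drain the pool1 nodes so pods
# reschedule onto pool2 and the autoscaler shrinks pool1.
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=pool1 \
    -o jsonpath='{.items[*].metadata.name}'); do
  kubectl cordon "$node"
  kubectl drain "$node" --ignore-daemonsets --delete-local-data
done
```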
Tried asking for 100 IPs in us-west1 and us-east1; both rejected.
I agree. I just anticipate it could bump into the most unknowns along the way, and my bandwidth is currently limited.
@spiffxp do we need to move some jobs out of that cluster while waiting?
Maybe we can't get 100 IPs in each, but could we spread the load between regions, so we get 50 in each?
This is basically the "set up more small build clusters" option, since each region would need its own regional build cluster anyway. It avoids setting up a new GCP project for each cluster, though. I'll look into it; we might be able to split up jobs in a way that makes sense.
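On the GCP side, the split would look something like the sketch below; the cluster name and sizes are placeholders. The prow side additionally needs the new cluster's credentials registered with prow and the moved jobs pointed at it in their job config, which isn't shown here:

```sh
# A second, smaller build cluster in another region, so quota demand is
# spread out instead of concentrated in us-central1.
gcloud container clusters create prow-build-east \
  --project=k8s-infra-prow-build \
  --region=us-east1 \
  --machine-type=n1-highmem-8 \
  --enable-autoscaling --min-nodes=2 --max-nodes=10 \
  --num-nodes=2

# Fetch credentials so the new cluster can be registered as an
# additional prow build cluster.
gcloud container clusters get-credentials prow-build-east \
  --project=k8s-infra-prow-build \
  --region=us-east1
```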
@ZhiFeng1993
I would like to hold off on moving things away from community-accessible infra for now. Flipping back to k8s-prow-builds is a pretty quick change if we decide we have to move quickly and/or are out of options.
I tried raising CPU and SSD quota in us-west1 to be able to create an equivalently sized build cluster in the same k8s-infra-prow-build project over there. Both requests were automatically rejected.
There is suspicion that moving to n1-highmem-16s has actually increased flakiness, specifically for these jobs, across release branches:
I have opened #1172 to start rolling back.
Opened #1173 to track the rollback.
I was able to raise CPU quota in us-east1 to 1024, but was rejected for SSD and IP quota requests. Next step would be to try raising quotas for a different GCP project, in case k8s-infra-prow-build has gotten flagged for some reason.
OK, quota changes came through (thank you @thockin). I'm feeling better about our immediate capacity requirements being met in us-central1:
So now we'll at least be able to bump into our autoscaling limits.
/remove-priority critical-urgent
I have broken this out into its own issue: #1178
Quotas for us-central1 are now at:
Based on how things have been behaving today with v1.20 merges, I'm comfortable calling this done. We can open further issues as our needs evolve.
/close
@spiffxp: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
The node pool is currently set up as 3 * (6 to 30) n1-highmem-8s. We don't have enough quota to hit max node pool size.
In terms of resources, we need at least:
- 3 * 30 * 8 = 720 CPUs
- 3 * 30 * 250 = 22500 Gi SSD capacity
- 3 * 30 = 90 in-use IP addresses
If we want to match the size of the k8s-prow-builds cluster, which has 160 nodes, we should ask for more.
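The same arithmetic as a quick shell check, using the per-node figures from the list above (8 CPUs, 250 SSD, and one external IP per n1-highmem-8 node, across 3 zones of up to 30 nodes each):

```sh
# 3 zones * up to 30 nodes per zone, n1-highmem-8 machines.
echo "CPUs:       $((3 * 30 * 8))"    # 720
echo "SSD (Gi):   $((3 * 30 * 250))"  # 22500
echo "In-use IPs: $((3 * 30))"        # 90
```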
/wg k8s-infra
/area prow