Migrate k8s-infra-prow-build to a nodepool with more IOPS #1187
/area prow
Opened #1186 to start migrating to the first option (14% more cost for ~100% more IOPS).
New nodepool is up, old nodepool cordoned
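For reference, cordoning the old pool just marks each of its nodes unschedulable. A minimal sketch of the equivalent using the Kubernetes Python client (the node pool label value here is an assumption; `kubectl cordon` on each node does the same thing):

```python
from kubernetes import client, config

# Load kubeconfig for the build cluster (assumes credentials are already set up)
config.load_kube_config()
v1 = client.CoreV1Api()

# GKE labels each node with its node pool; the pool name here is illustrative
old_pool_selector = "cloud.google.com/gke-nodepool=pool3"

for node in v1.list_node(label_selector=old_pool_selector).items:
    # Setting spec.unschedulable is what `kubectl cordon <node>` does under the hood
    v1.patch_node(node.metadata.name, {"spec": {"unschedulable": True}})
    print(f"cordoned {node.metadata.name}")
```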
And I forgot to disable autoscaling for pool3 until just now
Deleted boskos
Waiting on the following to finish up
Removed old nodepool with #1188
Holding this open to see what impact, if any, this has on the graphs shown in the description.
No real change in the graphs, other than a reflection of PR traffic. This certainly didn't make things worse and isn't urgently more expensive, so I'm not inclined to roll back at the moment.

Supposedly a PER_GB throttle reason means the fix is to increase the persistent disk size (ref: https://cloud.google.com/compute/docs/disks/review-disk-metrics#throttling_metrics). One option would be to increase disk size to the next "tier" and see what happens. But I think I'd like to do a little more reading and focused testing to understand what's going on, and what options we have.
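For anyone who wants to poke at the same data outside the console, here's a minimal sketch using the Cloud Monitoring Python client to pull the throttled-write-bytes metric from that doc. The metric type is taken from the doc above; my recollection is that a `throttle_reason` label (e.g. PER_GB vs PER_VM) distinguishes why a disk was throttled, but treat that as an assumption rather than a verified schema:

```python
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/k8s-infra-prow-build"

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
)

# Throttled write bytes over the last hour, per instance/device
results = client.list_time_series(
    request={
        "name": project_name,
        "filter": 'metric.type = "compute.googleapis.com/instance/disk/throttled_write_bytes_count"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    # metric.labels should include the throttle reason (assumption, see above)
    print(series.resource.labels["instance_id"], dict(series.metric.labels))
```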
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle rotten

It's possible (under pre-GA terms) to create node pools with Local SSD as of 1.18. We'd need to upgrade the cluster to that version first.
I'm really interested in seeing this happen, but I can't guarantee I'll have bandwidth this cycle, so I'm leaving this out of the milestone. Gated on migrating the cluster to 1.18.
/priority backlog
Provisioning a nodepool with local SSDs. I'll try cutting over some canary presubmits to see what the behavior is.
kubernetes/test-infra#23783 will cut over:
Which are all manually triggered.
After an initial round of canary jobs against a single PR, I have kicked off the canary jobs against a handful of arbitrary kubernetes/kubernetes PRs to trigger autoscaling and evaluate node disk usage under some level of concurrency / load.

https://console.cloud.google.com/monitoring/dashboards/builder/f0163540-a8b7-4618-8308-66652d3d4794?project=k8s-infra-prow-build&dashboardBuilderState=%257B%2522editModeEnabled%2522:false%257D&timeDomain=1h is the dashboard I'm using to watch the pot boil. The old pool is on the left (pool4), the new pool is on the right (pool5).

By default, Google Cloud Monitoring doesn't appear to let me manually set the Y-axis scales, so I added an arbitrary threshold to each graph to give them the same scale. You can see we're experiencing way less throttling with the new pool.
Need to cost out and estimate quota before rolling this out more generally. The numbers look good enough that I'm interested in doing so.

However... I legitimately can't tell that there's any immediately obvious speedup from doing this. I'll let the other jobs finish and take a look at PR history for a quick check tomorrow.

The only other thing I can think this might allow is lowering some CPU/memory resource limits to pack jobs more densely, if they're in fact not going to be as noisy to each other. That will probably require more attention than I have time for right now.
https://cloud.google.com/compute/disks-image-pricing#localssdpricing - Local SSDs are $30/mo, so x2 = $60/mo
https://cloud.google.com/compute/vm-instance-pricing - n1-highmem-8 instances are ~$241/mo
pool4 instances are n1-highmem-8 + 500GB pd-ssd = 241 + 85 = $326/mo

That's about 7% savings... we could bring that to 16% if we used only 1 local SSD.

Looking at our total spend just for k8s-infra-prow-build over the last year, it was ~$258K. 7% savings would be ~$18K, 16% ~$40K. Not nothing, but not incredibly significant compared against our total budget.

Quota:
Conclusions:
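To restate the cost arithmetic above in one place (all prices are the rough monthly figures quoted in that comment, not authoritative pricing):

```python
# Rough per-node monthly costs in $/mo, from the comment above
N1_HIGHMEM_8 = 241       # machine type, per the VM pricing page figure quoted above
PD_SSD_500GB = 85        # 500 GB pd-ssd at ~$0.17/GB/mo
LOCAL_SSD = 30           # per local SSD (each is 375 GB)

pool4 = N1_HIGHMEM_8 + PD_SSD_500GB             # current config: 326
with_2_local_ssd = N1_HIGHMEM_8 + 2 * LOCAL_SSD
with_1_local_ssd = N1_HIGHMEM_8 + 1 * LOCAL_SSD

annual_spend = 258_000   # ~last year's k8s-infra-prow-build spend

for label, cost in [("2x local SSD", with_2_local_ssd), ("1x local SSD", with_1_local_ssd)]:
    savings = 1 - cost / pool4
    print(f"{label}: ${cost}/node/mo, ~{savings:.0%} savings, ~${annual_spend * savings:,.0f}/yr")
# 2x local SSD: $301/node/mo, ~8% savings, ~$19,785/yr
# 1x local SSD: $271/node/mo, ~17% savings, ~$43,528/yr
```

(The comment above rounds these down slightly; same ballpark either way.)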
Looks like I'm going to have to force recreation of the node pool to drop the taint
Out of curiosity, why 2 local SSDs?
Why not fewer:
Why not more:
OK, migration to the new nodepool with local SSDs for ephemeral storage is complete, see #2839 for details. Throttled I/O is way down post-migration: throttled bytes from old pool nodes are on the left, new pool nodes on the right.
I'll hold this open for a day to see if this had any negative impact, but I otherwise consider this issue closed.

It'll take a bit to determine whether this has had any impact on job / build time. Again, my guess based on a brief survey of the canary jobs from yesterday is negligible impact at best, but hopefully fewer noisy neighbors. We lack a great way to display this data at present, though I suspect the data will be available in some form in BigQuery.
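If we do end up checking in BigQuery, something like this is the shape of it. This is only a sketch: the kettle dataset name and the `started` / `elapsed` columns are what I remember of that schema and haven't been re-verified, and the cutover date is a placeholder:

```python
from google.cloud import bigquery

CUTOVER = "2021-10-01"   # placeholder; substitute the actual cutover date
WINDOW_DAYS = 30         # compare this many days on each side of the cutover

client = bigquery.Client(project="k8s-infra-prow-build")

# Average duration of one presubmit before vs. after the cutover (assumed schema)
query = f"""
SELECT
  IF(started < UNIX_SECONDS(TIMESTAMP('{CUTOVER}')), 'before', 'after') AS period,
  COUNT(*) AS runs,
  ROUND(AVG(elapsed) / 60, 1) AS avg_minutes
FROM `k8s-gubernator.build.all`
WHERE job = 'pull-kubernetes-e2e-gce'
  AND ABS(started - UNIX_SECONDS(TIMESTAMP('{CUTOVER}'))) < {WINDOW_DAYS} * 86400
GROUP BY period
"""

for row in client.query(query).result():
    print(row.period, row.runs, row.avg_minutes)
```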
/close
@spiffxp: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign
This is a followup to #1168 and #1173, made possible by quota changes done via #1132.
My goal is to make these graphs go down
Our jobs are hitting I/O limits (both IOPS and throughput). That was made extra clear last weekend when we switched to larger nodes, thus causing more jobs to share the same amount of I/O.
We're seeing more jobs scheduled into the cluster now that v1.20 PRs are being merged. While our worst case node performance is about the same, we are seeing more throttling across the cluster in aggregate.
Kubernetes doesn't give us a way to provision I/O, so we're left optimizing per-node performance. Based on https://cloud.google.com/compute/docs/disks/performance I think we can get just under 2x the IOPS for a ~14% increase in cluster cost.
From there, going to the next tier would require a 90% increase in cost for only 66% more performance.
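To make the tradeoff above concrete, here's a minimal sketch of how pd-ssd performance and price scale with size. The 30 IOPS/GB and ~$0.17/GB/mo figures are my reading of the performance and pricing docs, and per-instance caps (which depend on machine type) aren't modeled, so treat the output as illustrative:

```python
# pd-ssd performance scales linearly with provisioned size, up to per-instance
# caps that depend on the machine type (not modeled here)
IOPS_PER_GB = 30           # read/write IOPS per GB for pd-ssd (per the performance doc)
PRICE_PER_GB_MONTH = 0.17  # $/GB/mo for pd-ssd (per the pricing doc)

def pd_ssd(size_gb: int) -> tuple[int, float]:
    """Return (uncapped IOPS, $/mo) for a pd-ssd of the given size."""
    return size_gb * IOPS_PER_GB, size_gb * PRICE_PER_GB_MONTH

for size_gb in (250, 500, 1000):
    iops, cost = pd_ssd(size_gb)
    print(f"{size_gb:>4} GB pd-ssd: ~{iops:>6,} IOPS, ~${cost:.0f}/mo")
#  250 GB pd-ssd: ~ 7,500 IOPS, ~$42/mo
#  500 GB pd-ssd: ~15,000 IOPS, ~$85/mo
# 1000 GB pd-ssd: ~30,000 IOPS, ~$170/mo
```

Doubling the disk roughly doubles its IOPS while only the disk portion of the per-node price grows, which is roughly where the "~14% more cluster cost" figure above comes from.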
The most ideal thing would be local SSD, but: could we replace emptyDir volumes with hostPath volumes? That sounds like a maintenance nightmare.