Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch to Standard_B2s VM SKU by default #2750

Merged
merged 1 commit into from
Dec 8, 2022

Conversation

mboersma
Copy link
Contributor

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

An internal Azure audit suggested some of our accounts try to use Standard_B2s VM SKUs instead of our default Standard_D2s_v3. I've been using it locally for all my testing and development and haven't seen any problems, so let's see if it might work for others (in particular, our CI account).

Which issue(s) this PR fixes:

N/A

Special notes for your reviewer:

The difference in the two VM SKUs is this:

SKU Cores Memory Storage Price
B2s 2 4GB RAM 8GB temp $0.058 / hr
D2s_v3 2 8GB RAM 16GB temp $0.209 / hr

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

Switch to Standard_B2s VM SKU by default

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 24, 2022
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 24, 2022
@mboersma
Copy link
Contributor Author

/hold

Even if this PR passes, we need to check that the test account has sufficient quota and that there is availability in most regions.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 24, 2022
@mboersma
Copy link
Contributor Author

So close, but AKS doesn't like it:

E1027 19:56:50.700817       1 controller.go:326]  
"msg"="Reconciler error" 
"error"="error creating AzureManagedControlPlane capz-e2e-t4o6mu/capz-e2e-t4o6mu-aks: failed to reconcile AzureManagedControlPlane service 
managedcluster: failed to create resource capz-e2e-t4o6mu-aks/capz-e2e-t4o6mu-aks (service: managedcluster): 
Code=\"VMCannotFitEphemeralOSDisk\" 
Message=\"The virtual machine size Standard_B2s has a cache size of 32212254720 bytes and temporary disk size of 8589934592 bytes, but the OS disk requires 42949672960 bytes. 
  Use a VM size with larger cache, larger temp disk, or disable ephemeral OS.\"" 
"azureManagedControlPlane"={"name":"capz-e2e-t4o6mu-aks","namespace":"capz-e2e-t4o6mu"} "controller"="azuremanagedcontrolplane" "controllerGroup"="infrastructure.cluster.x-k8s.io" 

@CecileRobertMichon
Copy link
Contributor

So close, but AKS doesn't like it

we could use a separate VM size for ephemeral OS scenarios

@mboersma mboersma force-pushed the standard-b2s branch 2 times, most recently from 3bf4fe8 to db714e9 Compare November 8, 2022 22:33
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 9, 2022
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 10, 2022
@nawazkh
Copy link
Member

nawazkh commented Nov 14, 2022

Ooh cost saving, nice!

we could use a separate VM size for ephemeral OS scenarios

How do we achieve handpicking VMs in "ephemeral OS scenarios" for testing in CI ? Via separate configs?
Also, what is implied by size in the statement? Is it the VM's memory, storage, or cache? Or all of it?

@jackfrancis jackfrancis added this to the v1.7 milestone Dec 1, 2022
@mboersma
Copy link
Contributor Author

mboersma commented Dec 7, 2022

How do we achieve handpicking VMs in "ephemeral OS scenarios" for testing in CI ?

It's a good question. I'm not sure yet. I'm hoping just to override AZURE_NODE_MACHINE_TYPE for the relevant test specs, but use Standard_B2s where we can.

@mboersma
Copy link
Contributor Author

mboersma commented Dec 7, 2022

/hold cancel

The CI Azure subscription appears to have generous quota for Standard BS family, equal to or better than it has for Standard DS Family.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 7, 2022
@jackfrancis
Copy link
Contributor

/lgtm
/approve

Thanks for sticking with this @mboersma!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 7, 2022
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jackfrancis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 7, 2022
@@ -65,9 +65,9 @@ worker-templates:
AZURE_RESOURCE_GROUP: test-resource-group-name
CONTROL_PLANE_MACHINE_COUNT: "1"
KUBERNETES_VERSION: v1.22.1
AZURE_CONTROL_PLANE_MACHINE_TYPE: Standard_D2s_v3
AZURE_CONTROL_PLANE_MACHINE_TYPE: Standard_B2s
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also effectively recommend that users start with Standard_B2s? It seems to handle our e2e workloads, but I don't know whether real-world use might find it lacking.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that the current existence of Standard_D2s_v3 is anything more than a "here's a known-working, barely operational example" for a VM SKU.

In other words, we have a long history of including in our reference configs the VM SKU that we use for testing — which includes in its positive criteria "cost-efficient" — (AFAICT this is the reason we use Standard_D2s_v3, not that it's actually a SKU we would recommend for real-world production use) so I don't think this amounts to a significant change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me, thanks for the thoughtful comment.

Also, it's really in the Cluster API Book Quick Start that we say anything that could be interpreted as a recommendation:

# Select VM types.
export AZURE_CONTROL_PLANE_MACHINE_TYPE="Standard_D2s_v3"
export AZURE_NODE_MACHINE_TYPE="Standard_D2s_v3"

and that's not going to change with this PR. So I guess this doesn't rock the boat that badly.

@mboersma
Copy link
Contributor Author

mboersma commented Dec 7, 2022

/hold

Just had a question above about whether Standard_B2s should be in docs, or just in our e2e configs.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 7, 2022
@mboersma
Copy link
Contributor Author

mboersma commented Dec 7, 2022

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 7, 2022
@nawazkh
Copy link
Member

nawazkh commented Dec 7, 2022

/lgtm++

@mboersma
Copy link
Contributor Author

mboersma commented Dec 7, 2022

/retest-required

Looks like a flake where 2/3 nodes came up, which I've seen outside this PR.

@k8s-ci-robot k8s-ci-robot merged commit 22cd003 into kubernetes-sigs:main Dec 8, 2022
@mboersma mboersma deleted the standard-b2s branch December 8, 2022 15:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants