
Allow clusters with different infrastructure "vmType" node pools #338

Closed
CecileRobertMichon opened this issue Jun 4, 2020 · 23 comments

@CecileRobertMichon (Contributor)

Is your feature request related to a problem? Please describe.

It is very limiting to only be able to have nodes of one infrastructure type (vmas or vmss) per cluster. Some projects such as CAPZ allow adding and removing node pools and support both vmas and vmss (and maybe even orchestration modes in the future).

Describe the solution you'd like in detail

What are the technical blockers for enabling clusters with both VMSS and VMAS? From what I've seen, the cloud provider seems to rely on the control plane's cloud provider config (azure.json) to reconcile the LoadBalancer used to expose services and to add nodes to a backend pool. It assumes VM or Scale Set instance naming based on the vmType. One thing to explore is enabling the cloud provider to detect all VMs and VMSS instances in the cluster on its own, without needing to specify a vmType.
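For reference, a rough sketch of the kind of azure.json the cloud provider consumes today (field names as I understand the Azure cloud provider config; all values are placeholders, not a real cluster):

    {
      "cloud": "AzurePublicCloud",
      "tenantId": "<tenant-id>",
      "subscriptionId": "<subscription-id>",
      "resourceGroup": "<cluster-resource-group>",
      "location": "<region>",
      "vmType": "vmss",
      "loadBalancerSku": "standard",
      "primaryScaleSetName": "<node-pool-vmss-name>"
    }

The vmType field (and the corresponding primaryScaleSetName / primaryAvailabilitySetName) is what bakes a single infrastructure type into the whole cluster.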

Describe alternatives you've considered

Additional context

kubernetes-sigs/cluster-api-provider-azure#680

@feiskyer (Member) commented Jun 5, 2020

Setting vmType to "vmss" would support both VMSS and VMAS nodes.

@CecileRobertMichon (Author)

@feiskyer why do we need vmType then? If "vmss" supports both, can't we remove that field and always support both? Why would you ever want to set it to "vmas"?

@feiskyer (Member) commented Jun 9, 2020

@CecileRobertMichon it is for ensuring that new features won't break old clusters: new features on VMSS should not break VMAS nodes.

@CecileRobertMichon (Author)

Understood, thanks for the explanation @feiskyer! So do you recommend using vmType vmss for all new clusters?

@feiskyer (Member)

It depends. If there are only VMAS nodes in the cluster and a basic LB is used, vmType shouldn't be vmss.

@CecileRobertMichon (Author)

OK, so if we're using only standard LBs in CAPZ, we can always set vmType to vmss, even if the cluster is VMAS only?
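In other words, would a cloud provider config along these lines be correct for a VMAS-only CAPZ cluster? (Just a sketch to confirm my understanding; everything else omitted.)

    {
      "vmType": "vmss",
      "loadBalancerSku": "standard"
    }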

@feiskyer (Member)

@CecileRobertMichon right.

@CecileRobertMichon (Author)

/close

@k8s-ci-robot (Contributor)

@CecileRobertMichon: Closing this issue.

In response to this:

/close


@jsturtevant (Contributor)

After this update was made to Cluster API, I am seeing the following in my logs for the controller manager:

W0616 16:51:13.328071       1 node_lifecycle_controller.go:1048] Missing timestamp for Node capz-cluster-3-md-0-szr8j. Assuming now as a timestamp.
E0616 16:51:15.101801       1 node_lifecycle_controller.go:155] error checking if node capz-cluster-3-md-0-szr8j is shutdown: not a vmss instance
E0616 16:51:20.199133       1 node_lifecycle_controller.go:155] error checking if node capz-cluster-3-md-0-szr8j is shutdown: not a vmss instance

@feiskyer is this something that is a concern?

@CecileRobertMichon (Author)

/reopen

@feiskyer any ideas on the above error that James posted?

k8s-ci-robot reopened this Jul 6, 2020

@k8s-ci-robot (Contributor)

@CecileRobertMichon: Reopened this issue.

In response to this:

/reopen

@feiskyer any ideas on the above error that James posted?


@feiskyer (Member) commented Jul 7, 2020

@CecileRobertMichon could you set the log level to --v=3 and check which logs may be related to the error?

@alexeldeib commented Jul 20, 2020

I tried with --v=3 and created a cluster with a VMAS control plane and workers (no VMSS) while setting vmType to vmss. I didn't get a more verbose error than @jsturtevant did; I'm upping the verbosity further and will report back.

I did validate that this configuration works fine with LB services, so I'm not sure if there's any real functional impact. @jsturtevant did you see any functional failures besides the errors in the logs?

@alexeldeib commented Jul 20, 2020

Looks like we hit https://github.com/kubernetes/kubernetes/blob/5a529aa3a0dd3a050c5302329681e871ef6c162e/staging/src/k8s.io/cloud-provider/controllers/nodelifecycle/node_lifecycle_controller.go#L180-L183

Because this code doesn't get triggered: https://github.com/kubernetes/legacy-cloud-providers/blob/1f0c70db5b78070858ece1ef4d00513917e0a5e0/azure/azure_vmss.go#L192-L200

So we hit https://github.com/kubernetes/legacy-cloud-providers/blob/1f0c70db5b78070858ece1ef4d00513917e0a5e0/azure/azure_vmss.go#L202-L204, which doesn't find the instance because it is (technically) neither VMSS nor VMAS.

Seems like a CAPZ issue -- we aren't creating Availability Sets for our nodes, just standalone VMs. I think if we put the nodes in an AS, it will "just work"? The error from the calling code (first link) doesn't seem to be terminal anyway, but we should fix it.

@feiskyer sound about right to you?

@alexeldeib

With a Standard LB we need neither primaryScaleSetName nor primaryAvailabilitySetName, if I'm reading the code correctly? Seems these are only required for a Basic LB (which we don't have in CAPZ):

https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/legacy-cloud-providers/azure/azure_loadbalancer.go#L299
https://github.com/kubernetes/kubernetes/blob/5a529aa3a0dd3a050c5302329681e871ef6c162e/staging/src/k8s.io/legacy-cloud-providers/azure/azure_loadbalancer.go#L332

@feiskyer (Member)

Seems like a CAPZ issue -- we aren't creating Availability Sets for our nodes, just standalone VMs. I think if we put the nodes in an AS, it will "just work"? The error from the calling code (first link) doesn't seem to be terminal anyway, but we should fix it.

Could you make all VMs part of either a VMAS or a VMSS in CAPZ?

With a Standard LB we need neither primaryScaleSetName nor primaryAvailabilitySetName, if I'm reading the code correctly? Seems these are only required for a Basic LB (which we don't have in CAPZ).

That's right. All VMs would be added to the SLB backend pool.

@alexeldeib

Thanks for the quick reply. We have an open issue to use availability sets when AZs aren't available.

I think I've figured out most of the rest; a couple more questions if you don't mind:

resourceGroup <-> vnetResourceGroup: these can be different if the compute resources live in one resource group but the vnet lives in another? Usually they would be the same.

resourceGroup <-> loadBalancerResourceGroup: can these be different? Under what circumstances?
loadBalancerResourceGroup <-> vnetResourceGroup: can these be different?

Most of the configuration values are for worker resources, right? e.g. subnet name, security group name. I see the SG name is used for the worker load balancer. Which values should differ between e.g. workers and control plane? Only credentials, if desired? And what about between two node pools -- what's the impact of e.g. the subnet being different in the cloud provider config, if any?
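For concreteness on the resource group questions, would a config along these lines be valid? (Purely a hypothetical sketch; all names are made-up placeholders.)

    {
      "resourceGroup": "capz-nodes-rg",
      "vnetResourceGroup": "capz-network-rg",
      "vnetName": "capz-vnet",
      "subnetName": "capz-node-subnet",
      "securityGroupName": "capz-node-nsg",
      "loadBalancerResourceGroup": "capz-lb-rg",
      "loadBalancerSku": "standard",
      "vmType": "vmss"
    }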

@feiskyer (Member)

Yeah, all of the above resource groups could be different.

@alexeldeib

Any comment on the last bit (subnet)? I'm curious how the cloud provider uses this with multiple agent pools if they are in different subnets.

@alexeldeib

Seems like it's used for ILB only: https://github.com/kubernetes/kubernetes/blob/1fdd8fb213e0361e8f18b1dd152dddb4c88ad183/staging/src/k8s.io/legacy-cloud-providers/azure/azure_loadbalancer.go#L782
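If I'm reading it right, that means subnetName only comes into play for a Service that asks for an internal LB, something like this sketch (placeholder names; I believe the per-service internal-subnet annotation can override the config value):

    {
      "apiVersion": "v1",
      "kind": "Service",
      "metadata": {
        "name": "internal-app",
        "annotations": {
          "service.beta.kubernetes.io/azure-load-balancer-internal": "true"
        }
      },
      "spec": {
        "type": "LoadBalancer",
        "selector": { "app": "internal-app" },
        "ports": [{ "port": 80, "targetPort": 8080 }]
      }
    }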

Also, this can probably be closed; I'm just abusing the issue to clarify the other fields we need 🙂

/close

@k8s-ci-robot (Contributor)

@alexeldeib: Closing this issue.

In response to this:

Seems like it's used for ILB only: https://github.com/kubernetes/kubernetes/blob/1fdd8fb213e0361e8f18b1dd152dddb4c88ad183/staging/src/k8s.io/legacy-cloud-providers/azure/azure_loadbalancer.go#L782

Also, this can probably be closed; I'm just abusing the issue to clarify the other fields we need 🙂

/close

