
Allow clusters with different infrastructure "vmType" node pools #338

Closed
CecileRobertMichon opened this issue Jun 4, 2020 · 23 comments

@CecileRobertMichon (Contributor)

Is your feature request related to a problem? Please describe.

It is very limiting to only be able to have nodes of one infrastructure type (vmas or vmss) per cluster. Some projects such as CAPZ allow adding and removing node pools and support both vmas and vmss (and maybe even orchestration modes in the future).

Describe the solution you'd like in detail

What are the technical blockers for enabling clusters with both VMSS and VMAS? From what I've seen, the cloud provider seems to rely on the control plane's cloud provider config (azure.json) to reconcile the LoadBalancer used to expose services and to add nodes to a backend pool. It assumes VM or Scale Set instance naming based on the vmType. One thing to explore is enabling the cloud provider to detect all VMs and VMSS instances in the cluster on its own, without needing to specify a vmType.
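For reference, a rough sketch of the kind of azure.json the cloud provider consumes today (field names as I understand the Azure cloud provider config; all values are placeholders, not a real cluster):

    {
      "cloud": "AzurePublicCloud",
      "tenantId": "<tenant-id>",
      "subscriptionId": "<subscription-id>",
      "resourceGroup": "<cluster-resource-group>",
      "location": "<region>",
      "vmType": "vmss",
      "loadBalancerSku": "standard",
      "primaryScaleSetName": "<node-pool-vmss-name>"
    }

The vmType field (and the corresponding primaryScaleSetName / primaryAvailabilitySetName) is what bakes a single infrastructure type into the whole cluster.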

Describe alternatives you've considered

Additional context

kubernetes-sigs/cluster-api-provider-azure#680

@feiskyer (Member) commented Jun 5, 2020

Setting vmType to "vmss" would support both VMSS and VMAS nodes.

@CecileRobertMichon (Author)

@feiskyer why do we need vmType then? If "vmss" supports both, can't we remove that field and always support both? Why would you ever want to set it to "vmas"?

@feiskyer (Member) commented Jun 9, 2020

@CecileRobertMichon it is for ensuring that new features won't break old clusters: new features on VMSS should not break VMAS nodes.

@CecileRobertMichon (Author)

Understood, thanks for the explanation @feiskyer! So do you recommend using vmType vmss for all new clusters?

@feiskyer (Member)

It depends. If there are only VMAS nodes in the cluster and a basic LB is used, vmType shouldn't be vmss.

@CecileRobertMichon (Author)

OK, so if we're using only standard LBs in CAPZ, we can always set vmType to vmss, even if the cluster is VMAS only?
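In other words, would a cloud provider config along these lines be correct for a VMAS-only CAPZ cluster? (Just a sketch to confirm my understanding; everything else omitted.)

    {
      "vmType": "vmss",
      "loadBalancerSku": "standard"
    }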

@feiskyer (Member)

@CecileRobertMichon right.

@CecileRobertMichon (Author)

/close

@k8s-ci-robot (Contributor)

@CecileRobertMichon: Closing this issue.

In response to this:

/close


@jsturtevant (Contributor)

After this update was made to Cluster API, I am seeing the following in my logs for the controller manager:

W0616 16:51:13.328071       1 node_lifecycle_controller.go:1048] Missing timestamp for Node capz-cluster-3-md-0-szr8j. Assuming now as a timestamp.
E0616 16:51:15.101801       1 node_lifecycle_controller.go:155] error checking if node capz-cluster-3-md-0-szr8j is shutdown: not a vmss instance
E0616 16:51:20.199133       1 node_lifecycle_controller.go:155] error checking if node capz-cluster-3-md-0-szr8j is shutdown: not a vmss instance

@feiskyer is this something that is a concern?

@CecileRobertMichon (Author)

/reopen

@feiskyer any ideas on the above error that James posted?

k8s-ci-robot reopened this Jul 6, 2020

@k8s-ci-robot (Contributor)

@CecileRobertMichon: Reopened this issue.

In response to this:

/reopen

@feiskyer any ideas on the above error that James posted?


@feiskyer (Member) commented Jul 7, 2020

@CecileRobertMichon could you set the log level to --v=3 and check which logs may be related to the error?

@alexeldeib commented Jul 20, 2020

I tried with --v=3 and created a cluster with a VMAS control plane and workers (no VMSS) while setting vmType to vmss. I didn't get a more verbose error than @jsturtevant did; I'm upping the verbosity further and will report back.

I did validate that this configuration works fine with LB services, so I'm not sure if there's any real functional impact. @jsturtevant did you see any functional failures besides the errors in the logs?

@alexeldeib commented Jul 20, 2020

Looks like we hit https://github.com/kubernetes/kubernetes/blob/5a529aa3a0dd3a050c5302329681e871ef6c162e/staging/src/k8s.io/cloud-provider/controllers/nodelifecycle/node_lifecycle_controller.go#L180-L183

Because this code doesn't get triggered: https://github.com/kubernetes/legacy-cloud-providers/blob/1f0c70db5b78070858ece1ef4d00513917e0a5e0/azure/azure_vmss.go#L192-L200

So we hit https://github.com/kubernetes/legacy-cloud-providers/blob/1f0c70db5b78070858ece1ef4d00513917e0a5e0/azure/azure_vmss.go#L202-L204, which doesn't find the instance because it is (technically) neither VMSS nor VMAS.

Seems like a CAPZ issue -- we aren't creating Availability Sets for our nodes, just standalone VMs. I think if we put the nodes in an AS, it will "just work"? The error from the calling code (first link) doesn't seem to be terminal anyway, but we should fix it.

@feiskyer sound about right to you?

@alexeldeib

With a Standard LB we need neither primaryScaleSetName nor primaryAvailabilitySetName, if I'm reading the code correctly? Seems these are only required for a Basic LB (which we don't have in CAPZ):

https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/legacy-cloud-providers/azure/azure_loadbalancer.go#L299
https://github.com/kubernetes/kubernetes/blob/5a529aa3a0dd3a050c5302329681e871ef6c162e/staging/src/k8s.io/legacy-cloud-providers/azure/azure_loadbalancer.go#L332

@feiskyer (Member)

Seems like a CAPZ issue -- we aren't creating Availability Sets for our nodes, just standalone VMs. I think if we put the nodes in an AS, it will "just work"? The error from the calling code (first link) doesn't seem to be terminal anyway, but we should fix it.

Could you make all VMs part of either a VMAS or a VMSS in CAPZ?

With a Standard LB we need neither primaryScaleSetName nor primaryAvailabilitySetName, if I'm reading the code correctly? Seems these are only required for a Basic LB (which we don't have in CAPZ).

That's right. All VMs would be added to the SLB backend pool.

@alexeldeib

Thanks for the quick reply. We have an open issue to use availability sets when AZs aren't available.

I think I've figured out most of the rest; a couple more questions if you don't mind:

resourceGroup <-> vnetResourceGroup: these can be different if the compute resources live in one resource group but the vnet lives in another? Usually they would be the same.

resourceGroup <-> loadBalancerResourceGroup: can these be different? Under what circumstances?
loadBalancerResourceGroup <-> vnetResourceGroup: can these be different?

Most of the configuration values are for worker resources, right? e.g. subnet name, security group name. I see the SG name is used for the worker load balancer. Which values should differ between e.g. workers and control plane? Only credentials, if desired? And what about between two node pools -- what's the impact of e.g. the subnet being different in the cloud provider config, if any?
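For concreteness on the resource group questions, would a config along these lines be valid? (Purely a hypothetical sketch; all names are made-up placeholders.)

    {
      "resourceGroup": "capz-nodes-rg",
      "vnetResourceGroup": "capz-network-rg",
      "vnetName": "capz-vnet",
      "subnetName": "capz-node-subnet",
      "securityGroupName": "capz-node-nsg",
      "loadBalancerResourceGroup": "capz-lb-rg",
      "loadBalancerSku": "standard",
      "vmType": "vmss"
    }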

@feiskyer (Member)

Yeah, all of the above resource groups could be different.

@alexeldeib

Any comment on the last bit (subnet)? I'm curious how the cloud provider uses this with multiple agent pools if they are in different subnets.

@alexeldeib

Seems like it's used for ILB only: https://github.com/kubernetes/kubernetes/blob/1fdd8fb213e0361e8f18b1dd152dddb4c88ad183/staging/src/k8s.io/legacy-cloud-providers/azure/azure_loadbalancer.go#L782
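If I'm reading it right, that means subnetName only comes into play for a Service that asks for an internal LB, something like this sketch (placeholder names; I believe the per-service internal-subnet annotation can override the config value):

    {
      "apiVersion": "v1",
      "kind": "Service",
      "metadata": {
        "name": "internal-app",
        "annotations": {
          "service.beta.kubernetes.io/azure-load-balancer-internal": "true"
        }
      },
      "spec": {
        "type": "LoadBalancer",
        "selector": { "app": "internal-app" },
        "ports": [{ "port": 80, "targetPort": 8080 }]
      }
    }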

Also, this can probably be closed; I'm just abusing the issue to clarify the other fields we need 🙂

/close

@k8s-ci-robot (Contributor)

@alexeldeib: Closing this issue.

In response to this:

Seems like it's used for ILB only: https://github.com/kubernetes/kubernetes/blob/1fdd8fb213e0361e8f18b1dd152dddb4c88ad183/staging/src/k8s.io/legacy-cloud-providers/azure/azure_loadbalancer.go#L782

Also, this can probably be closed; I'm just abusing the issue to clarify the other fields we need 🙂

/close

