Integrate AzureMachine with AzureManagedControlPlane (BYO nodes on AKS) #826

Closed
alexeldeib opened this issue Jul 23, 2020 · 9 comments · Fixed by #4052
Labels
  • area/managedclusters: Issues related to managed AKS clusters created through the CAPZ ManagedCluster Type
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
  • priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
  • size/XXL: Denotes a PR that changes 1000+ lines, ignoring generated files.
Milestone: v1.12

Comments

@alexeldeib
Contributor

alexeldeib commented Jul 23, 2020

/kind feature

Describe the solution you'd like
Extend CAPZ as necessary to allow CAPZ-managed AzureMachines to join AKS clusters provisioned by AzureManagedControlPlane. A PoC exists in #822, which itself relies on #824 and #825. On top of those changes, we need to add logic in AzureMachineController to allow instantiating the appropriate cluster describer at runtime using the duck typing from #825.
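
To make the duck-typing idea concrete, here is a minimal sketch of how the AzureMachine controller could pick a cluster describer at runtime depending on whether the owning Cluster is backed by an AzureCluster or an AzureManagedControlPlane. The interface and type names are illustrative stand-ins for whatever #824/#825 settle on, not CAPZ's actual API:

```go
// Sketch only: names are hypothetical stand-ins for the duck typing from
// #824/#825, not CAPZ's actual API.
package example

import "fmt"

// ClusterDescriber is the duck-typed view of "the object that owns
// cluster-wide Azure state" (resource group, location, vnet, ...). Both the
// AzureCluster scope and the AzureManagedControlPlane scope would satisfy it.
type ClusterDescriber interface {
	ResourceGroup() string
	Location() string
	Vnet() string
}

// Stub scopes standing in for the real scope types.
type azureClusterScope struct{}

func (azureClusterScope) ResourceGroup() string { return "capz-rg" }
func (azureClusterScope) Location() string      { return "eastus" }
func (azureClusterScope) Vnet() string          { return "capz-vnet" }

type managedControlPlaneScope struct{}

func (managedControlPlaneScope) ResourceGroup() string { return "aks-node-rg" }
func (managedControlPlaneScope) Location() string      { return "eastus" }
func (managedControlPlaneScope) Vnet() string          { return "aks-vnet" }

// clusterDescriberFor is where the AzureMachine controller would choose the
// describer at runtime, based on the kind of the object the owning Cluster
// references (infrastructureRef vs. controlPlaneRef).
func clusterDescriberFor(kind string) (ClusterDescriber, error) {
	switch kind {
	case "AzureCluster":
		return azureClusterScope{}, nil
	case "AzureManagedControlPlane":
		return managedControlPlaneScope{}, nil
	default:
		return nil, fmt.Errorf("unsupported cluster describer kind %q", kind)
	}
}
```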

Anything else you would like to add:

Currently there are two issues with the PoC:

  • Nodes register with master labels. I think this is because we use the admin kubeconfig to join nodes; we need to drop privileges and join another way instead. We could also try a token join via a kubeconfig for a low-privilege service account (see the sketch after this list).
  • Need to validate azure-cni functionality and/or debug any issues there.
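
On the privilege-dropping point, one possible direction (a sketch, not part of the PoC; whether AKS's control plane accepts kubeadm-style token joins is exactly the kind of thing that needs validating) is to mint a standard bootstrap token in the target cluster and have nodes join with that instead of the admin kubeconfig. Bootstrap tokens are ordinary Secrets in kube-system, e.g. with client-go:

```go
// Sketch: create a kubeadm-style bootstrap token Secret so nodes can do a
// token join instead of using the admin kubeconfig. Token values below are
// placeholders; generate real ones securely.
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func createBootstrapToken(ctx context.Context, cs kubernetes.Interface) error {
	tokenID, tokenSecret := "abcdef", "0123456789abcdef" // placeholders
	secret := &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "bootstrap-token-" + tokenID,
			Namespace: metav1.NamespaceSystem,
		},
		Type: corev1.SecretTypeBootstrapToken,
		StringData: map[string]string{
			"token-id":                       tokenID,
			"token-secret":                   tokenSecret,
			"usage-bootstrap-authentication": "true",
			"usage-bootstrap-signing":        "true",
			// Join as the standard kubeadm bootstrappers group rather than
			// anything cluster-admin-like.
			"auth-extra-groups": "system:bootstrappers:kubeadm:default-node-token",
		},
	}
	_, err := cs.CoreV1().Secrets(metav1.NamespaceSystem).Create(ctx, secret, metav1.CreateOptions{})
	return err
}
```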

Warning

This is 100% unsupported by AKS right now; I just think it's cool, and I don't see any technical reason it can't work. This feature would remain experimental indefinitely unless AKS supported BYO nodes.

@k8s-ci-robot added the kind/feature label Jul 23, 2020
@CecileRobertMichon added this to the next milestone Aug 4, 2020
@alexeldeib
Contributor Author

alexeldeib commented Sep 14, 2020

Issues I encountered working on https://github.com/kubernetes-sigs/cluster-api-provider-azure/compare/master...alexeldeib:ace/integrate?expand=1

Critical issues // functional problems

  • Currently nodes use an admin kubeconfig to join the cluster, then patch their own role and labels to drop privileges and become agents. This is awful security practice, and I feel it is the most critical item to fix before anyone could use this.
  • Azure CNI requires pre-allocating IPs for pods on the VMs, which is not currently handled by CAPZ. This would be a critical requirement for anyone using Azure CNI, unless we can take advantage of another IPAM/IP-allocation scheme.
  • Joining nodes with CAPBK currently uses 2 preKubeadmCommands and 2 postKubeadmCommands, which is less than ideal. We should consider encapsulating this more cleanly, either in a bootstrap provider or a VHD/script somehow (see the sketch below).
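
For reference, this is roughly where those hooks live in CAPBK's Go types (using today's v1beta1 API path, which postdates this issue; the command strings are placeholders, not the actual PoC scripts). The encapsulation question is whether this setup belongs in a dedicated bootstrap provider or a VHD-baked script instead:

```go
// Sketch: where the pre/post-kubeadm hooks live in a CAPBK KubeadmConfigTemplate.
// Command paths are placeholders, not the real PoC scripts.
package example

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	bootstrapv1 "sigs.k8s.io/cluster-api/bootstrap/kubeadm/api/v1beta1"
)

func workerBootstrapTemplate() *bootstrapv1.KubeadmConfigTemplate {
	return &bootstrapv1.KubeadmConfigTemplate{
		ObjectMeta: metav1.ObjectMeta{Name: "aks-byon-workers"},
		Spec: bootstrapv1.KubeadmConfigTemplateSpec{
			Template: bootstrapv1.KubeadmConfigTemplateResource{
				Spec: bootstrapv1.KubeadmConfigSpec{
					// The PoC currently spreads the AKS-specific setup across
					// these hooks; encapsulating it elsewhere would remove them.
					PreKubeadmCommands: []string{
						"/opt/azure/install-azure-cni.sh",     // placeholder
						"/opt/azure/fetch-bootstrap-creds.sh", // placeholder
					},
					PostKubeadmCommands: []string{
						"/opt/azure/label-node.sh",    // placeholder
						"/opt/azure/cleanup-creds.sh", // placeholder
					},
				},
			},
		},
	}
}
```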

Minor issues // implementation details

  • The group service should not always attempt to reconcile the resource group, and it should differentiate node and control plane resource groups.
  • The AKS vnet name is not predictable, and nodes require it for provisioning. The managed cluster API does not return the vnet name, but we can extract it from the node pools API, which contains vnetSubnetID as a fully qualified ARM resource ID (see the sketch after this list).
  • Anywhere that currently uses a cluster scope now needs to accept either an AzureCluster or an AzureManagedControlPlane as a ClusterDescriber, increasing code bloat.
  • Patching node labels/roles seems to work well on VMs, but seems to have issues on VMSS when nodes get deleted and recreated?
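
Since vnetSubnetID is a full ARM resource ID (.../virtualNetworks/<vnet>/subnets/<subnet>), pulling out the vnet name is just ID parsing. A minimal sketch (the Azure SDK's resource-ID parsing helpers could also be used instead of a plain string split):

```go
// Sketch: extract the vnet name from an agent pool's vnetSubnetID, a fully
// qualified ARM resource ID of the form
// /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Network/virtualNetworks/<vnet>/subnets/<subnet>
package example

import (
	"fmt"
	"strings"
)

func vnetNameFromSubnetID(vnetSubnetID string) (string, error) {
	parts := strings.Split(strings.Trim(vnetSubnetID, "/"), "/")
	for i := 0; i < len(parts)-1; i++ {
		// The segment after "virtualNetworks" is the vnet name.
		if strings.EqualFold(parts[i], "virtualNetworks") {
			return parts[i+1], nil
		}
	}
	return "", fmt.Errorf("no virtualNetworks segment in %q", vnetSubnetID)
}
```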

Other observations // no action necessarily required

  • The CAPZ VHD does not include Azure CNI, which would be required for nodes joining an AKS cluster.
  • The AKS VHD does not include kubeadm, which would be required for CAPBK-driven bootstrap.
  • Can't access the AKS Shared Image Gallery (??) using an arbitrary client; we get auth failures.

More testing required

  • service type LoadBalancer (does it work?)
  • further validation on CNI functionality

@alexeldeib
Contributor Author

/assign

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Dec 13, 2020
@CecileRobertMichon
Contributor

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Dec 17, 2020
@CecileRobertMichon added the area/managedclusters label Mar 16, 2021
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jun 14, 2021
@alexeldeib
Contributor Author

/remove-lifecycle stale
/lifecycle frozen

Still interested in this, but it's not super critical for the AzureManagedCluster/AzureManagedControlPlane functionality. I will probably do a small spike on some of the gritty details here, and then it would need a lot of incremental refactoring to make the reconcilers/scopes work as expected.

@k8s-ci-robot added the lifecycle/frozen label and removed the lifecycle/stale label Jun 15, 2021
@alexeldeib
Contributor Author

Note: we should add .status.externalManagedControlPlane to AzureManagedControlPlane (AMCP) so node deletion in the machine controller works properly:
https://github.com/kubernetes-sigs/cluster-api/blob/master/controllers/machine_controller.go#L442-L467
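
For context, the machine controller code linked above consults a status.externalManagedControlPlane boolean on the control plane object via the CAPI contract, so on the CAPZ side this would roughly mean adding a field like the following to the AMCP status type. This is an abbreviated sketch, not CAPZ's full type, and the exact field shape (bool vs. *bool) is an implementation detail:

```go
// Sketch: abbreviated AzureManagedControlPlaneStatus showing only the field
// the CAPI "externally managed control plane" contract looks for.
package example

type AzureManagedControlPlaneStatus struct {
	// ... existing status fields elided ...

	// ExternalManagedControlPlane tells the Cluster API machine controller
	// that the control plane is managed by an external service (AKS here),
	// which changes how it handles Node deletion for machines in this cluster.
	// +optional
	ExternalManagedControlPlane *bool `json:"externalManagedControlPlane,omitempty"`
}
```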

@bridgetkromhout
Contributor

Related: kubernetes/kubernetes#112313

@CecileRobertMichon
Contributor

This is no longer blocked now that #3861 is done.

@dtzar moved this to In Progress in CAPZ Planning Sep 18, 2023
@dtzar added this to the v1.12 milestone Sep 18, 2023
@dtzar added the priority/important-soon and size/XXL labels Oct 31, 2023
@github-project-automation bot moved this from In Progress to Done in CAPZ Planning Nov 4, 2023
7 participants