Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: aks provider #482

Merged
merged 8 commits into from
May 14, 2020
Merged

Conversation

alexeldeib
Copy link
Contributor

@alexeldeib alexeldeib commented Apr 2, 2020

What this PR does / why we need it:
demo for a managed cluster with machine pools. Still a bit rough around the edges but basic scenarios (create, add pool, delete pool, scale) work well. Sharing for some feedback, finally got around to trying the approach from here: kubernetes-sigs/cluster-api#2045 (comment)

xref:
#337
kubernetes-sigs/cluster-api#2045
kubernetes-sigs/cluster-api#980

fyi @vincepri @detiber @jsturtevant for some 👀 to start

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):

Special notes for your reviewer:

To test this with Tilt, you can use the normal Azure config + this change in CAPI upstream:

"extra_args": {
    "core": ["--feature-gates=MachinePool=true", "--feature-gates=AKS=true"],
    "kubeadm-bootstrap": ["--feature-gates=MachinePool=true", "--feature-gates=AKS=true"],
    "azure": ["--feature-gates=MachinePool=true", "--feature-gates=AKS=true"]
}
kind create cluster
cd$(GOPATH)/src/sigs.k8s.io/cluster-api 
tilt up

# new terminal
cd$(GOPATH)/src/sigs.k8s.io/cluster-api-provider-azure
make create-aks-cluster

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

Release note:

Adds an AKS provider, implemented via AzureManagedControlPlane and AzureManagedMachinePool custom resources.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 2, 2020
@k8s-ci-robot k8s-ci-robot added area/provider/azure Issues or PRs related to azure provider sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Apr 2, 2020
@alexeldeib alexeldeib force-pushed the ace/caks branch 2 times, most recently from 76eb8fa to 3c4d4e3 Compare April 2, 2020 10:15
@alexeldeib alexeldeib force-pushed the ace/caks branch 2 times, most recently from 9952b5c to f8412eb Compare April 2, 2020 10:29
@alexeldeib
Copy link
Contributor Author

/retest

@alexeldeib alexeldeib force-pushed the ace/caks branch 2 times, most recently from 2e136ab to e52ad2b Compare April 2, 2020 10:42
@CecileRobertMichon
Copy link
Contributor

I think we should consider moving this to an /exp folder since it is a POC

@alexeldeib alexeldeib changed the title [wip] feat: poc aks provider [wip] feat: aks provider Apr 3, 2020
@alexeldeib
Copy link
Contributor Author

alexeldeib commented Apr 3, 2020

So I changed to title to reflect that this isn't really a proof of concept, since I'd attempted this once already independently: https://github.com/alexeldeib/caks/blob/ace/private/controllers/azuremanagedmachine_controller.go

I'd consider this PR the "real" attempt to clean everything up and get it usable, especially with machine pool and my learnings from the initial cut.

Summary

Briefly summarizing approach + remaining work for what i'd consider a mergable cut.

This PR introduces two new types, AzureManagedCluster and AzureManagedMachinePool, and their associated controllers. AzureManagedCluster maps fairly directly to the specification for an AKS cluster. For generic Kubernetes configuration, the CAPI Cluster object contains some of the required information, AzureManagedCluster only models the set of options specific to AKS bootstrapping.

AKS requires a default machine pool. We model this as part of the AzureMangedClusterSpec by adding a reference to an AzureManagedMachinePool in the same namespace. We require the user to explicitly define this node pool the same as any other (TODO: consider defaulting).

CAPI machine pools also map nicely into AKS agent pools, so nothing super surprising there. AKS still lets us see the underlying MC resources, and we record all the resulting VMSS nodes into the AzureManagedMachinePool.

This PR does not introduce any defaulting or validation. defaulting + update validation in particular might be something to consider in the future.

Remaining Work

  • fix pod/service cidr plumbing
  • ensure all AKS configuration knobs exposed through AzureManagedCluster actually get plumbed through to the AKS cluster
    • n.b.: I'm explicitly not saying the immediate goal is to support every permutation of AKS cluster. I'm starting from a fairly stripped down / blessed config.
  • testing in some shape/form
  • handle default pool deletion (aka, you can't do this)
  • investigate webhook authentication failures?

@detiber
Copy link
Member

detiber commented Apr 6, 2020

@alexeldeib I didn't dive too deep into the implementation yet, but I do wonder if it would provide a better user experience (or at least similar upgrade experience) if this also made use of a control plane provider as well instead of just AzureManagedCluster and AzureManagedMachinePool. I'm not sure if this would require moving the majority of the AzureManagedCluster implementation to an AzureManagedControlPlane and implementing a mostly noop AzureManagedCluster resource, or if it would require some more fine grained separation.

@alexeldeib
Copy link
Contributor Author

@detiber seems reasonable to me. I can shuffle things around and see if I hit any issues.

@alexeldeib alexeldeib force-pushed the ace/caks branch 2 times, most recently from 61971ec to b8244a6 Compare April 18, 2020 21:25
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Apr 23, 2020
@alexeldeib
Copy link
Contributor Author

@devigned @CecileRobertMichon I think this works fine as a control plane provider now, and I've cleaned up the CIDR issue. Going to work on testing and flesh out the AKS options, maybe validation on various edge cases after that. Would appreciate any glaring feedback you see, I know it's a big PR already.

@alexeldeib alexeldeib force-pushed the ace/caks branch 2 times, most recently from b0d1600 to 452d2f3 Compare May 7, 2020 01:22
Copy link
Member

@vincepri vincepri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed first half of all the changes, mostly nits :)

cloud/interfaces.go Show resolved Hide resolved
cloud/scope/managedcontrolplane.go Outdated Show resolved Hide resolved
return errors.New("expected agent pool specification")
}

if err == nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we missing the generic err != nil or we don't care about it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following paths are a little hard to read, maybe use a case switch with some comments on what the logic is doing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch, I originally left it off because we wouldn't care about transient errors but I should have added back when I differentiated between Create and Update.

I'll also add a doc comment that the reason for differentiating the paths is that you can't use the ManagedClusters API to create agent pools on update, only at creation time.

cloud/services/agentpools/agentpools.go Outdated Show resolved Hide resolved
cloud/services/agentpools/agentpools.go Show resolved Hide resolved
cloud/services/managedclusters/managedclusters.go Outdated Show resolved Hide resolved
Comment on lines +102 to +106
func (s *Service) Reconcile(ctx context.Context, spec interface{}) error {
managedClusterSpec, ok := spec.(*Spec)
if !ok {
return errors.New("expected managed cluster specification")
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, re:avoiding interface

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯 this is vestigial and we are working on a design proposal to change this across the entire project. At this point, it's followed to be consistent with the rest of the repo.

@CecileRobertMichon or @justaugustus may have more background.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep right now interfaces are in a bunch of places, let's stay consistent for now - currently working on a proposal to refactor all of them as David said.

Comment on lines +151 to +157
// HACK: set the last octet of the IP to .10
// This ensures the dns IP is valid in the service cidr without forcing the user
// to specify it in both the Capi cluster and the Azure control plane.
// https://golang.org/src/net/ip.go#L48
ip[15] = byte(10)
dnsIP := ip.String()
properties.NetworkProfile.DNSServiceIP = &dnsIP
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤨

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Open to suggestions 🙂

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

n.b., consider how defaulting should work if we allow but don't require both dns and service IP as separate values (or if that's valuable).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be something that we might move into a defaulting webhook? And probably some docs I guess

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I was mostly trying to scope down this PR as I keep finding things to add 🙂

} else if strings.EqualFold(*managedClusterSpec.NetworkPolicy, "Calico") {
properties.NetworkProfile.NetworkPolicy = containerservice.NetworkPolicyCalico
} else {
return fmt.Errorf("invalid network policy: '%s'. Allowed options are 'calico' and 'azure'", *managedClusterSpec.NetworkPolicy)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be in a validation webhook or kubebuilder enum?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably, I'm marking webhooks as work for a followup because this is already a large PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll create issues when this actually merges // make a list in the PR

Copy link
Contributor Author

@alexeldeib alexeldeib May 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did do the Kubebuilder enums, btw, do we currently support clusters without structural schema?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you mean?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added these

// +kubebuilder:validation:Enum=Calico;Azure

Ignore the comment about structural schema, I was confusing that with trivialVersions conversion in Kubebuilder.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect, that should do it!

cloud/services/managedclusters/managedclusters.go Outdated Show resolved Hide resolved
Co-authored-by: Vince Prignano <[email protected]>
Signed-off-by: Alexander Eldeib <[email protected]>
@CecileRobertMichon
Copy link
Contributor

@alexeldeib is this ready to go?

@alexeldeib
Copy link
Contributor Author

alexeldeib commented May 12, 2020

I think so, I'll start breaking things out into separate issues:

@CecileRobertMichon
Copy link
Contributor

@alexeldeib please open the follow up issues and link them to the PR. I'll do a last round review in the meantime. Thanks!

Copy link
Contributor

@CecileRobertMichon CecileRobertMichon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve
/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 14, 2020
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 14, 2020
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alexeldeib, CecileRobertMichon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 14, 2020
@k8s-ci-robot k8s-ci-robot merged commit 4b04993 into kubernetes-sigs:master May 14, 2020
@k8s-ci-robot k8s-ci-robot added this to the v0.5 milestone May 14, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/provider/azure Issues or PRs related to azure provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants