
CAPZ should use Out of Tree cloud-controller-manager and Storage Drivers #715

Closed
jseely opened this issue Jun 18, 2020 · 22 comments · Fixed by #3105

@jseely
Contributor

jseely commented Jun 18, 2020

⚠️ Cluster API Azure maintainers can ask to turn an issue-proposal into a CAEP when necessary. This is to be expected for large changes that impact multiple components, breaking changes, or new large features.

Dependencies

  1. Cluster ResourceSet needs to be implemented to properly support this

Goals

  1. CAPZ clusters should be deployable using the OOT Azure Provider and Storage Drivers
  2. Default should be OOT
  3. Tests need to be updated to test both modes and the migration scenario from in-tree to OOT

Non-Goals/Future Work

  1. Implement Cluster ResourceSet

User Story

As an operator, I would like to separate the cloud provider integration from the Kubernetes binaries and use the newer Storage Drivers and cloud-provider-azure.

Detailed Description

In 2018/2019 Kubernetes started to externalize interactions with the underlying cloud provider to slow down the growth in size of Kubernetes binaries and to decouple the lifecycle and development of Kubernetes from that of the individual cloud provider integrations.
https://kubernetes.io/blog/2019/04/17/the-future-of-cloud-providers-in-kubernetes/

/kind proposal

@k8s-ci-robot k8s-ci-robot added the kind/proposal Issues or PRs related to proposals. label Jun 18, 2020
@jseely
Contributor Author

jseely commented Jun 18, 2020

@alexeldeib
Contributor

Have you already seen the doc and template? It might help to distinguish this issue from what's already possible by adding some additional details. ClusterResourceSet is one approach to automating this, but I see you've listed that as a non-goal (and a dependency)?

@CecileRobertMichon
Contributor

CecileRobertMichon commented Jun 18, 2020

The 2nd goal "Default should be OOT" is something we're not necessarily ready for. I think for now we want to support optionally using OOT (without any manual steps, possibly using ClusterResourceSet), but I don't think we'll want to make it the default right away, to stay aligned with other Azure provisioning tools. cc @feiskyer @ritazh

See kubernetes/enhancements#667 for current Azure OOT provider status

@CecileRobertMichon CecileRobertMichon added this to the next milestone Jul 10, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 8, 2020
@CecileRobertMichon
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 3, 2020
@CecileRobertMichon
Contributor

/priority important-longterm

@k8s-ci-robot k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Nov 3, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 2, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 4, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@CecileRobertMichon CecileRobertMichon removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Apr 5, 2021
@CecileRobertMichon
Contributor

CecileRobertMichon commented Apr 5, 2021

/lifecycle frozen

status update:

  • Cluster ResourceSet needs to be implemented to properly support this: done
  • CAPZ clusters should be deployable using OOT Azure Provider and Storage Drivers: done
  • Default should be OOT: hold until OOT is fully ready
  • Tests need to be updated to test both modes and the migration scenario from in-tree to OOT: added tests for OOT (already testing in-tree); not yet testing migration
  • Implement Cluster ResourceSet: done in #1216

@CecileRobertMichon
Contributor

Default should be OOT

Now that v1.0.0 has been released, we should be able to move forward with this

@CecileRobertMichon
Contributor

/assign

@shysank
Contributor

shysank commented Mar 29, 2022

cc @sonasingh46

@sonasingh46
Contributor

sonasingh46 commented Mar 30, 2022

I have been trying to validate this manually, especially around the Kubernetes 1.22 → 1.23 upgrade paths.
The following in-tree components for Azure are the points of attention:

  • AzureDisk CSI driver
  • AzureFile CSI driver
  • cloud-provider-azure

As part of the effort to extract the cloud provider dependency from Kubernetes, the cloud-provider-dependent code is moving out of the Kubernetes tree. As a result, the in-tree CSI drivers and cloud providers are moving out of the Kubernetes code base.

From Kubernetes 1.23, AzureDisk CSI migration is enabled by default. This means that to provision a volume with the Azure Disk driver, the external AzureDisk CSI driver must be installed, because the in-tree driver no longer works once migration is enabled.

The in-tree AzureFile driver will continue to work in 1.23 because AzureFile CSI migration is not enabled by default there. If AzureFile CSI migration is enabled by the user/admin, then the external AzureFile CSI driver needs to be installed.
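To make the distinction concrete, the sketch below shows what a StorageClass backed by the external Azure Disk CSI driver might look like; the key difference from the in-tree plugin is the provisioner name (disk.csi.azure.com instead of kubernetes.io/azure-disk). The class name and parameters here are only illustrative assumptions.

```yaml
# Example only: a StorageClass using the external Azure Disk CSI driver.
# The in-tree equivalent would use provisioner: kubernetes.io/azure-disk.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-csi               # hypothetical name
provisioner: disk.csi.azure.com   # external driver; requires the driver to be installed
parameters:
  skuName: StandardSSD_LRS        # example disk SKU
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```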

Consider the following upgrade paths from v1.22 to v1.23:

Scenario 1: Upgrade a cluster from Kubernetes 1.22 to 1.23 without any extra tuning or configuration

  • AzureDisk CSI migration is enabled by default on the upgraded cluster.
  • The external AzureDisk CSI driver must be installed so that pods using existing volumes created by the in-tree driver on 1.22 continue to work on the upgraded cluster.
  • To create new volumes, the external AzureDisk CSI driver must likewise be installed. One way of installing it is via a ClusterResourceSet (see the sketch after this list).
  • AzureFile CSI migration is disabled by default.
  • Existing volumes created by the in-tree AzureFile driver on 1.22 will continue to work on the upgraded cluster.
  • New Azure File volumes can be created without installing any external driver.
  • The in-tree CCM is enabled by default.
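As a rough sketch of the CRS approach mentioned above (names, labels, and apiVersion are assumptions and depend on the Cluster API release in use): the azuredisk-csi-driver manifests are packaged into a ConfigMap on the management cluster, and a ClusterResourceSet applies them to every workload cluster matching the selector.

```yaml
# Sketch only: apply the azuredisk-csi-driver manifests (stored beforehand in the
# hypothetical azuredisk-csi-driver-manifests ConfigMap) to matching clusters.
apiVersion: addons.cluster.x-k8s.io/v1beta1
kind: ClusterResourceSet
metadata:
  name: azuredisk-csi-driver
  namespace: default
spec:
  strategy: ApplyOnce
  clusterSelector:
    matchLabels:
      azuredisk-csi: "true"       # hypothetical label set on the workload Cluster
  resources:
    - kind: ConfigMap
      name: azuredisk-csi-driver-manifests
```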

Scenario 2: Upgrade a cluster from Kubernetes 1.22 to 1.23 with AzureDisk CSI migration disabled

  • AzureDisk CSI migration will be disabled on the upgraded cluster (see the feature-gate sketch after this list).
  • Existing volumes created by the in-tree AzureDisk driver on 1.22 will continue to work on the upgraded cluster via the in-tree driver.
  • New Azure Disk volumes can be created without installing any external driver.
  • AzureFile CSI migration is disabled by default.
  • Existing volumes created by the in-tree AzureFile driver on 1.22 will continue to work on the upgraded cluster via the in-tree driver.
  • New Azure File volumes can be created without installing any external driver.
  • The in-tree CCM is enabled by default.
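For reference, disabling the migration as in this scenario would roughly mean setting the CSIMigrationAzureDisk feature gate to false on the controller manager and kubelets. A trimmed, hypothetical KubeadmControlPlane excerpt (not a complete spec):

```yaml
# Sketch only: keep the in-tree Azure Disk plugin active on a 1.23 cluster by
# turning the CSIMigrationAzureDisk feature gate off.
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
spec:
  kubeadmConfigSpec:
    clusterConfiguration:
      controllerManager:
        extraArgs:
          feature-gates: CSIMigrationAzureDisk=false
    initConfiguration:
      nodeRegistration:
        kubeletExtraArgs:
          feature-gates: CSIMigrationAzureDisk=false
    joinConfiguration:
      nodeRegistration:
        kubeletExtraArgs:
          feature-gates: CSIMigrationAzureDisk=false
```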

PS: Still validating other scenarios

@sonasingh46
Contributor

sonasingh46 commented Mar 30, 2022

Scenario 3: Upgrade a cluster from Kubernetes 1.22 to 1.23 with the external cloud provider enabled

  • The upgrade failed. The new control plane machine did not pass the preflight checks: the readiness and startup probes failed for the control plane components on the new control plane machine that came up.

  • To fix this, we may need to enable the external volume plugin as well (WIP). A rough sketch of the attempted configuration is below.
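For context, the attempted change amounts to roughly the following (a hypothetical, trimmed KubeadmControlPlane excerpt; the out-of-tree cloud-provider-azure components and CSI drivers still need to be installed separately, e.g. via a ClusterResourceSet):

```yaml
# Sketch only: point the controller manager and kubelets at the external
# (out-of-tree) cloud provider instead of the in-tree Azure provider.
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
spec:
  kubeadmConfigSpec:
    clusterConfiguration:
      controllerManager:
        extraArgs:
          cloud-provider: external
    initConfiguration:
      nodeRegistration:
        kubeletExtraArgs:
          cloud-provider: external
    joinConfiguration:
      nodeRegistration:
        kubeletExtraArgs:
          cloud-provider: external
```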

@mboersma
Contributor

@jackfrancis and @Jont828, is this something that should land in milestone v1.5, or will it probably hit the next one?

@Jont828
Contributor

Jont828 commented Jul 21, 2022

I'm not too sure; is there a PR open or being worked on for this at the moment? It looks like Jack was assigned to it, so maybe we can ask him when he's back.

@jackfrancis
Contributor

I think we can land this in the next milestone

@mboersma
Contributor

/milestone next

@CecileRobertMichon
Contributor

/assign
/milestone v1.8
