
MCM Out-Of-Tree Extensibility #178

Closed
amshuman-kr opened this issue Oct 10, 2018 · 20 comments · Fixed by #460
Labels
area/open-source Open Source (community, enablement, contributions, conferences, CNCF, etc.) related component/mcm Machine Controller Manager (including Node Problem Detector, Cluster Auto Scaler, etc.) effort/2m Effort for issue is around 2 months kind/epic Large multi-story topic kind/roadmap Roadmap BLI lifecycle/stale Nobody worked on this for 6 months (will further age) platform/all status/under-investigation Issue is under investigation topology/seed Affects Seed clusters

Comments

amshuman-kr commented Oct 10, 2018

Problem

There are at least four possible approaches for out-of-tree drivers as documented in this proposal.

  1. MCM as GRPC server and the driver as GRPC client, using bidirectional streaming over long-running GRPC connections (established by the driver during registration) for communication.
  2. MCM as GRPC client and the driver as GRPC server, using regular non-streaming GRPC calls.
  3. Drivers as webhooks, with MCM calling the driver webhook via a REST API.
  4. Drivers as controllers.

We have implemented option 1 already as a PoC. Now we need to discuss and decide on the right approach.

Arguments

MCM as GRPC server and driver as GRPC client

  • Pros
    • Already implemented.
    • Can work even with SecretRef and CloudConfig not standardised in the MachineClass spec.
    • Enables caching of MachineClass, Secret and CloudConfig if necessary.
    • Arguably more secure: only MCM needs to expose a GRPC endpoint.
    • Drivers can work as sidecar containers or as sidecar deployments.
    • Cluster administrator can control drivers' access to kube-apiserver and if necessary block it completely.
  • Cons
    • Complex and potentially confusing design involving a streaming API.
    • Implementing the driver could be complex as it involves handling the streaming messages.
  • Mitigation
    • The complexity can be mitigated by writing a GRPC-based call-back framework that driver implementations can use, so that they can concentrate on the cloud-provider side without having to worry about GRPC and MCM internals (see the sketch below).
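
To make the streaming model more concrete, here is a minimal sketch of the call-back framework idea, in Go. Everything here is illustrative, not the actual PoC code: a channel stands in for the long-running bidirectional gRPC stream, and the `Provider` interface, `op` message and `serve` loop are invented names. The point is that the framework owns the stream while the provider implements only cloud-specific callbacks.

```go
package main

import (
	"context"
	"fmt"
)

// Provider is the only surface a cloud-specific driver implementation
// would need to supply; the call-back framework owns the gRPC stream
// and dispatches incoming operations to these methods.
type Provider interface {
	CreateMachine(ctx context.Context, name string, providerSpec []byte) (providerID string, err error)
	DeleteMachine(ctx context.Context, providerID string) error
}

// op stands in for a message MCM would push down the long-running
// bidirectional stream after the driver registers.
type op struct {
	kind       string // "create" or "delete"
	name       string
	spec       []byte
	providerID string
}

// serve drains operations from the stream (a channel simulates it here)
// and invokes the provider, keeping all gRPC handling out of provider code.
func serve(ctx context.Context, stream <-chan op, p Provider) {
	for o := range stream {
		switch o.kind {
		case "create":
			id, err := p.CreateMachine(ctx, o.name, o.spec)
			fmt.Println("create result:", id, err)
		case "delete":
			fmt.Println("delete result:", p.DeleteMachine(ctx, o.providerID))
		}
	}
}

// fakeProvider shows how little a provider needs to implement.
type fakeProvider struct{}

func (fakeProvider) CreateMachine(_ context.Context, name string, _ []byte) (string, error) {
	return "fake:///" + name, nil
}

func (fakeProvider) DeleteMachine(_ context.Context, _ string) error { return nil }

func main() {
	stream := make(chan op, 1)
	stream <- op{kind: "create", name: "machine-0"}
	close(stream)
	serve(context.Background(), stream, fakeProvider{})
}
```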

MCM as GRPC client and the driver as GRPC server

  • Pros
    • Simpler design involving non-streaming regular GRPC calls.
    • This approach is broadly similar to the approach taken in CSI and similar specifications.
    • Drivers can work as sidecar containers or as sidecar deployments.
    • Cluster administrator can control drivers' access to kube-apiserver and if necessary block it completely.
  • Cons
    • There are subtle differences between the requirements for out-of-tree machine drivers and those for CSI, e.g. multi-tenancy.
    • Requires that SecretRef and CloudConfig are standardised in the MachineClass spec.
    • Caching of MachineClass, Secret and CloudConfig would be hard.
    • Arguably less secure: drivers need to expose a GRPC endpoint.
    • Implementing a driver would still be complex as it involves implementing a secure GRPC server.
  • Mitigation
    • Complexity can be mitigated by providing a GRPC-based call-back framework similar to option 1, so that driver implementations can concentrate on the cloud-provider side without having to worry about GRPC and MCM internals.
    • The difficulty in enabling caching and flexibility without SecretRef and CloudConfig standardised in the MachineClass can be mitigated by making MCM a GRPC server as well (in addition to the drivers acting as GRPC servers), exposing an API to fetch MachineClass, Secret and CloudConfig. (A sketch of the unary driver surface follows below.)
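
For contrast, here is a sketch of the unary, CSI-style surface that option 2 implies. The request/response shapes and the `DriverServer` interface are illustrative stand-ins, not the actual machine-spec API. The notable property is that secrets and the provider config must be inlined into each request, which is why SecretRef and CloudConfig need to be standardised in the MachineClass.

```go
package driver

import "context"

// CreateMachineRequest carries everything the driver needs inline, since
// the driver cannot (and should not) read the MachineClass or Secret itself.
type CreateMachineRequest struct {
	Name         string
	ProviderSpec []byte            // opaque provider section of the MachineClass
	Secrets      map[string][]byte // resolved by MCM from the SecretRef
}

type CreateMachineResponse struct {
	ProviderID string
	NodeName   string
}

type DeleteMachineRequest struct {
	Name    string
	Secrets map[string][]byte
}

type DeleteMachineResponse struct{}

// DriverServer is the interface a provider's gRPC server would satisfy;
// MCM acts as the client and issues plain unary calls against it.
type DriverServer interface {
	CreateMachine(context.Context, *CreateMachineRequest) (*CreateMachineResponse, error)
	DeleteMachine(context.Context, *DeleteMachineRequest) (*DeleteMachineResponse, error)
}
```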

Drivers as webhooks

  • Pros
    • Simpler design, and a REST API has a lower barrier to entry than GRPC.
    • Cluster administrator can control drivers' access to kube-apiserver and if necessary block it completely. But this would have some implications on the mitigations below.
  • Cons
    • Requires that SecretRef and CloudConfig are standardised in the MachineClass spec.
    • Caching of MachineClass, Secret and CloudConfig would be practically impossible.
  • Mitigation
    • The only possible mitigation for not having SecretRef and CloudConfig standardised in the MachineClass would be to make the driver fetch these from the kube-apiserver directly. But that would not only break the webhook pattern but also defeat the very purpose of having a driver framework; in that case, the fourth option might be better suited. (A sketch of a webhook-style driver follows below.)
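
A sketch of what option 3 could look like on the driver side: a plain HTTP handler that MCM would POST machine operations to. The endpoint path and payload are invented for illustration; note that, as the cons above say, everything the driver needs must arrive in the payload, since the webhook neither caches nor fetches objects itself.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// createReq is an illustrative webhook payload; MCM would have to resolve
// SecretRef and CloudConfig and inline them, since the webhook has no
// kube-apiserver access of its own.
type createReq struct {
	Name         string            `json:"name"`
	ProviderSpec json.RawMessage   `json:"providerSpec"`
	Secrets      map[string]string `json:"secrets"`
}

func main() {
	http.HandleFunc("/create-machine", func(w http.ResponseWriter, r *http.Request) {
		var req createReq
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		// ... call the cloud-provider API here ...
		_ = json.NewEncoder(w).Encode(map[string]string{"providerID": "provider:///" + req.Name})
	})
	// TLS setup elided; a real webhook would use ListenAndServeTLS.
	log.Fatal(http.ListenAndServe(":8443", nil))
}
```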

Drivers as controllers

  • Pros
    • Already adopted in the cluster-api community.
    • Common design-pattern in Kubernetes and does not introduce new patterns/frameworks.
    • Cluster administrator can control drivers' access to kube-apiserver but not block it completely.
    • Support from cloud providers, like the AWS Service Operator, makes this approach more attractive.
  • Cons
    • Writing a controller is complex.
    • Some access to kube-apiserver would be required from the drivers (read access to Machines, MachineClasses, Secrets and CloudConfigs and write access to the Status sub-resource of Machines).
    • The MachineController (at least the cloud-provider-specific parts) would have to be separated out of the MCM image.
  • Mitigation
    • The complexity of writing a controller can be mitigated by writing a control-loop-based call-back framework similar to option 1, so that driver implementations can concentrate on the cloud-provider side instead of the complexity of writing a controller (see the reconcile sketch below).
    • The need to have some minimal access to kube-apiserver from the drivers cannot be mitigated.
    • The MachineController (at least its cloud-provider-specific parts) can be split out into its own image, separate from the rest of MCM.
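
Since option 4 is the approach the issue ultimately settles on (see the resolution further below), here is a minimal sketch of the reconcile shape such a driver controller would have. The `Machine` and `Provider` types are pared-down, invented stand-ins, not MCM's actual API; a real controller would be driven by watches/informers on Machine objects and would persist `ProviderID` via the Machine status subresource.

```go
package controller

import "context"

// Machine is a pared-down stand-in for the real Machine resource.
type Machine struct {
	Name              string
	DeletionTimestamp *int64 // non-nil means the object is being deleted
	Spec              struct{ ProviderSpec []byte }
	Status            struct{ ProviderID string }
}

// Provider is the cloud-specific surface; the call-back framework from
// the mitigation above would let providers implement only this.
type Provider interface {
	CreateMachine(ctx context.Context, name string, spec []byte) (string, error)
	DeleteMachine(ctx context.Context, providerID string) error
}

// reconcile drives actual state toward desired state — the common
// Kubernetes controller pattern. A real controller would be triggered
// by a watch on Machines and would write Status back to the apiserver.
func reconcile(ctx context.Context, m *Machine, p Provider) error {
	if m.DeletionTimestamp != nil {
		return p.DeleteMachine(ctx, m.Status.ProviderID)
	}
	if m.Status.ProviderID == "" {
		id, err := p.CreateMachine(ctx, m.Name, m.Spec.ProviderSpec)
		if err != nil {
			return err
		}
		m.Status.ProviderID = id
	}
	return nil
}
```
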
amshuman-kr (Author) commented Oct 10, 2018

@ggaurav10 You can copy this content into your proposal if you want ;-)

ggaurav10 (Contributor) commented

Thanks for summarising, @amshuman-kr.
I have added a link to this in the doc. PR: #169

amshuman-kr (Author) commented

Current status:

  1. MCM as GRPC server and driver as GRPC client - already implemented here by @ggaurav10, @prashanth26 and @amshuman-kr.
  2. MCM as GRPC client and the driver as GRPC server - @hardikdr intends to try this approach to see how this compares with option 1 above.

amshuman-kr (Author) commented Oct 11, 2018

FWIW "Drivers as controllers" (option 4) makes the most sense.

@prashanth26 prashanth26 added kind/epic Large multi-story topic area/open-source Open Source (community, enablement, contributions, conferences, CNCF, etc.) related component/machine-controller-manager platform/all size/l Size of pull request is large (see gardener-robot robot/bots/size.py) topology/seed Affects Seed clusters status/under-investigation Issue is under investigation labels Oct 16, 2018
afritzler (Member) commented

Was there a decision on which option to follow here? I am currently working on a PoC for the OpenStack out-of-tree implementation (similar to the one we have for AWS in the Gardener org) and was wondering whether I should take this any further.

prashanth26 (Contributor) commented

The decision was to implement option 2 - MCM as GRPC client and the driver as GRPC server (the CSI approach). @hardikdr has been working on this. He would be the right person to answer your query.

similar to the one we have for AWS in the Gardener org

Which repo do you refer to by this? This one - https://github.com/gardener/machine-controller-manager-provider-aws?

afritzler (Member) commented

Exactly this one.

prashanth26 (Contributor) commented

I think this might be for the first (older) implementation. I think Hardik is working on a different approach. I would suggest you hold off on implementing further.

afritzler (Member) commented

Ok, got you! Thx for the update!

hardikdr (Member) commented

Sorry for the late response.
Basically, as Prashanth mentioned, we may have drivers behaving as gRPC servers and MCM being the client - I did a quick PoC here: https://github.com/hardikdr/mcm-drivers

  • Need to test and improve further though.

afritzler (Member) commented

Thanks for the update! I basically took the gRPC client version of the AWS implementation and did the OpenStack port on my flight back from KubeCon. I will hold off on publishing it then and wait until we have a final agreement on how to proceed here.

@prashanth26 prashanth26 changed the title Finalize the approach for out-of-tree drivers Support for out-of-tree drivers Jan 14, 2019
@prashanth26 prashanth26 self-assigned this Jan 14, 2019
prashanth26 (Contributor) commented Jan 14, 2019

Solution

The solution decided on is option 2 above - MCM as GRPC client and the driver as GRPC server.

Acceptance Criteria

  • A new set of APIs (aligned with cluster API) to be decided on
  • Splitting of binaries into MCM & Drivers
  • Implemented out-of-tree for existing providers
    • AWS
      • Implementation
      • Validation
      • Unit Tests
    • Azure
      • Implementation
      • Validation
      • Unit Tests
    • GCP
      • Implementation
      • Validation
      • Unit Tests
    • Openstack
      • Implementation
      • Validation
      • Unit Tests
    • Alicloud
      • Implementation
      • Validation
      • Unit Tests
    • Packet
      • Implementation
      • Validation
      • Unit Tests
  • Refactoring and release
    • Create CI script for machine-spec
    • New Release for machine-spec
    • Unit tests for AWS
    • Create CI script for MCM-AWS
    • Rebase MCM/cmi-client with master.
    • Refactor unit tests for MCM
    • Refactor integration tests for MCM
    • Refactor CI script for new MCM
    • New release for machine-spec - 0.3.0
      • Adopt changes to reflect new CSI spec
    • New Release for machine-controller-manager-provider-aws
      • Align with new machine-spec 0.3.0 APIs
    • Release MCM - 0.22.0
      • Merge/Take in any pending required changes
      • Release - 0.22 will be the stable branch for stable development of Gardener until this is merged.
    • New MCM release - 0.23.0 Issue #230.
  • Migration Plan
    • Try out how the adoption plan works with the new APIs
    • Document the proposed migration
  • Adapt Gardener to align with the new APIs

(Update: the approach was later changed to option 4 - see below.)

@gardener-robot-ci-1 gardener-robot-ci-1 added lifecycle/stale Nobody worked on this for 6 months (will further age) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Apr 13, 2019
@PadmaB PadmaB added this to the 1905a milestone May 13, 2019
@PadmaB PadmaB removed this from the 1905a milestone Jun 10, 2019
prashanth26 (Contributor) commented

Hi everyone,

After several rounds of discussion on the approach to be taken for OOT, we have decided to rework the existing OOT implementation and go with the controller-based approach (4). The main reason is that the controller approach provides more flexibility than the current gRPC approach that is in the works. We shall amend the OOT implementation to accommodate this controller-based implementation in the coming weeks. I shall update this issue with a PR to illustrate the changes.

prashanth26 (Contributor) commented May 26, 2020

Solution

The solution decided on is option-4 above

Acceptance Criteria

  • A new set of APIs (aligned with cluster API) to be decided on
  • Splitting of binaries into MCM & Drivers
  • Release MCM with OOT support (0.5 week)
    • Add libraries for supporting OOT providers
    • Align with new APIs
    • Documentation for adding new provider support
    • Merge MCM/splitting-controllers2 PR #460 with master.
    • New MCM release - 0.30.0 Issue #230
  • Migration (1 week)
    • Will try to update documentation with the plan here
  • Implemented out-of-tree for existing providers and integrate into Gardener (Each might take about 1 week, depending on how we can parallelize the efforts)
    • Vsphere
      • Implementation
      • Validation
      • Unit Tests
      • Refactor
      • Integrate with Gardener
    • AWS
      • Implementation
      • Refactor
      • Validation
      • Refactor unit Tests
      • Integrate with Gardener
    • GCP
      • Implementation
      • Validation
      • Unit Tests
      • Refactor
      • Integrate with Gardener
    • Alicloud
      • Implementation
      • Validation
      • Unit Tests
      • Refactor
      • Integrate with Gardener
    • Azure
      • Implementation
      • Validation
      • Unit Tests
      • Integrate with Gardener
    • Openstack
      • Implementation
      • Validation
      • Unit Tests
      • Integrate with Gardener
    • Packet
      • Implementation
      • Validation
      • Unit Tests
      • Integrate with Gardener
  • Other minor refactorings (1w)
    • Integration tests for MCM to support OOT providers
    • Drain unit tests
    • Metrics APIs on Gardener

Definition of Done

  • Knowledge is distributed: Have you spread your knowledge in pair programming/code review?
  • Unit Tests are provided: Have you written automated unit tests or added manual NGPTT tickets?
  • Integration Tests are provided: Have you written automated integration tests?
  • Minimum API exposure: If you have added public API, was it really necessary/is it minimal?
  • Operations guide: Have you updated the operations guide?

gardener-robot commented

@prashanth26 You have mentioned internal references in the public. Please check.


@vlerenc vlerenc added this to the 2020-Q4 milestone Sep 24, 2020
@vlerenc vlerenc changed the title Support for out-of-tree drivers MCM Out-Of-Tree Extensibility Sep 24, 2020
hoeltcl commented Oct 1, 2020

@vlerenc Why is it closed?

prashanth26 (Contributor) commented Oct 1, 2020

You are right, @hoeltcl. This epic is still ongoing. It gets closed every time we merge a PR related to it.
/reopen

@gardener-robot gardener-robot reopened this Oct 1, 2020
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Dec 18, 2020
@vlerenc vlerenc modified the milestones: 2020-Q4, 2021-Q1 Mar 5, 2021
@gardener-robot gardener-robot added effort/2m Effort for issue is around 2 months and removed priority/normal size/l Size of pull request is large (see gardener-robot robot/bots/size.py) labels Mar 8, 2021
prashanth26 (Contributor) commented Jun 14, 2021

/close
The migration to OOT providers is complete.

However, there are small cleanup tasks left over, like:

  1. Move drain controller to a separate controller - Move drain logic into a separate controller #621
  2. Get rid of the deprecated in-tree machine controller - Remove deprecated in-tree code #622
  3. Adopt the integration tests - Integration test framework for MCM #216
  4. Improve metrics handling - Improve Monitoring/Alerting/Metrics #211
