
Continuous ClusterResourceSetStrategy #4807

Closed
Promaethius opened this issue Jun 10, 2021 · 22 comments · Fixed by #7497
Labels
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • lifecycle/active: Indicates that an issue or PR is actively being worked on by a contributor.
  • triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@Promaethius

Promaethius commented Jun 10, 2021

User Story

ClusterResourceSets provide a unique Kubernetes experience where a Cluster and its workload can be defined in a single .yaml file, templated out in CI/CD pipelines, or controlled by centralized management infrastructure. However, ApplyOnce falls short when objects depend on each other or when application definitions change. Adding a ContinuousApply option opens new strategies for reconciliation, dependency trees, and CI/CD pipelines.

Detailed Description

mode: ContinuousApply performs a hash check for the target object on an interval. If the object does not exist on the destination cluster, it is applied. (ApplyOnce already covers this case, but its reconciliation interval is fairly long; for example, applying an operator and a CRD that the operator creates can take up to 15 minutes with ApplyOnce.) If the object does exist on the destination cluster, a hash is calculated for both the source and destination objects; if they do not match, the source object is applied.
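A minimal sketch of that check, assuming hypothetical computeHash and apply helpers (the real controller would reuse the existing ClusterResourceSet apply code and controller-runtime clients):

package addons

import (
	"context"
	"crypto/sha256"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// reconcileObject applies the source object to the destination cluster if it
// is missing there, or if the hash of its definition differs from the hash of
// the object currently on the destination cluster.
func reconcileObject(ctx context.Context, dst client.Client, src *unstructured.Unstructured) error {
	existing := &unstructured.Unstructured{}
	existing.SetGroupVersionKind(src.GroupVersionKind())
	err := dst.Get(ctx, client.ObjectKeyFromObject(src), existing)
	if apierrors.IsNotFound(err) {
		// Object missing on the destination cluster: apply it now.
		return apply(ctx, dst, src)
	}
	if err != nil {
		return err
	}
	// In practice, server-populated fields (status, managedFields, ...) would
	// need to be stripped before hashing the destination object.
	if computeHash(src) != computeHash(existing) {
		return apply(ctx, dst, src)
	}
	return nil
}

// computeHash returns a stable hash of an object's JSON definition.
func computeHash(obj *unstructured.Unstructured) string {
	raw, _ := obj.MarshalJSON()
	return fmt.Sprintf("%x", sha256.Sum256(raw))
}

// apply performs a server-side apply of the object.
func apply(ctx context.Context, c client.Client, obj *unstructured.Unstructured) error {
	return c.Patch(ctx, obj, client.Apply, client.ForceOwnership, client.FieldOwner("cluster-api-crs"))
}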

Anything else you would like to add:

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jun 10, 2021
@sbueringer
Member

If I understand correctly, we should consolidate this issue and #4799

@Promaethius
Author

@sbueringer huh same day, what are the odds. Yeah basically. I'd like to work on it.

@sedefsavas

cc @gab-satchi

@Promaethius
Author

As a semi-related addition to the user story, I'd also like to add a reconciliation interval option for both ApplyOnce and ContinuousApply.

@Promaethius
Author

/assign
Going to start work on this.

@vincepri
Member

/milestone Next

@dlipovetsky
Contributor

@Promaethius I noticed you assigned yourself a few months ago. Are you still working on this? I'd like to help move this idea forward.

@g-gaston
Contributor

g-gaston commented Nov 2, 2021

I opened #5555 and we closed it in favor of this one. Copy-pasting the interesting bits here:

As mentioned in the design proposal, ClusterResourceSets only support the ApplyOnce mode. This makes it impossible to update such resources without interacting with the workload clusters directly. It makes cluster maintenance a bit more cumbersome, since objects need to be reapplied on each cluster individually, as opposed to letting cluster-api manage that complexity. It also doesn't guarantee that all workload clusters have the same version of such objects.

The vSphere provider currently uses a ClusterResourceSet "by default" for the CPI and CSI. So I believe that for vSphere production clusters, ClusterResourceSet, even if still experimental, is not necessarily a nice-to-have feature anymore, but a key component that would benefit from better lifecycle management.

Another note: I was expecting, as a workaround for this issue, that resources would be reapplied when creating a new ClusterResourceSet pointing to the same objects (since this CRS wouldn't have a ClusterResourceSetBinding). However, that doesn't happen because of how this is implemented. I believe the CAEP doesn't specify that objects won't be reapplied if a new CRS is created, but it also doesn't specify the opposite. Is this maybe something you might be open to changing while we work on a new mode?

As I said in the original issue, I'm more than happy to take this if no one else has already or help whoever is currently working on it. I'm available to start working on it right away.

@g-gaston
Contributor

This issue seemed stale, so I went ahead and wrote a rough draft of a proposal. I don't even know whether a change like this requires a design proposal, and this one is pretty barebones, but I hope it works as a starting point for a conversation.

Maybe this needs to be presented in a community meeting, but I thought it was better to post it here first to see if @Promaethius is still working on it and to collect other folks' thoughts about next steps.

Let me know what y'all think 🙂

@vincepri @dlipovetsky @sbueringer

ClusterResourceSet Reconcile mode

Glossary

Refer to the Cluster API Book Glossary.

Summary

Provide a mechanism for reconciling resources defined in a ClusterResourceSet, after creation, by interacting exclusively with the management cluster.

Motivation

Currently, ClusterResourceSets only support the ApplyOnce mode.
This makes it impossible to update such resources after creation without interacting with workload clusters directly.
As a result, cluster maintenance becomes a bit cumbersome, since objects need to be reapplied individually on each workload cluster.
It also doesn't guarantee that all workload clusters have the same version of such objects at a given point in time.

Having a mechanism to reconcile the resources managed by ClusterResourceSets would make cluster maintenance simpler and more intuitive: users could just update ConfigMaps and Secrets in the management cluster and let Cluster API manage the complexity of applying those changes to all targeted clusters. This would facilitate the use of ClusterResourceSets in CI/CD pipelines and/or centralized infrastructure systems.

Moreover, some providers, like vSphere, rely "by default" on a ClusterResourceSet to install vital cluster components, like the CPI and CSI. ClusterResourceSet, even if still experimental, has become more than a nice-to-have feature; it is already a key component in the Cluster API ecosystem.

To achieve this, a new Reconcile mode is introduced for ClusterResourceSets. In this mode, the controller will reapply the set of resources on the workload clusters whenever their definitions change in the management cluster.

Goals

  • Provide a way to propagate changes in resources defined in a ClusterResourceSet to all targeted clusters

Non-Goals/Future Work

  • Detect drift when the resources in the workload clusters are directly modified by an external entity
  • Reapply resources periodically
  • Support deletion of resources from clusters

Proposal

This proposal adds a new Reconcile mode to ClusterResourceSets, which re-applies a resource in the targeted workload clusters whenever the hash of the resource's definition (in the management cluster) changes from the last time it was applied.

User Stories

Story 1

As someone using ClusterResourceSets to install resources in multiple clusters, I want to be able to update those resources by just updating their definitions in the management cluster, so I don't have to manually repeat the update for each targeted cluster.

Story 2

As someone using the default Cluster API vSphere provider template, I want to be able to update the CPI and CSI by just updating the ConfigMaps and Secrets in the management cluster, so I don't have to manually repeat the apply for each targeted cluster.

Implementation Details/Notes/Constraints

Data model changes to existing API types

The only change is the addition of the Reconcile value to ClusterResourceSet.spec.mode. This field is an enum, which means the CRD needs to be modified so that the OpenAPI spec allows the new value.

apiVersion: addons.cluster.x-k8s.io/v1alpha3
kind: ClusterResourceSet
metadata:
  name: crs1
  namespace: default
spec:
  mode: "Reconcile"
  clusterSelector:
    matchLabels:
      cni: calico
  resources:
    - name: db-secret
      kind: Secret
    - name: calico-addon
      kind: ConfigMap
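
On the Go API side, the change could look roughly like this sketch (the Mode field and the type/constant names mirror the YAML example above and are assumptions, not the final API):

// ClusterResourceSetMode is the strategy used to apply the set of resources.
// Sketch only: names mirror the YAML example and may differ from the final API.
type ClusterResourceSetMode string

const (
	// ClusterResourceSetModeApplyOnce applies the resources only once (current behavior).
	ClusterResourceSetModeApplyOnce ClusterResourceSetMode = "ApplyOnce"
	// ClusterResourceSetModeReconcile re-applies the resources whenever their
	// definitions change in the management cluster (the value added by this proposal).
	ClusterResourceSetModeReconcile ClusterResourceSetMode = "Reconcile"
)

type ClusterResourceSetSpec struct {
	// Mode is the strategy used to apply the resources to the workload clusters.
	// The kubebuilder enum marker is what makes the generated OpenAPI schema
	// accept the new value.
	// +kubebuilder:validation:Enum=ApplyOnce;Reconcile
	// +optional
	Mode ClusterResourceSetMode `json:"mode,omitempty"`

	// Other fields (clusterSelector, resources) omitted for brevity.
}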

Detecting changes

The current implementation of ClusterResourceSets (with just the ApplyOnce mode) already calculates a consistent hash for the resources' definitions and stores it in the ResourceSetBinding. We will use this same mechanism to detect changes, comparing the hash of the current resource definitions with the one stored in the ResourceSetBinding.

Note that this hash will change when any of the resources is updated, a resource is added, or a resource is removed. This means that all resources in the same ConfigMap or Secret, and not only the one that changed, will be reapplied in any of these three cases. It also means that resources removed from a ConfigMap or Secret won't be deleted from the target clusters.
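
A minimal sketch of the resulting decision, assuming the Applied and Hash fields the experimental ResourceSetBinding API records per resource (the package path, helper name, and currentHash argument are illustrative):

package addons

import addonsv1 "sigs.k8s.io/cluster-api/exp/addons/api/v1beta1"

// needsReapply decides whether a resource should be applied again in
// Reconcile mode, given the binding recorded for it in the
// ClusterResourceSetBinding and the hash computed from its current
// definition in the management cluster.
func needsReapply(binding addonsv1.ResourceBinding, currentHash string) bool {
	// Never applied before: apply, same as ApplyOnce does today.
	if !binding.Applied {
		return true
	}
	// Reconcile mode: re-apply when the stored hash no longer matches.
	return binding.Hash != currentHash
}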

In the following before/after example, only one resource has changed (the ConfigMap calico-configmap). However, all three resources (calico-secret1, calico-secret2, and calico-configmap) will be reapplied.

Before:

apiVersion: v1
kind: ConfigMap
metadata:
  name: calico-addon
data:
  calico1.yaml: |-
    kind: Secret
    apiVersion: v1
    metadata:
      name: calico-secret1
      namespace: mysecrets
    ---
    kind: Secret
    apiVersion: v1
    metadata:
      name: calico-secret2
      namespace: mysecrets
  calico2.yaml: |-
    kind: ConfigMap
    apiVersion: v1
    metadata:
      name: calico-configmap
      namespace: myconfigmaps
    data:
      key: "original value"

After:

apiVersion: v1
kind: ConfigMap
metadata:
  name: calico-addon
data:
  calico1.yaml: |-
    kind: Secret
    apiVersion: v1
    metadata:
      name: calico-secret1
      namespace: mysecrets
    ---
    kind: Secret
    apiVersion: v1
    metadata:
      name: calico-secret2
      namespace: mysecrets
  calico2.yaml: |-
    kind: ConfigMap
    apiVersion: v1
    metadata:
      name: calico-configmap
      namespace: myconfigmaps
    data:
      key: "value that changed"

Drift

The proposed solution only deals with changes in the resources' definitions and not with changes to the actual objects in the workload clusters. If those objects are modified or deleted in the workload clusters, the ClusterResourceSet controller won't do anything, and they will remain unchanged until their definitions in the management cluster are updated.

This could potentially be mitigated by:

  • Implementing a "periodic" reconciliation mode where resources are reapplied with a certain frequency even if their hash hasn't changed.
  • Storing the compounded Generation of the applied objects in the ResourceSetBinding (see the sketch below). Since Generation is a monotonically increasing integer, a change in the compounded generation (adding up the Generation fields of all the resources) means at least one resource changed in the workload cluster. With this mechanism, the hash can be used to detect changes in the resource definitions and the compounded generation to detect changes in the actual workload cluster resources.
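
A minimal sketch of the compounded Generation computation (the helper name and how the value would be stored in the ResourceSetBinding are left as assumptions):

package addons

import "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"

// compoundedGeneration sums the Generation of every applied object, as read
// back from the workload cluster. Generation increases monotonically on spec
// changes, so a different sum indicates that at least one object was modified
// since the last apply.
func compoundedGeneration(objs []*unstructured.Unstructured) int64 {
	var sum int64
	for _, o := range objs {
		sum += o.GetGeneration()
	}
	return sum
}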

Risks and Mitigations

Alternatives

The Alternatives section is used to highlight and record other possible approaches to delivering the value proposed by a proposal.

Upgrade Strategy

This proposal only introduces a new possible value for the Mode field and leaves the current behavior for ApplyOnce untouched, so there is no need to upgrade existing clusters.

Additional Details

Test Plan

Extensive unit testing of applying ClusterResourceSet resources with the new mode. E2e testing as part of the cluster-api e2e test suite.

Graduation Criteria

The main feature is still considered experimental and sits behind a feature flag. This new mode doesn't need its own flag and is simply available whenever the main feature is enabled.

@vincepri
Member

@g-gaston Are you able to move the proposal to a Google doc first so we can send it out to community members for review? It'd also be great to present the proposal at the next office hours.

@g-gaston
Contributor

@vincepri
https://docs.google.com/document/d/1whNhpDpqz3kzCL1JlFh-HcjY7qNEEUOe6fA9jaZgnu0/edit?usp=sharing

I can present it at the meeting, no problem. What's the process for that? Do I need to submit it anywhere prior to the meeting?

@sbueringer
Member

@g-gaston You can just add yourself to the Agenda for Wednesday in: https://docs.google.com/document/d/1ushaVqAKYnZ2VN_aa3GyKlS4kEd6bSug13xaXOakAQI/edit#heading=h.pxsq37pzkbdq

If you don't have access to that doc, you can get it by joining the Google group: https://groups.google.com/g/kubernetes-sig-cluster-lifecycle

@g-gaston
Contributor

@sbueringer Done, thanks!

@sbueringer
Member

/milestone v1.2

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 19, 2022
@fabriziopandini
Member

/remove lifecycle-stale
The last time this was discussed in the CAPI office hours, people volunteered to move this to a proposal/amendment to the current CRS proposal; let's give them some more time to get this work done.

@fabriziopandini
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 22, 2022
@fabriziopandini fabriziopandini added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022
@fabriziopandini fabriziopandini removed this from the v1.2 milestone Jul 29, 2022
@fabriziopandini fabriziopandini removed the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022
@fabriziopandini
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Oct 3, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 1, 2023
@fabriziopandini
Member

/assign @g-gaston
/lifecycle active

@k8s-ci-robot k8s-ci-robot added lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 2, 2023
@jessehu
Contributor

jessehu commented Jan 9, 2023

👍 Will the PR #7497 be delivered in the coming 1.3.x release?

@sbueringer
Member

sbueringer commented Jan 9, 2023

No, as features are not covered by our backport policy.
