
Continuous ClusterResourceSetStrategy #4807

Closed
Promaethius opened this issue Jun 10, 2021 · 22 comments · Fixed by #7497
Labels
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • lifecycle/active: Indicates that an issue or PR is actively being worked on by a contributor.
  • triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@Promaethius

Promaethius commented Jun 10, 2021

User Story

ClusterResourceSets provide a unique Kubernetes experience where a Cluster and its workload can be defined in a single .yaml file, templated out in CI/CD pipelines, or controlled by centralized management infrastructure. However, ApplyOnce falls short when objects depend on each other or when application definitions change. Adding a ContinuousApply option opens new strategies for reconciliation, dependency trees, and CI/CD pipelines.

Detailed Description

mode: ContinuousApply performs a hash check for the target object on an interval. If the object does not exist on the destination cluster, it is applied. (ApplyOnce already covers this case, but its reconciliation interval is fairly long; for example, applying an operator and a CRD that the operator creates can take up to 15 minutes with ApplyOnce.) If the object does exist on the destination cluster, a hash is calculated for both the source and destination objects; if they do not match, the source object is applied.
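A minimal sketch of that check, assuming hypothetical computeHash and apply helpers (the real controller would reuse the existing ClusterResourceSet apply code and controller-runtime clients):

package addons

import (
	"context"
	"crypto/sha256"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// reconcileObject applies the source object to the destination cluster if it
// is missing there, or if the hash of its definition differs from the hash of
// the object currently on the destination cluster.
func reconcileObject(ctx context.Context, dst client.Client, src *unstructured.Unstructured) error {
	existing := &unstructured.Unstructured{}
	existing.SetGroupVersionKind(src.GroupVersionKind())
	err := dst.Get(ctx, client.ObjectKeyFromObject(src), existing)
	if apierrors.IsNotFound(err) {
		// Object missing on the destination cluster: apply it now.
		return apply(ctx, dst, src)
	}
	if err != nil {
		return err
	}
	// In practice, server-populated fields (status, managedFields, ...) would
	// need to be stripped before hashing the destination object.
	if computeHash(src) != computeHash(existing) {
		return apply(ctx, dst, src)
	}
	return nil
}

// computeHash returns a stable hash of an object's JSON definition.
func computeHash(obj *unstructured.Unstructured) string {
	raw, _ := obj.MarshalJSON()
	return fmt.Sprintf("%x", sha256.Sum256(raw))
}

// apply performs a server-side apply of the object.
func apply(ctx context.Context, c client.Client, obj *unstructured.Unstructured) error {
	return c.Patch(ctx, obj, client.Apply, client.ForceOwnership, client.FieldOwner("cluster-api-crs"))
}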

Anything else you would like to add:

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jun 10, 2021
@sbueringer
Member

If I understand correctly, we should consolidate this issue and #4799

@Promaethius
Author

@sbueringer huh same day, what are the odds. Yeah basically. I'd like to work on it.

@sedefsavas

cc @gab-satchi

@Promaethius
Author

As a semi-related addition to the user story, I'd also like to add a reconciliation interval option for both ApplyOnce and ContinuousApply.

@Promaethius
Author

/assign
Going to start work on this.

@vincepri
Member

/milestone Next

@dlipovetsky
Contributor

@Promaethius I noticed you assigned yourself a few months ago. Are you still working on this? I'd like to help move this idea forward.

@g-gaston
Contributor

g-gaston commented Nov 2, 2021

I opened #5555 and we closed it in favor of this one. Copy-pasting the interesting bits here:

As mentioned in the design proposal, ClusterResourceSets only support the ApplyOnce mode. This makes it impossible to update such resources without interacting with the workload clusters directly. It makes cluster maintenance a bit more cumbersome, since objects need to be reapplied on each cluster individually, as opposed to letting cluster-api manage that complexity. It also doesn't guarantee that all workload clusters have the same version of such objects.

The vSphere provider currently uses a ClusterResourceSet "by default" for the CPI and CSI. So I believe that for vSphere production clusters, ClusterResourceSet, even if still experimental, is not necessarily a nice-to-have feature anymore, but a key component that would benefit from better lifecycle management.

Another note: I was expecting, as a workaround for this issue, that resources would be reapplied when creating a new ClusterResourceSet pointing to the same objects (since this CRS wouldn't have a ClusterResourceSetBinding). However, that doesn't happen because of how this is implemented. I believe the CAEP doesn't specify that objects won't be reapplied if a new CRS is created, but it also doesn't specify the opposite. Is this maybe something you might be open to changing while we work on a new mode?

As I said in the original issue, I'm more than happy to take this if no one else has already or help whoever is currently working on it. I'm available to start working on it right away.

@g-gaston
Contributor

This issue seemed stale, so I went ahead and wrote a rough draft of a proposal. I don't even know whether a change like this requires a design proposal, and this one is pretty barebones, but I hope it works as a starting point for a conversation.

Maybe this needs to be presented in a community meeting, but I thought it was better to post it here first to see if @Promaethius is still working on it and to collect other folks' thoughts about next steps.

Let me know what y'all think 🙂

@vincepri @dlipovetsky @sbueringer

ClusterResourceSet Reconcile mode

Glossary

Refer to the Cluster API Book Glossary.

Summary

Provide a mechanism for reconciling resources defined in a ClusterResourceSet, after creation, by interacting exclusively with the management cluster.

Motivation

Currently, ClusterResourceSets only support the ApplyOnce mode.
This makes it impossible to update such resources after creation without interacting with workload clusters directly.
As a result, cluster maintenance becomes a bit cumbersome, since objects need to be reapplied individually on each workload cluster.
It also doesn't guarantee that all workload clusters have the same version of such objects at a given point in time.

Having a mechanism to reconcile the resources managed by ClusterResourceSets would make cluster maintenance simpler and more intuitive: users could just update ConfigMaps and Secrets in the management cluster and let Cluster API manage the complexity of applying those changes to all targeted clusters. This would facilitate the use of ClusterResourceSets in CI/CD pipelines and/or centralized infrastructure systems.

Moreover, some providers, like vSphere, rely "by default" on a ClusterResourceSet to install vital cluster components, like the CPI and CSI. ClusterResourceSet, even if still experimental, has become more than a nice-to-have feature; it is already a key component in the Cluster API ecosystem.

To achieve this, a new Reconcile mode is introduced for ClusterResourceSets. In this mode, the controller will reapply the set of resources on the workload clusters whenever their definitions change in the management cluster.

Goals

  • Provide a way to propagate changes in resources defined in a ClusterResourceSet to all targeted clusters

Non-Goals/Future Work

  • Detect drift when the resources in the workload clusters are directly modified by an external entity
  • Reapply resources periodically
  • Support deletion of resources from clusters

Proposal

This proposal adds a new Reconcile mode to ClusterResourceSets, which re-applies a resource in the targeted workload clusters whenever the hash of the resource's definition (in the management cluster) changes from the last time it was applied.

User Stories

Story 1

As someone using ClusterResourceSets to install resources in multiple clusters, I want to be able to update those resources by just updating their definitions in the management cluster, so I don't have to manually repeat the update for each targeted cluster.

Story 2

As someone using the default Cluster API vSphere provider template, I want to be able to update the CPI and CSI by just updating the ConfigMaps and Secrets in the management cluster, so I don't have to manually repeat the apply for each targeted cluster.

Implementation Details/Notes/Constraints

Data model changes to existing API types

The only change is the addition of the Reconcile value to ClusterResourceSet.spec.mode. This field is an enum, which means the CRD needs to be modified so that the OpenAPI spec allows the new value.

apiVersion: addons.cluster.x-k8s.io/v1alpha3
kind: ClusterResourceSet
metadata:
  name: crs1
  namespace: default
spec:
  mode: "Reconcile"
  clusterSelector:
    matchLabels:
      cni: calico
  resources:
    - name: db-secret
      kind: Secret
    - name: calico-addon
      kind: ConfigMap
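
On the Go API side, the change could look roughly like this sketch (the Mode field and the type/constant names mirror the YAML example above and are assumptions, not the final API):

// ClusterResourceSetMode is the strategy used to apply the set of resources.
// Sketch only: names mirror the YAML example and may differ from the final API.
type ClusterResourceSetMode string

const (
	// ClusterResourceSetModeApplyOnce applies the resources only once (current behavior).
	ClusterResourceSetModeApplyOnce ClusterResourceSetMode = "ApplyOnce"
	// ClusterResourceSetModeReconcile re-applies the resources whenever their
	// definitions change in the management cluster (the value added by this proposal).
	ClusterResourceSetModeReconcile ClusterResourceSetMode = "Reconcile"
)

type ClusterResourceSetSpec struct {
	// Mode is the strategy used to apply the resources to the workload clusters.
	// The kubebuilder enum marker is what makes the generated OpenAPI schema
	// accept the new value.
	// +kubebuilder:validation:Enum=ApplyOnce;Reconcile
	// +optional
	Mode ClusterResourceSetMode `json:"mode,omitempty"`

	// Other fields (clusterSelector, resources) omitted for brevity.
}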

Detecting changes

The current implementation of ClusterResourceSets (with just the ApplyOnce mode) already calculates a consistent hash for the resources' definitions and stores it in the ResourceSetBinding. We will use this same mechanism to detect changes, comparing the hash of the current resource definitions with the one stored in the ResourceSetBinding.

Note that this hash will change when any of the resources is updated, a resource is added, or a resource is removed. This means that all resources in the same ConfigMap or Secret, and not only the one that changed, will be reapplied in any of these three cases. It also means that resources removed from a ConfigMap or Secret won't be deleted from the target clusters.
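
A minimal sketch of the resulting decision, assuming the Applied and Hash fields the experimental ResourceSetBinding API records per resource (the package path, helper name, and currentHash argument are illustrative):

package addons

import addonsv1 "sigs.k8s.io/cluster-api/exp/addons/api/v1beta1"

// needsReapply decides whether a resource should be applied again in
// Reconcile mode, given the binding recorded for it in the
// ClusterResourceSetBinding and the hash computed from its current
// definition in the management cluster.
func needsReapply(binding addonsv1.ResourceBinding, currentHash string) bool {
	// Never applied before: apply, same as ApplyOnce does today.
	if !binding.Applied {
		return true
	}
	// Reconcile mode: re-apply when the stored hash no longer matches.
	return binding.Hash != currentHash
}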

In the following before/after example, only one resource has changed (the ConfigMap calico-configmap). However, all three resources (calico-secret1, calico-secret2, and calico-configmap) will be reapplied.

Before:

apiVersion: v1
kind: ConfigMap
metadata:
  name: calico-addon
data:
  calico1.yaml: |-
    kind: Secret
    apiVersion: v1
    metadata:
      name: calico-secret1
      namespace: mysecrets
    ---
    kind: Secret
    apiVersion: v1
    metadata:
      name: calico-secret2
      namespace: mysecrets
  calico2.yaml: |-
    kind: ConfigMap
    apiVersion: v1
    metadata:
      name: calico-configmap
      namespace: myconfigmaps
    data:
      key: "original value"

After:

apiVersion: v1
kind: ConfigMap
metadata:
  name: calico-addon
data:
  calico1.yaml: |-
    kind: Secret
    apiVersion: v1
    metadata:
      name: calico-secret1
      namespace: mysecrets
    ---
    kind: Secret
    apiVersion: v1
    metadata:
      name: calico-secret2
      namespace: mysecrets
  calico2.yaml: |-
    kind: ConfigMap
    apiVersion: v1
    metadata:
      name: calico-configmap
      namespace: myconfigmaps
    data:
      key: "value that changed"

Drift

The proposed solution only deals with changes in the resources' definitions and not with changes to the actual objects in the workload clusters. If those objects are modified or deleted in the workload clusters, the ClusterResourceSet controller won't do anything, and they will remain unchanged until their definitions in the management cluster are updated.

This could potentially be mitigated by:

  • Implementing a "periodic" reconciliation mode where resources are reapplied with a certain frequency even if their hash hasn't changed.
  • Storing the compounded Generation of the applied objects in the ResourceSetBinding (see the sketch below). Since Generation is a monotonically increasing integer, a change in the compounded generation (adding up the Generation fields of all the resources) means at least one resource changed in the workload cluster. With this mechanism, the hash can be used to detect changes in the resource definitions and the compounded generation to detect changes in the actual workload cluster resources.
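
A minimal sketch of the compounded Generation computation (the helper name and how the value would be stored in the ResourceSetBinding are left as assumptions):

package addons

import "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"

// compoundedGeneration sums the Generation of every applied object, as read
// back from the workload cluster. Generation increases monotonically on spec
// changes, so a different sum indicates that at least one object was modified
// since the last apply.
func compoundedGeneration(objs []*unstructured.Unstructured) int64 {
	var sum int64
	for _, o := range objs {
		sum += o.GetGeneration()
	}
	return sum
}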

Risks and Mitigations

Alternatives

The Alternatives section is used to highlight and record other possible approaches to delivering the value proposed by a proposal.

Upgrade Strategy

This proposal only introduces a new possible value for the Mode field and leaves the current behavior for ApplyOnce untouched, so there is no need to upgrade existing clusters.

Additional Details

Test Plan

Extensive unit testing of applying ClusterResourceSet resources with the new mode. E2e testing as part of the cluster-api e2e test suite.

Graduation Criteria

The main feature is still considered experimental and sits behind a feature flag. This new mode doesn't need its own flag and is simply available whenever the main feature is enabled.

@vincepri
Member

@g-gaston Are you able to move the proposal to a Google doc first so we can send it out to community members for review? It'd also be great to present the proposal at the next office hours.

@g-gaston
Contributor

@vincepri
https://docs.google.com/document/d/1whNhpDpqz3kzCL1JlFh-HcjY7qNEEUOe6fA9jaZgnu0/edit?usp=sharing

I can present it at the meeting, no problem. What's the process for that? Do I need to submit it anywhere prior to the meeting?

@sbueringer
Member

@g-gaston You can just add yourself to the Agenda for Wednesday in: https://docs.google.com/document/d/1ushaVqAKYnZ2VN_aa3GyKlS4kEd6bSug13xaXOakAQI/edit#heading=h.pxsq37pzkbdq

If you don't have access to that doc, you can get it by joining the Google group: https://groups.google.com/g/kubernetes-sig-cluster-lifecycle

@g-gaston
Contributor

@sbueringer Done, thanks!

@sbueringer
Member

/milestone v1.2

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 19, 2022
@fabriziopandini
Member

/remove lifecycle-stale
The last time this was discussed in the CAPI office hours, people volunteered to move this to a proposal/amendment to the current CRS proposal; let's give them some more time to get this work done.

@fabriziopandini
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 22, 2022
@fabriziopandini fabriziopandini added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022
@fabriziopandini fabriziopandini removed this from the v1.2 milestone Jul 29, 2022
@fabriziopandini fabriziopandini removed the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022
@fabriziopandini
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Oct 3, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 1, 2023
@fabriziopandini
Member

/assign @g-gaston
/lifecycle active

@k8s-ci-robot k8s-ci-robot added lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 2, 2023
@jessehu
Contributor

jessehu commented Jan 9, 2023

👍 Will the PR #7497 be delivered in the coming 1.3.x release?

@sbueringer
Member

sbueringer commented Jan 9, 2023

No, as features are not covered by our backport policy.
