
Re-define the scope of clusterctl move #3354

Closed
fabriziopandini opened this issue Jul 16, 2020 · 13 comments
Labels
area/clusterctl Issues or PRs related to clusterctl
kind/feature Categorizes issue or PR as related to a new feature.
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

@fabriziopandini
Member

Detailed Description

Clusterctl move was introduced in v0.3 as an evolution of the pivot command, but with a smaller scope than the original command.

The moving process was designed around the bootstrap use case, and it relies on three core ideas:

  1. identify all the objects that are candidates for move, using as a starting point the list of CRDs installed by clusterctl, plus Secrets & ConfigMaps.
  2. use the ownerRef chain to filter out the exact list of objects to be moved, using the Cluster object as a root.
  3. use the ownerRef chain to define the order for creating and deleting objects.

However, a set of new requirements has recently been presented, so, considering also the discussions about the v1alpha4 roadmap, it is now the right time to consider whether the scope of this command should be re-defined.

The following topics should be addressed IMO:

What to include in a move operation

The initial design was conceived to move all the Clusters in a namespace.
Recently this was extended to include ClusterResourceSet as a root (#3243); there is a WIP PR for identity principals (#3254) and a PR for moving objects which are not related to a Cluster API provider (#3337).

  • Should we definitely drop the idea to support moving a single Cluster, given the recent changes?
  • Should we support moving Clusters in all namespaces in a single operation?
  • What is the contract around moving global objects (not in a namespace)? What happens if they are used by two namespaces? Should we leave them in the original cluster? What happens if they get moved twice?
  • What is the contract around moving objects which are not related to a Cluster API provider? What happens if they are used by two namespaces? Should we leave them in the original cluster? What happens if we are moving an object while it is being reconciled?
  • Should we continue to use the ownerRef chain to identify the exact list of objects to be moved, or find a more generic mechanism that does not require a code change every time a new object gets into scope? If not, how can we exclude e.g. Secrets or ConfigMaps not linked to any cluster?
  • How can we ensure that the entire tree of objects which are not related to a Cluster API provider is moved?
  • Should we include moving providers in the scope?

What use cases are we covering

  • Should we support move after the initial bootstrap? What are the different requirements for this use case?

/kind feature

@wfernandes @vincepri

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jul 16, 2020
@ncdc ncdc added this to the v0.4.0 milestone Jul 16, 2020
@detiber
Member

detiber commented Jul 16, 2020


  • Should we definitely drop the idea to support moving a single Cluster, given the recent changes?

I could potentially see two different use cases for move:

  • I want to move management of this single Cluster to a different management cluster
    • not completely sure this is needed
  • I want to move management of all Clusters to a different management cluster
    • might want to limit to a single namespace if using namespaces for multi-tenancy
    • applies to "pivot" use case for bootstrapping
    • applies to the "pivot" use case for deleting a management cluster
    • applies to the general use case of wanting to migrate where the management cluster is running
  • Should we support moving Clusters in all namespaces in a single operation?

I think so, otherwise we are deferring the complexity of doing so onto the user.

  • What is the contract around moving global objects (not in a namespace)? What happens if they are used by two namespaces? Should we leave them in the original cluster? What happens if they get moved twice?

If we do not move them for the user, then we should block operations if they do not already exist, otherwise we will create a situation where the management cluster behaves very differently pre- and post-move.

If we do decide that we should delete them as part of the move, we should probably only do so after they are no longer in use by any Clusters that remain in the source management cluster. Ideally I'd like to rely on ownerReferences for this, but I don't think we can since they do not have a Namespace field. We also cannot rely on finalizers since an owning controller would not be blocked on performing reconciliation of deletion. Which leaves two options:

  • Make it part of the contract that any controllers for global resources that should be moved block deletion if any Clusters they apply to still exist (pushes the responsibility on external tooling to solve the problem, some of which may not know anything about Cluster API)
  • Somehow add logic into clusterctl move to do the right thing and only delete if nothing is referencing it (will be difficult since there is no way to necessarily know how global resources are linked to Clusters, at least for global resources that are not defined in-tree for Cluster API)
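The ownerReferences limitation mentioned above can be illustrated with a simplified stand-in for the reference type (fields mirrored from Kubernetes' metav1.OwnerReference, reproduced here only for illustration): the reference carries no Namespace field, so it is always resolved in the same namespace as the object that carries it.

```go
package main

import "fmt"

// ownerReference mirrors the shape of metav1.OwnerReference (simplified).
// Note there is no Namespace field: an ownerReference is always resolved
// in the namespace of the object carrying it, which is why it cannot link
// a cluster-scoped (global) resource back to a namespaced Cluster.
type ownerReference struct {
	APIVersion string
	Kind       string
	Name       string
	UID        string
}

func main() {
	ref := ownerReference{
		APIVersion: "cluster.x-k8s.io/v1alpha3",
		Kind:       "Cluster",
		Name:       "cluster-1",
	}
	fmt.Println(ref.Kind, ref.Name)
}
```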

Given the above, I think we should probably support deletion for resources that we explicitly know about (defined in-tree for Cluster API), and defer handling of other global resources to provider (or integration) specific post-move instructions.

It might be good to start thinking longer term if we could support some of this through a plugin mechanism, though. Similar to the provider-specific pre-requisites, it would be nice to have a way to automate some of these provider-specific steps, but we should not necessarily own the implementation of them.

  • What is the contract around moving objects which are not related to a Cluster API provider? What happens if they are used by two namespaces? Should we leave them in the original cluster? What happens if we are moving an object while it is being reconciled?

We probably need to allow for providers/integrations to define pre (and post) steps needed around the use of clusterctl move.

That way they can specify any pre actions needed to "pause" reconciliation during the move and any "post" actions needed to resume reconciliation and possibly clean up resources left behind.

  • Should we continue to use the ownerRef chain to identify the exact list of objects to be moved, or find a more generic mechanism that does not require a code change every time a new object gets into scope? If not, how can we exclude e.g. Secrets or ConfigMaps not linked to any cluster?

I think use of the ownerRef chain is good for any resources that are namespace scoped and exist in the same namespace, but it definitely breaks down for global and cross-namespace references.

I don't think we should move away from the use of ownerRef, but we should likely explore an alternative that could be used in a common way for any global resources.

I'm not sure that we should support cross-namespace references unless there is a specific use case that is defined that we agree to support, though.

  • How can we ensure that the entire tree of objects which are not related to a Cluster API provider is moved?

I don't necessarily think that we can. We can define a convention for supported migration, but can likely leave the rest to pre/post steps needed for a specific provider/integration which may need it. In those cases, I'm wondering if it would help to add a flag to the move command to skip unpausing of the resources until the post steps can be done. We might need a pause/unpause command that can be run after any post steps are run to resume reconciliation of resources, though.
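The pause/unpause flow suggested above could look like this toy sketch. The types are hypothetical; Cluster API exposes a similar Spec.Paused field on Cluster objects, which controllers check before reconciling.

```go
package main

import "fmt"

// cluster is a hypothetical, simplified stand-in for a Cluster object.
type cluster struct {
	Name   string
	Paused bool
}

// pause stops reconciliation in the source management cluster before move.
func pause(c *cluster) { c.Paused = true }

// unpause resumes reconciliation, e.g. after provider-specific post steps.
func unpause(c *cluster) { c.Paused = false }

func main() {
	c := &cluster{Name: "cluster-1"}
	pause(c) // before move: controllers skip this Cluster
	fmt.Println("paused:", c.Paused)
	// ...move objects, run provider/integration post steps...
	unpause(c) // explicit step, as the pause/unpause command idea suggests
	fmt.Println("paused:", c.Paused)
}
```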

  • Should we include moving providers in the scope?

I don't think we should support moving the provider controllers, but to the extent that we can we should move provider resources.

Alternatively, going back to the idea of a plugin system, we could define provider-specific hooks that a provider can optionally implement, which would allow them to handle their resources in the appropriate ways.

What use cases are we covering

  • Should we support move after the initial bootstrap? What are the different requirements for this use case?

There are two that I can see:

  • management cluster deletion, since not supporting this would result in orphaned resources related to the self-managed Cluster
  • I want to move management of my clusters to another cluster
    • This could be that the hardware the management cluster is running on is being decommissioned or they just want to change where the management cluster is running (previously on a machine-based infrastructure, but decided to host the management cluster on a managed provider for example)

/kind feature

@wfernandes @vincepri

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 14, 2020
@fabriziopandini
Member Author

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 14, 2020
@fabriziopandini
Member Author

/area clusterctl

@voor
Member

voor commented Jan 27, 2021

  • When we create clusters we use an external tool, kapp, which creates ConfigMaps linked to the deployment of the Cluster CRs; being able to include additional ConfigMaps (which might reside in another namespace) in the move would be convenient and helpful for this use case.
  • Sometimes we need to organize clusters in namespaces different from where they were initially created; moving between namespaces would be helpful for this.

@wfernandes
Contributor

@voor Would it be possible to mark external objects like those ConfigMaps with the clusterctl.cluster.x-k8s.io/move label?

See https://cluster-api.sigs.k8s.io/clusterctl/provider-contract.html?highlight=move#move for more info

@vincepri
Member

@fabriziopandini Do we still need this issue?

@fabriziopandini
Member Author

Do we still need this issue?

IMO yes, at least until we can derive some action items from this discussion...

@sedefsavas

Moving global objects is needed for CAPA multi-tenancy, and we need to backport it to v0.3 as we will support multi-tenancy in the next release.

CAPZ also needs this support in v0.3 releases IIRC @nader-ziada

@fabriziopandini
Member Author

@sedefsavas I suggest opening a separate issue for the specific problem of moving global objects so we can rally around the details:

  • What is a generic procedure (that works across providers) that allows detecting which global objects to move?
  • What is the contract around moving global objects?
    • What happens if they already exist in the target cluster?
    • Should we leave them in the original cluster, given that they might be used by objects in other namespaces?

@fabriziopandini
Member Author

@vincepri I'm +1 for closing this issue given that we have an influx of activities on more detailed use cases.

@vincepri
Member

/close

@k8s-ci-robot
Contributor

@vincepri: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
