
Stop using full CRD list as a fallback to get ObjectMetadata for UpdateReferenceAPIContract #5686

Closed
sbueringer opened this issue Nov 16, 2021 · 19 comments · Fixed by #8041

@sbueringer
Member

sbueringer commented Nov 16, 2021

Detailed Description

Context: We automatically update references in CAPI to a newer apiVersion if there is a newer version of a CRD with the same contract.
Example: The MachineSet controller automatically updates references to an InfrastructureMachineTemplate if there is a newer apiVersion of the InfrastructureMachineTemplate which complies with the same CAPI contract.

This functionality is implemented via the UpdateReferenceAPIContract func. It uses the labels on a CRD to determine whether there is a newer version of the CRD that complies with the same CAPI contract.

To avoid retrieving/caching whole CRDs, the func uses GetGVKMetadata to retrieve the PartialObjectMetadata of a CRD. GetGVKMetadata depends on the CRD being named according to the convention, i.e. the name must be:

fmt.Sprintf("%s.%s", flect.Pluralize(strings.ToLower(gvk.Kind)), gvk.Group)

(e.g. infrastructure.cluster.x-k8s.io/DockerMachineTemplate => dockermachinetemplates.infrastructure.cluster.x-k8s.io)
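For illustration, here is a minimal standalone sketch of that computation (the helper name expectedCRDName is just for this example, not an actual CAPI function):

```go
package main

import (
	"fmt"
	"strings"

	"github.com/gobuffalo/flect"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// expectedCRDName derives the conventional CRD name for a GVK,
// i.e. <pluralized lowercase kind>.<group>.
func expectedCRDName(gvk schema.GroupVersionKind) string {
	return fmt.Sprintf("%s.%s", flect.Pluralize(strings.ToLower(gvk.Kind)), gvk.Group)
}

func main() {
	gvk := schema.GroupVersionKind{
		Group:   "infrastructure.cluster.x-k8s.io",
		Version: "v1beta1",
		Kind:    "DockerMachineTemplate",
	}
	// Prints: dockermachinetemplates.infrastructure.cluster.x-k8s.io
	fmt.Println(expectedCRDName(gvk))
}
```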

In cases where the name is different, we do a full CRD list and identify the correct CRD by checking the GK inside the CRD. The problem is that in this case the controller lists/watches/caches all CRDs in the cluster, which leads to high memory usage in the controller. This affects the CAPI core and the KCP controllers. It looks like the func is not used outside of the core repo: https://cs.k8s.io/?q=UpdateReferenceAPIContract&i=nope&files=&excludeFiles=&repos=
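For context, the non-fallback path boils down to a by-name Get of only the CRD's metadata via controller-runtime, roughly like this (a sketch of the idea behind GetGVKMetadata, not its actual implementation):

```go
package crdmeta

import (
	"context"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// getCRDMetadata fetches only the metadata (including the contract labels) of the
// CRD with the given name, so whole CRDs don't have to be retrieved or cached.
func getCRDMetadata(ctx context.Context, c client.Client, crdName string) (*metav1.PartialObjectMetadata, error) {
	md := &metav1.PartialObjectMetadata{}
	md.SetGroupVersionKind(apiextensionsv1.SchemeGroupVersion.WithKind("CustomResourceDefinition"))
	if err := c.Get(ctx, client.ObjectKey{Name: crdName}, md); err != nil {
		return nil, err
	}
	return md, nil
}
```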

To avoid running into this issue, we propose dropping the fallback mechanism. This is also a forcing function to name CRDs correctly.

WDYT?

Tasks are roughly:

  • Drop the fallback mechanism in UpdateReferenceAPIContract (probably create a copy of the func, deprecate the old one (?))
  • Use only the new func in our controllers
  • Document the change in the migration guide (maybe also in the contract docs (?))

/kind cleanup

@k8s-ci-robot k8s-ci-robot added the kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. label Nov 16, 2021
@sbueringer
Member Author

@fabriziopandini
Member

+1, but we should discuss this with provider implementers at the office hours and understand how this forcing function impacts them, in order to figure out how fast this can be rolled out

@sbueringer
Member Author

sbueringer commented Nov 17, 2021

Result from the CAPI meeting today:

Tasks:

  • v1.1:

  • v1.3:

    • Add the CRD name restriction to the contract: ⚠️ contract: add CRD naming requirements #7297
    • Verify known providers manually or create issues in provider repos, so they can audit their CRDs: https://gist.github.com/sbueringer/8178a1eaff59095a7c56c373da0f187e
    • clusterctl should emit a warning (with the hint that support will be dropped in a future release)
    • Notify users/providers via mailing list (part of v1.3 release mail)
    • e2e testing: it would be nice to have some way to report such issues via a test. Options:
      • Some kind of conformance test
      • Maybe just surface the clusterctl warning
      • => I think we don't need this, given that even the quickstart won't work anymore after we drop support for non-compliant CRD names (which we should do early in our v1.4 release cycle)
  • v1.4:

    • Drop support for non-compliant CRD names

@fabriziopandini
Member

/milestone v1.2

@k8s-ci-robot k8s-ci-robot added this to the v1.2 milestone Jan 26, 2022
@sbueringer
Member Author

Short update: I will definitely continue with this issue in v1.2. It's just relatively far down on the priority queue.

@sbueringer
Member Author

/assign

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 18, 2022
@fabriziopandini
Member

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 18, 2022
@fabriziopandini fabriziopandini added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022
@fabriziopandini fabriziopandini removed this from the v1.2 milestone Jul 29, 2022
@fabriziopandini fabriziopandini removed the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022
@sbueringer
Member Author

sbueringer commented Sep 6, 2022

I implemented a unit test at https://github.com/sbueringer/cluster-api/blob/poc-clusterctl-crd/cmd/clusterctl/client/config_test.go#L92-L141 to check whether at least all builtin providers are using correct CRD names. It turns out almost all of them do; only one single CRD is wrong:

  • the metal3datas.infrastructure.cluster.x-k8s.io CRD should actually be called metal3data.infrastructure.cluster.x-k8s.io

On one side that is good news, on the other side it brings up the question of how this CRD could be renamed. Names of resources are immutable. This means the only way to rename the resource is to delete the CRD (and thus the corresponding CRs) and then create it again, which means you're losing all your CRs.

I'm not sure if that is something that we can force providers (metal3 + potentially providers we don't know about) to do at this point.

Does anyone have other ideas on how CRDs could be renamed without losing CRs?

I think otherwise we have to close this issue and live with our current implementation. This essentially means that as soon as providers use incorrectly named CRDs, performance degrades, as our controllers fall back to the APIReader to get the CRD whenever references are bumped (in KCP, ClusterClass, Cluster, MD, MS, Machine).

@sbueringer
Member Author

sbueringer commented Sep 6, 2022

@fabriziopandini @CecileRobertMichon @enxebre @vincepri Any opinions / suggestions?

@furkatgofurov7
Member

furkatgofurov7 commented Sep 6, 2022

> I implemented a unit test at https://github.com/sbueringer/cluster-api/blob/poc-clusterctl-crd/cmd/clusterctl/client/config_test.go#L92-L141 to check whether at least all builtin providers are using correct CRD names. It turns out almost all of them do; only one single CRD is wrong:
>
> * the `metal3datas.infrastructure.cluster.x-k8s.io` CRD should actually be called `metal3data.infrastructure.cluster.x-k8s.io`

@sbueringer I just checked it in the metal3 provider repo, and I think we do define the kind for that specific CRD correctly: https://github.com/metal3-io/cluster-api-provider-metal3/blob/main/config/crd/bases/infrastructure.cluster.x-k8s.io_metal3datas.yaml#L14. Would you mind pointing out where the pluralization is going wrong? AFAIU flect.Pluralize takes the Kind (Metal3Data), lowercases it (metal3data) and pluralizes it (metal3datas), the same way outlined in the issue description for the CAPD CRD, or am I missing something?
Sorry for jumping in on this, I could not leave it as is and wanted to know if something has to be done/fixed on the provider side.

@sbueringer
Member Author

sbueringer commented Sep 6, 2022

No worries. I think the issue is that flect.Pluralize pluralizes metal3data as metal3data, i.e. without an "s".

Afaik our assumption is that kubebuilder is also using flect.Pluralize, which helps if the CRDs are scaffolded with kubebuilder, but otherwise it's hard to know the plural when creating them manually.

Btw, kind and group are absolutely correct; the problem is literally just the metadata.name of the CRD.
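To illustrate, a tiny standalone sketch (the metal3data output is as reported in this thread, not re-verified here):

```go
package main

import (
	"fmt"

	"github.com/gobuffalo/flect"
)

func main() {
	// Regular plural: an "s" is appended.
	fmt.Println(flect.Pluralize("dockermachinetemplate")) // dockermachinetemplates
	// "data" is treated as already plural, so the name stays unchanged
	// (this is the behavior reported above for the metal3 CRD).
	fmt.Println(flect.Pluralize("metal3data")) // metal3data
}
```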

@furkatgofurov7
Member

> No worries. I think the issue is that flect.Pluralize pluralizes metal3data as metal3data, i.e. without an "s".

Oh yeah, since data is already in a plural form 😀

@fabriziopandini
Member

> @fabriziopandini @CecileRobertMichon @enxebre @vincepri Any opinions / suggestions?

I think we should continue with the plan described in #5686 (comment): add documentation, create awareness via email / a reminder in the office hours, add warnings, and sometime in the future we can finally drop support for the fallback mechanism.

@sbueringer
Member Author

I'll take another look today. I just realized that the metal3 CR might even be okay (as we probably use the relevant func only under certain circumstances).

@sbueringer
Member Author

sbueringer commented Sep 28, 2022

Thinking about "clusterctl should emit a warning (with the hint that support will be dropped in a future release)"

The idea is to add a pre-check to clusterctl init & clusterctl upgrade (any other commands?) to emit a warning when CRDs with invalid names are deployed.

It's easy to identify an invalid name, but not all CRDs deployed by providers have to comply with our naming convention; only the ones referenced by ClusterClass, Cluster, KCP, MD, MS, MachinePool, and Machine do. This includes:

  • infrastructure provider:
    • InfrastructureCluster
    • InfrastructureClusterTemplate
    • InfrastructureMachine
    • InfrastructureMachineTemplate
    • InfrastructureMachinePool
  • bootstrap provider:
    • BootstrapConfig
    • BootstrapConfigTemplate
  • control plane provider
    • ControlPlane
    • ControlPlaneTemplate

To avoid false positives it would be nice to only validate these types of CRDs. The problem is that there is no way to identify them. We could try to only validate CRDs which:

  • have a corresponding Template CRD
  • end with *MachinePool

This still leaves room for false positives:

  • not sure if InfraMachinePool resources have to end with MachinePool
  • there could be provider CRDs like Metal3Data and Metal3DataTemplate which are not used by Cluster API despite their names
  • not every provider necessarily supports ClusterClass, so InfrastructureClusterTemplate and ControlPlaneTemplate might not exist

Considering all this, I would just validate the names of all CRDs and print a warning that states when exactly a non-compliant CRD name is actually a problem. Over time we can add known false positives (e.g. metal3data) to an allow list in the pre-check to avoid the warning.
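A rough sketch of such a check (function and message wording are illustrative only, not the actual clusterctl implementation):

```go
package precheck

import (
	"fmt"
	"strings"

	"github.com/gobuffalo/flect"
	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
)

// validateCRDNames compares each CRD's metadata.name against the conventional
// <pluralized lowercase kind>.<group> name and collects warnings for mismatches.
func validateCRDNames(crds []apiextensionsv1.CustomResourceDefinition) []string {
	var warnings []string
	for _, crd := range crds {
		expected := fmt.Sprintf("%s.%s",
			flect.Pluralize(strings.ToLower(crd.Spec.Names.Kind)), crd.Spec.Group)
		if crd.Name != expected {
			warnings = append(warnings, fmt.Sprintf(
				"CRD %q does not follow the Cluster API naming convention (expected %q); "+
					"this is only a problem if the CRD is referenced by Cluster API objects, "+
					"and support for such names will be dropped in a future release",
				crd.Name, expected))
		}
	}
	return warnings
}
```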

@fabriziopandini WDYT?

@fabriziopandini
Member

What about validating all CRDs except the ones with a well-known annotation, so provider authors can "silence" the warning without sending a PR to CAPI?

note: I think this is acceptable because the research done on all the listed providers provided evidence that very few CRDs (only one) are not compliant with the rule

@sbueringer
Member Author

> What about validating all CRDs except the ones with a well-known annotation, so provider authors can "silence" the warning without sending a PR to CAPI?

Sure, can do.
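Extending the pre-check sketch above, the exception could look roughly like this (the annotation key is purely hypothetical and would need to be agreed on; apiextensionsv1 is the same import as in the earlier sketch):

```go
// skipCRDNameCheckAnnotation is a hypothetical annotation key provider authors
// could set on a CRD to silence the naming warning.
const skipCRDNameCheckAnnotation = "clusterctl.cluster.x-k8s.io/skip-crd-name-check"

// shouldSkipCRDNameCheck reports whether the naming warning should be
// suppressed because the provider explicitly opted out for this CRD.
func shouldSkipCRDNameCheck(crd apiextensionsv1.CustomResourceDefinition) bool {
	_, ok := crd.GetAnnotations()[skipCRDNameCheckAnnotation]
	return ok
}
```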

@fabriziopandini
Member

/triage accepted
