
Stop using full CRD list as a fallback to get ObjectMetadata for UpdateReferenceAPIContract #5686

Closed
sbueringer opened this issue Nov 16, 2021 · 19 comments · Fixed by #8041

@sbueringer
Member

sbueringer commented Nov 16, 2021

Detailed Description

Context: We automatically update references in CAPI to a newer apiVersion if there is a newer version of a CRD with the same contract.
Example: The MachineSet controller automatically updates references to an InfrastructureMachineTemplate if there is a newer apiVersion of the InfrastructureMachineTemplate which complies with the same CAPI contract.

This functionality is implemented via the UpdateReferenceAPIContract func. It uses the labels on a CRD to determine whether there is a newer version of the CRD that complies with the same CAPI contract.

To avoid retrieving/caching whole CRDs, the func uses GetGVKMetadata to retrieve the PartialObjectMetadata of a CRD. GetGVKMetadata depends on the CRD being named according to the convention, i.e. the name must be:

fmt.Sprintf("%s.%s", flect.Pluralize(strings.ToLower(gvk.Kind)), gvk.Group)

(e.g. infrastructure.cluster.x-k8s.io/DockerMachineTemplate => dockermachinetemplates.infrastructure.cluster.x-k8s.io)
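For illustration, here is a minimal standalone sketch of that computation (the helper name expectedCRDName is just for this example, not an actual CAPI function):

```go
package main

import (
	"fmt"
	"strings"

	"github.com/gobuffalo/flect"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// expectedCRDName derives the conventional CRD name for a GVK,
// i.e. <pluralized lowercase kind>.<group>.
func expectedCRDName(gvk schema.GroupVersionKind) string {
	return fmt.Sprintf("%s.%s", flect.Pluralize(strings.ToLower(gvk.Kind)), gvk.Group)
}

func main() {
	gvk := schema.GroupVersionKind{
		Group:   "infrastructure.cluster.x-k8s.io",
		Version: "v1beta1",
		Kind:    "DockerMachineTemplate",
	}
	// Prints: dockermachinetemplates.infrastructure.cluster.x-k8s.io
	fmt.Println(expectedCRDName(gvk))
}
```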

In cases where the name is different, we do a full CRD list and identify the correct CRD by checking the GK inside the CRD. The problem is that in this case the controller lists/watches/caches all CRDs in the cluster, which leads to high memory usage in the controller. This affects the CAPI core and the KCP controllers. It looks like the func is not used outside of the core repo: https://cs.k8s.io/?q=UpdateReferenceAPIContract&i=nope&files=&excludeFiles=&repos=
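For context, the non-fallback path boils down to a by-name Get of only the CRD's metadata via controller-runtime, roughly like this (a sketch of the idea behind GetGVKMetadata, not its actual implementation):

```go
package crdmeta

import (
	"context"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// getCRDMetadata fetches only the metadata (including the contract labels) of the
// CRD with the given name, so whole CRDs don't have to be retrieved or cached.
func getCRDMetadata(ctx context.Context, c client.Client, crdName string) (*metav1.PartialObjectMetadata, error) {
	md := &metav1.PartialObjectMetadata{}
	md.SetGroupVersionKind(apiextensionsv1.SchemeGroupVersion.WithKind("CustomResourceDefinition"))
	if err := c.Get(ctx, client.ObjectKey{Name: crdName}, md); err != nil {
		return nil, err
	}
	return md, nil
}
```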

To avoid running into this issue, we propose dropping the fallback mechanism. This is also a forcing function to name CRDs correctly.

WDYT?

Tasks are roughly:

  • Drop the fallback mechanism in UpdateReferenceAPIContract (probably create a copy of the func, deprecate the old one (?))
  • Use only the new func in our controllers
  • Document the change in the migration guide (maybe also in the contract docs (?))

/kind cleanup

@k8s-ci-robot k8s-ci-robot added the kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. label Nov 16, 2021
@sbueringer
Member Author

@fabriziopandini
Member

+1, but we should discuss this with provider implementers at the office hours and understand how this forcing function impacts them, in order to figure out how fast this can be rolled out

@sbueringer
Member Author

sbueringer commented Nov 17, 2021

Result from the CAPI meeting today:

Tasks:

  • v1.1:

  • v1.3:

    • Add the CRD name restriction to the contract: ⚠️ contract: add CRD naming requirements #7297
    • Verify known providers manually or create issues in provider repos, so they can audit their CRDs: https://gist.github.com/sbueringer/8178a1eaff59095a7c56c373da0f187e
    • clusterctl should emit a warning (with the hint that support will be dropped in a future release)
    • Notify users/providers via mailing list (part of v1.3 release mail)
    • e2e testing: it would be nice to have some way to report such issues via a test. Options:
      • Some kind of conformance test
      • Maybe just surface the clusterctl warning
      • => I think we don't need this, given that even the quickstart won't work anymore after we drop support for non-compliant CRD names (which we should do early in our v1.4 release cycle)
  • v1.4:

    • Drop support for non-compliant CRD names

@fabriziopandini
Member

/milestone v1.2

@k8s-ci-robot k8s-ci-robot added this to the v1.2 milestone Jan 26, 2022
@sbueringer
Member Author

Short update: I will definitely continue with this issue in v1.2. It's just relatively far down on the priority queue.

@sbueringer
Member Author

/assign

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 18, 2022
@fabriziopandini
Member

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 18, 2022
@fabriziopandini fabriziopandini added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022
@fabriziopandini fabriziopandini removed this from the v1.2 milestone Jul 29, 2022
@fabriziopandini fabriziopandini removed the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022
@sbueringer
Member Author

sbueringer commented Sep 6, 2022

I implemented a unit test at https://github.com/sbueringer/cluster-api/blob/poc-clusterctl-crd/cmd/clusterctl/client/config_test.go#L92-L141 to check whether at least all builtin providers are using correct CRD names. It turns out almost all of them do; only one single CRD is wrong:

  • the metal3datas.infrastructure.cluster.x-k8s.io CRD should actually be called metal3data.infrastructure.cluster.x-k8s.io

On one side that is good news, on the other side it brings up the question of how this CRD could be renamed. Names of resources are immutable. This means the only way to rename the resource is to delete the CRD (and thus the corresponding CRs) and then create it again, which means you're losing all your CRs.

I'm not sure if that is something that we can force providers (metal3 + potentially providers we don't know about) to do at this point.

Does anyone have other ideas on how CRDs could be renamed without losing CRs?

I think otherwise we have to close this issue and live with our current implementation. This essentially means that as soon as providers use incorrectly named CRDs, performance degrades, as our controllers fall back to the APIReader to get the CRD whenever references are bumped (in KCP, ClusterClass, Cluster, MD, MS, Machine).

@sbueringer
Member Author

sbueringer commented Sep 6, 2022

@fabriziopandini @CecileRobertMichon @enxebre @vincepri Any opinions / suggestions?

@furkatgofurov7
Member

furkatgofurov7 commented Sep 6, 2022

> I implemented a unit test at https://github.com/sbueringer/cluster-api/blob/poc-clusterctl-crd/cmd/clusterctl/client/config_test.go#L92-L141 to check whether at least all builtin providers are using correct CRD names. It turns out almost all of them do; only one single CRD is wrong:
>
> * the `metal3datas.infrastructure.cluster.x-k8s.io` CRD should actually be called `metal3data.infrastructure.cluster.x-k8s.io`

@sbueringer I just checked it in the metal3 provider repo, and I think we do define the kind for that specific CRD correctly: https://github.com/metal3-io/cluster-api-provider-metal3/blob/main/config/crd/bases/infrastructure.cluster.x-k8s.io_metal3datas.yaml#L14. Would you mind pointing out where the pluralization is going wrong? AFAIU flect.Pluralize takes the Kind (Metal3Data), lowercases it (metal3data) and pluralizes it (metal3datas), the same way outlined in the issue description for the CAPD CRD, or am I missing something?
Sorry for jumping in on this, I could not leave it as is and wanted to know if something has to be done/fixed on the provider side.

@sbueringer
Member Author

sbueringer commented Sep 6, 2022

No worries. I think the issue is that flect.Pluralize pluralizes metal3data as metal3data, i.e. without an "s".

Afaik our assumption is that kubebuilder is also using flect.Pluralize, which helps if the CRDs are scaffolded with kubebuilder, but otherwise it's hard to know the plural when creating them manually.

Btw, kind and group are absolutely correct; the problem is literally just the metadata.name of the CRD.
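To illustrate, a tiny standalone sketch (the metal3data output is as reported in this thread, not re-verified here):

```go
package main

import (
	"fmt"

	"github.com/gobuffalo/flect"
)

func main() {
	// Regular plural: an "s" is appended.
	fmt.Println(flect.Pluralize("dockermachinetemplate")) // dockermachinetemplates
	// "data" is treated as already plural, so the name stays unchanged
	// (this is the behavior reported above for the metal3 CRD).
	fmt.Println(flect.Pluralize("metal3data")) // metal3data
}
```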

@furkatgofurov7
Member

> No worries. I think the issue is that flect.Pluralize pluralizes metal3data as metal3data, i.e. without an "s".

Oh yeah, since data is already in a plural form 😀

@fabriziopandini
Member

> @fabriziopandini @CecileRobertMichon @enxebre @vincepri Any opinions / suggestions?

I think we should continue with the plan described in #5686 (comment): add documentation, create awareness via email / a reminder in the office hours, add warnings, and sometime in the future we can finally drop support for the fallback mechanism.

@sbueringer
Member Author

I'll take another look today. I just realized that the metal3 CR might even be okay (as we probably use the relevant func only under certain circumstances).

@sbueringer
Member Author

sbueringer commented Sep 28, 2022

Thinking about "clusterctl should emit a warning (with the hint that support will be dropped in a future release)"

The idea is to add a pre-check to clusterctl init & clusterctl upgrade (any other commands?) to emit a warning when CRDs with invalid names are deployed.

It's easy to identify an invalid name, but not all CRDs deployed by providers have to comply with our naming convention; only the ones referenced by ClusterClass, Cluster, KCP, MD, MS, MachinePool, and Machine do. This includes:

  • infrastructure provider:
    • InfrastructureCluster
    • InfrastructureClusterTemplate
    • InfrastructureMachine
    • InfrastructureMachineTemplate
    • InfrastructureMachinePool
  • bootstrap provider:
    • BootstrapConfig
    • BootstrapConfigTemplate
  • control plane provider
    • ControlPlane
    • ControlPlaneTemplate

To avoid false positives it would be nice to only validate these types of CRDs. The problem is that there is no way to identify them. We could try to only validate CRDs which:

  • have a corresponding Template CRD
  • end with *MachinePool

This still leaves room for false positives:

  • not sure if InfraMachinePool resources have to end with MachinePool
  • there could be provider CRDs like Metal3Data and Metal3DataTemplate which are not used by Cluster API despite their names
  • not every provider necessarily supports ClusterClass, so InfrastructureClusterTemplate and ControlPlaneTemplate might not exist

Considering all this, I would just validate the names of all CRDs and print a warning that states when exactly a non-compliant CRD name is actually a problem. Over time we can add known false positives (e.g. metal3data) to an allow list in the pre-check to avoid the warning.
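A rough sketch of such a check (function and message wording are illustrative only, not the actual clusterctl implementation):

```go
package precheck

import (
	"fmt"
	"strings"

	"github.com/gobuffalo/flect"
	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
)

// validateCRDNames compares each CRD's metadata.name against the conventional
// <pluralized lowercase kind>.<group> name and collects warnings for mismatches.
func validateCRDNames(crds []apiextensionsv1.CustomResourceDefinition) []string {
	var warnings []string
	for _, crd := range crds {
		expected := fmt.Sprintf("%s.%s",
			flect.Pluralize(strings.ToLower(crd.Spec.Names.Kind)), crd.Spec.Group)
		if crd.Name != expected {
			warnings = append(warnings, fmt.Sprintf(
				"CRD %q does not follow the Cluster API naming convention (expected %q); "+
					"this is only a problem if the CRD is referenced by Cluster API objects, "+
					"and support for such names will be dropped in a future release",
				crd.Name, expected))
		}
	}
	return warnings
}
```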

@fabriziopandini WDYT?

@fabriziopandini
Member

What about validating all CRDs except the ones with a well-known annotation, so provider authors can "silence" the warning without sending a PR to CAPI?

note: I think this is acceptable because the research done on all the listed providers provided evidence that very few CRDs (only one) are not compliant with the rule

@sbueringer
Member Author

> What about validating all CRDs except the ones with a well-known annotation, so provider authors can "silence" the warning without sending a PR to CAPI?

Sure, can do.
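Extending the pre-check sketch above, the exception could look roughly like this (the annotation key is purely hypothetical and would need to be agreed on; apiextensionsv1 is the same import as in the earlier sketch):

```go
// skipCRDNameCheckAnnotation is a hypothetical annotation key provider authors
// could set on a CRD to silence the naming warning.
const skipCRDNameCheckAnnotation = "clusterctl.cluster.x-k8s.io/skip-crd-name-check"

// shouldSkipCRDNameCheck reports whether the naming warning should be
// suppressed because the provider explicitly opted out for this CRD.
func shouldSkipCRDNameCheck(crd apiextensionsv1.CustomResourceDefinition) bool {
	_, ok := crd.GetAnnotations()[skipCRDNameCheckAnnotation]
	return ok
}
```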

@fabriziopandini
Member

/triage accepted
