
A single federated cluster can stop propagation of a type for all clusters if it does not have the specified resource version. #1241

Closed
dangorst1066 opened this issue Jun 30, 2020 · 19 comments
Labels: kind/bug, lifecycle/rotten

Comments

@dangorst1066

A single federated cluster can stop propagation of a type for all clusters if it does not have a particular resource version.

And a question: are there any good strategies for handling cluster estates that could have multiple versions of a resource in circulation (e.g. v1beta1 and v1 CRDs)?

Editing the target type version in the federated type config to v1beta1 (the lowest common denominator) appears to work around this ok (tbc), but it is still worrying that a single cluster can stop all federation from working; this does not seem like it should be the expected behaviour.
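For reference, the workaround amounts to repointing the federated type at the older API version. A rough sketch of doing that with a patch instead of a manual edit (the FederatedTypeConfig name and the kube-federation-system namespace are the defaults that kubefedctl enable uses; adjust for your install):

# Sketch only: switch the target type for CRD federation to v1beta1, the
# lowest common denominator served by both 1.15 and 1.16 clusters.
# Equivalent to `kubectl edit` of the FederatedTypeConfig and changing
# spec.targetType.version by hand.
kubectl -n kube-federation-system patch federatedtypeconfig \
  customresourcedefinitions.apiextensions.k8s.io \
  --type=merge \
  -p '{"spec":{"targetType":{"version":"v1beta1"}}}'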

What happened:

Ran a federation control plane at Kubernetes 1.16
Enabled federation of CRDs (v1)
Joined another 1.16 cluster and confirmed that CRDs and CRs of that type were being propagated ok
Joined a 1.15 cluster; CRDs and CRs were not propagated to the 1.15 cluster (whose CRDs are served at v1beta1), and all propagation of CRDs and CRs of the same type stopped working for the 1.16 cluster as well

Logs for the controller manager show messages like:

E0630 07:13:47.048845       1 reflector.go:153] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:105: Failed to list apiextensions.k8s.io/v1, Kind=CustomResourceDefinition: the server could not find the requested resource
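For context, that failed list is against apiextensions.k8s.io/v1, which a 1.15 cluster simply does not serve. A quick way to confirm what each member cluster serves (the context names below are placeholders):

# 1.16 clusters serve CRDs at both v1 and v1beta1; 1.15 clusters only at v1beta1
kubectl --context cluster-1-16 api-versions | grep apiextensions
# apiextensions.k8s.io/v1
# apiextensions.k8s.io/v1beta1
kubectl --context cluster-1-15 api-versions | grep apiextensions
# apiextensions.k8s.io/v1beta1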

What you expected to happen:

I expected v1 CRDs not to propagate to the 1.15 cluster; however, I did not expect propagation of all CRDs to all clusters to stop working.

How to reproduce it (as minimally and precisely as possible):

Run a federation control plane at Kubernetes 1.16+
Enable federation of v1 CRDs
Create a federated CRD, and a CR of that type with placement that matches all clusters
Join another 1.16 cluster and confirm the CRD and CR are propagated ok
Join a 1.15 cluster; expect the CRD and CR not to be propagated there
Create a new federated CRD, or a CR of the original type; these should still be propagated to the 1.16 cluster, but I have observed they are not
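A rough command-line version of those steps (a sketch only: the context names host, cluster-1-16 and cluster-1-15 are placeholders, crontab-crd.yaml stands in for any apiextensions.k8s.io/v1 CRD manifest, and crontabs.stable.example.com is the example CRD discussed later in this thread):

# 1. Enable federation of CRDs on the host cluster (which serves apiextensions.k8s.io/v1)
kubefedctl enable customresourcedefinitions

# 2. Create a v1 CRD and federate it; a CR of the new type, with placement
#    matching all clusters, would be created and federated in the same way
kubectl apply -f crontab-crd.yaml
kubefedctl federate crd crontabs.stable.example.com

# 3. Join a second 1.16 cluster - the CRD and CR propagate as expected
kubefedctl join cluster-1-16 --cluster-context cluster-1-16 --host-cluster-context host

# 4. Join a 1.15 cluster - from this point, propagation of CRDs (and CRs of
#    that type) stops for every cluster, not just the 1.15 one
kubefedctl join cluster-1-15 --cluster-context cluster-1-15 --host-cluster-context host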

Anything else we need to know?:

Environment:

  • Kubernetes version: 1.16 for the fed control plane, 1.15 or lower for one or more federated clusters
  • KubeFed version: 0.3.0
  • Scope of installation: Cluster
  • AWS/EKS

/kind bug

@k8s-ci-robot added the kind/bug label Jun 30, 2020
@RainbowMango
Contributor

@dgorst Thanks for your feedback.
Let me reproduce it locally and then get back to you.

@RainbowMango
Contributor

RainbowMango commented Jul 1, 2020

@dgorst
Could you please help confirm whether these are the minimal steps for reproducing?

Prepare clusters:

[root@ecs-d8b6 kubefed]# kubectl -n kube-federation-system get kubefedclusters
NAME       AGE     READY
cluster1   9d      True // v1.17.4  (apiextensions.k8s.io/v1)  `this is the host cluster`
cluster2   9d      True // v1.17.4 (apiextensions.k8s.io/v1)
cluster3   3h10m   True // v1.15.0 (apiextensions.k8s.io/`v1beta1`)

Operation Steps:

  • create a CRD, such as crontabs.stable.example.com, whose apiVersion is apiextensions.k8s.io/v1 (an example manifest is sketched below)
  • enable federation of CRDs with the command: kubefedctl enable customresourcedefinitions
  • federate the CRD with the command: kubefedctl federate crd crontabs.stable.example.com
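For completeness, here is a minimal apiextensions.k8s.io/v1 manifest along the lines of that CRD, applied to the host cluster (a sketch based on the upstream CronTab example, not necessarily exactly what was used here):

kubectl --context cluster1 apply -f - <<EOF
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: crontabs.stable.example.com
spec:
  group: stable.example.com
  scope: Namespaced
  names:
    plural: crontabs
    singular: crontab
    kind: CronTab
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                cronSpec:
                  type: string
                image:
                  type: string
EOF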

Result:

[root@ecs-d8b6 kubefed]# kubectl get crds crontabs.stable.example.com --context cluster1
NAME                          CREATED AT
crontabs.stable.example.com   2020-07-01T12:50:31Z
[root@ecs-d8b6 kubefed]# kubectl get crds crontabs.stable.example.com --context cluster2
Error from server (NotFound): customresourcedefinitions.apiextensions.k8s.io "crontabs.stable.example.com" not found
[root@ecs-d8b6 kubefed]# kubectl get crds crontabs.stable.example.com --context cluster3
Error from server (NotFound): customresourcedefinitions.apiextensions.k8s.io "crontabs.stable.example.com" not found

You expected the CRD to be propagated to cluster2 while cluster3 is ignored, right?

@dangorst1066
Author

Yes exactly @RainbowMango 👍

It feels like the blast radius from a single (to be fair, misconfigured) cluster should not impact propagation to the good clusters. So in your example, yes: I don't expect a v1 CRD in cluster1 to be propagated to cluster3, but I would expect it to continue to be propagated to cluster2.

I mention a CR of the CRD's type because that also stops propagating at the point the 1.15 cluster is joined. But I guess it's the same issue (the CRD doesn't get propagated because the controller can't list v1 CRDs, so it can't list resources of that type either).

@RainbowMango
Contributor

@dgorst
I did some investigation and found that the FederatedCustomResourceDefinition sync controller is completely blocked because one of the informers can't finish its sync process.

The following check keeps failing.
https://github.com/kubernetes-sigs/kubefed/blob/bf67d02369e9b2d93281f8224747b94afab3170e/pkg/controller/sync/controller.go#L235-L238

I agree with you that the propagation process should ignore bad clusters.
Let's see how to solve this.

@dangorst1066
Author

Thanks @RainbowMango for recreating and confirming 👍

Happy to have a stab at resolving this if that would help? (Caveat: I'm new to the kubefed codebase, so I may need to reach out on Slack with some questions!)

@RainbowMango
Contributor

I've tried a workaround locally, but the community has discussed a better solution.

@hectorj2f @jimmidyson @irfanurrehman
Could you please take a look? Is the solution that changes FederatedTypeConfigStatus OK with you?

@irfanurrehman
Contributor

irfanurrehman commented Jul 15, 2020

@RainbowMango thanks for tracking this. IMO the solution proposed by pmorie in the link you mentioned is completely legit and can be implemented. As far as I understand, @font might not be available to complete it.
@dgorst are you up for taking this task up?

@RainbowMango
Contributor

Given the implementation is a little complicated (API change, controller adaptation, testing, etc.), I'd like to set up an umbrella issue, split this into several tasks, and then work through it iteratively. @dgorst you are welcome to pick up any of the items you are interested in.

What do you say, @irfanurrehman? And if it's OK with you, can you help review the follow-up PRs?

@irfanurrehman
Contributor

Awesome suggestion @RainbowMango. I can certainly review them.
If time permits, I will take up some tasks too.

@hectorj2f
Contributor

Thanks for taking care of this @RainbowMango. It sounds good to me too. Please share the action items so we can see whether we can help somehow.

@RainbowMango
Contributor

Just opened a draft issue, #1252. I have started some work locally, so I'll take the first task.
Thanks for your support @irfanurrehman @hectorj2f .

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Oct 14, 2020
@jimmidyson
Contributor

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Oct 14, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jan 12, 2021
@hectorj2f
Contributor

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Jan 12, 2021
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Apr 12, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label May 12, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
