Skip to content
This repository has been archived by the owner on Oct 28, 2024. It is now read-only.

🐛Fix VirtualCluster CRD and reconciler for work in 1.19+1.20 environment #316

Closed

Conversation

m-messiah
Copy link
Contributor

@m-messiah m-messiah commented Oct 5, 2022

What this PR does / why we need it:
When the super cluster has two different versions of API Server (specifically 1.19 and 1.20) it results in errors of vc-manager, as 1.20 added non-nullable nulls kubernetes/kubernetes#95423, while 1.19 can't work with them.

{"level":"error","ts":1664969234.2211943,"logger":"controller-runtime.manager.controller.virtualcluster","msg":"Reconciler error","reconciler group":"tenancy.x-k8s.io","reconciler kind":"VirtualCluster","name":"test","namespace":"default","error":"VirtualCluster.tenancy.x-k8s.io \"test\" is invalid: [status.reason: Invalid value: \"null\": status.reason in body must be of type string: \"null\", status.message: Invalid value: \"null\": status.message in body must be of type string: \"null\", status.phase: Invalid value: \"null\": status.phase in body must be of type string: \"null\"]"}

{"level":"error","ts":1664975149.3489153,"logger":"controller-runtime.manager.controller.virtualcluster","msg":"Reconciler error","reconciler group":"tenancy.x-k8s.io","reconciler kind":"VirtualCluster","name":"test","namespace":"default","error":"VirtualCluster.tenancy.x-k8s.io \"test\" is invalid: status.conditions.status: Invalid value: \"null\": status.conditions.status in body must be of type string: \"null\""}

So the first solution is to set LastTransitionTime to be a pointer (as metav1.Time is not nullable kubernetes/kubernetes#86811
And the second solution is to set status.condition.Status to non-empty string (as it expected initially). I set it always True, as we don't have false conditions, but we could improve it later.

With these two changes vc-manager became working again.

Which issue(s) this PR fixes:
The PR is related to #63, but I am not sure it could fix all the things there

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: m-messiah
Once this PR has been reviewed and has the lgtm label, please assign fei-guo for approval by writing /assign @fei-guo in a comment. For more information see:The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 5, 2022
@Fei-Guo
Copy link

Fei-Guo commented Oct 5, 2022

I am a little confused about the problematic version combination of vc, vc-manager, super. Can you provide the details of the following?

vc apiserver version: ?
vc-manager client-go version: ?
super apiserver version: ?

@m-messiah
Copy link
Contributor Author

vc apiserver version: does not matter. Happens with 1.19 and 1.20 as well
vc-manager client-go version: latest commit e55cc6f , v0.21.9 as I see in go.mod
super apiserver version: v1.19.13 + v1.20.15 on several nodes

So the main problem here is we have two apiserver versions in the super cluster during migration upgrade, that leads to incorrectly set CRs (by api 1.20), that are needed to be handled by 1.19.13 apiserver.

@christopherhein
Copy link
Contributor

christopherhein commented Oct 5, 2022

If we don't have mix super cluster apis does this issue still occur? Given this changes the expected types and those are exported this could cause other breakage for users that implement other reconcilers.

@Fei-Guo I assume this is the big concern, if you are using this resource when you update to the latest code it's going to require changing any out-of-tree usage of those status fields, right?

@Fei-Guo
Copy link

Fei-Guo commented Oct 5, 2022

@christopherhein Yes, this is crd field type change without bumping the version. I am not sure if existing CRs need to be converted to make controller or cr list/get work.

@m-messiah m-messiah marked this pull request as draft October 5, 2022 16:22
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 5, 2022
@m-messiah
Copy link
Contributor Author

We found that the problem was deeper than it looked, especially after additional checks that this PR changes did not help.
The root cause of the issue is described here kubernetes/kubernetes#99804 and fixed there kubernetes/kube-openapi#230 and it happened for our cluster where we had couple Kubernetes API Servers built with Golang 1.16, that failed the validation.

The fast and only solution is to rebuilt the server with Go 1.15, as only Kube APIServer 1.21.0+ has the bugfix for 1.16 compatibility included

@m-messiah m-messiah closed this Oct 6, 2022
@m-messiah m-messiah deleted the vc-crd-1-19-and-20 branch October 6, 2022 11:53
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants