
attempt to work around kube-vip rbac issues #299

Merged: 4 commits into main from fix/kube-vip-1.29 on Oct 28, 2024

Conversation

@glitchcrab (Member) commented Oct 25, 2024

towards https://github.com/giantswarm/giantswarm/issues/31986

Currently, kube-vip uses the admin.conf kubeconfig to interact with the API. As of Kubernetes 1.29, the behaviour of this kubeconfig has changed for security reasons.

What happens now is that the admin.conf credential has no ClusterRoleBinding (and hence no privileges) until after the API server has started and kubeadm has created the binding.

This leaves us with a chicken-and-egg problem: kube-vip uses this kubeconfig to talk to the API, and that kubeconfig has the kube-vip floating IP as the API address (which kube-vip needs to start in order to advertise). Because the kubeconfig no longer has any privileges at this point, kube-vip is unable to start, so the API IP never gets advertised and cluster bootstrapping fails.

The new behaviour does place a super-admin.conf kubeconfig on the first CP node to be bootstrapped, and this kubeconfig has cluster-admin permissions (system:masters). However, it is not advised to use this kubeconfig with any workloads, and it does not exist on the 2nd and 3rd CP nodes anyway.
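For context, this is roughly what the 1.29 arrangement looks like on a control plane node. A sketch based on upstream kubeadm behaviour, not on anything specific to this chart:

```sh
# admin.conf is now issued for the kubeadm:cluster-admins group instead of system:masters;
# its privileges come from a ClusterRoleBinding that kubeadm can only create once the API
# server is reachable, so the credential is useless during first-node bootstrap:
kubectl get clusterrolebinding kubeadm:cluster-admins -o wide

# super-admin.conf keeps system:masters (honoured by the API server without any RBAC
# objects) and is only written to the first control plane node:
ls -l /etc/kubernetes/super-admin.conf
```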

The (somewhat hacky) solution in this PR: we initially create the kube-vip static pod manifest with the super-admin kubeconfig, which allows bootstrapping of the first CP node to complete, and then use the post-kubeadm commands to revert the pod manifest to the admin.conf kubeconfig (see the sketch after the lists below). This method has different outcomes depending on which CP node it runs on.

first CP node

  • static pod is created and uses super-admin to bootstrap
  • sed command reverts the kubeconfig in the static pod manifest
  • pod gets recreated
  • everything continues

other CP nodes

  • static pod manifest is created but the kubelet cannot start the pod because the super-admin kubeconfig doesn't exist on the nodes
  • sed command corrects the kubeconfig to use the admin kubeconfig
  • pod gets created successfully
  • profit
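In practice the change boils down to two pieces, shown here as a simplified sketch (the real values live in the chart's kube-vip pod template and the KubeadmControlPlane spec; the rendered diff further down shows the same two changes):

```yaml
# kube-vip static pod manifest (excerpt): initially mount super-admin.conf
volumes:
  - name: kubeconfig
    hostPath:
      path: /etc/kubernetes/super-admin.conf # reverted to admin.conf by a postKubeadmCommand

# KubeadmControlPlane (excerpt): flip the manifest back once kubeadm has finished
postKubeadmCommands:
  - sed --in-place "s|/etc/kubernetes/super-admin.conf|/etc/kubernetes/admin.conf|g" /etc/kubernetes/manifests/kube-vip.yaml
```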

Yes, this is extremely hacky; however, there currently doesn't seem to be a better option until kube-vip comes up with a better idea.

see kube-vip/kube-vip#684 for discussion

This change successfully created a cluster when tested manually:

➜  KUBECONFIG=/tmp/wealdy.kubeconfig kg no
NAME                        STATUS   ROLES                  AGE     VERSION
wealdy-b6gf6                Ready    control-plane,master   90s     v1.29.10
wealdy-pw2nm                Ready    control-plane,master   30s     v1.29.10
wealdy-worker-xxg4f-2xb9k   Ready    worker                 2m47s   v1.29.10
wealdy-worker-xxg4f-9bvgp   Ready    worker                 2m47s   v1.29.10
wealdy-worker-xxg4f-mcflj   Ready    worker                 2m47s   v1.29.10
wealdy-ztwmh                Ready    control-plane,master   3m30s   v1.29.10

@glitchcrab self-assigned this Oct 25, 2024
@glitchcrab requested a review from a team October 25, 2024 13:25
@tinkerers-ci bot commented Oct 25, 2024

Note

As this is a draft PR no triggers from the PR body will be handled.

If you'd like to trigger them while draft please add them as a PR comment.

@glitchcrab (Member, Author)

/run cluster-test-suites

@tinkerers-ci bot commented Oct 25, 2024

cluster-test-suites

Run name pr-cluster-vsphere-299-cluster-test-suiteskd5gz
Commit SHA 5167b26
Result Failed ❌

📋 View full results in Tekton Dashboard

Rerun trigger:
/run cluster-test-suites


Tip

To only re-run the failed test suites you can provide a TARGET_SUITES parameter with your trigger that points to the directory path of the test suites to run, e.g. /run cluster-test-suites TARGET_SUITES=./providers/capa/standard to re-run the CAPA standard test suite. This supports multiple test suites with each path separated by a comma.

@glitchcrab marked this pull request as ready for review October 25, 2024 19:54
@glitchcrab requested review from njuettner and a team October 25, 2024 20:08
@Gacko (Member) left a comment

LGTM!

I don't think there's anything like yq installed in our images, so sed is probably the best we can do. Could we then at least make sure to only replace this path in the hostPath volume section? Maybe just including path: in the sed would already make it a bit safer. 🙂

@glitchcrab (Member, Author)

> LGTM!
>
> I don't think there's anything like yq installed in our images, so sed is probably the best we can do. Could we then at least make sure to only replace this path in the hostPath volume section? Maybe just including path: in the sed would already make it a bit safer. 🙂

Yep, I can't use yq (I checked). This sed is safe because there is only one occurrence of super-admin.conf in the file.
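That said, a more targeted variant along the lines suggested above could anchor on the path: key so that only the hostPath line can ever match (hypothetical, not what was merged):

```sh
sed --in-place 's|path: /etc/kubernetes/super-admin.conf|path: /etc/kubernetes/admin.conf|' \
  /etc/kubernetes/manifests/kube-vip.yaml
```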

@glitchcrab (Member, Author)

/run cluster-test-suites

Contributor

There were differences in the rendered Helm template, please check! ⚠️

Output
=== Differences when rendered with values file helm/cluster-vsphere/ci/test-wc-values.yaml ===

/stringData/content  (v1/Secret/release-name-kubevip-pod)
  ± value change in multiline text (one insert, one deletion)
    -       path: /etc/kubernetes/admin.conf
    +       path: /etc/kubernetes/super-admin.conf # this is reverted to admin.conf by a postKubeadmCommand
  
  

/spec/kubeadmConfigSpec/postKubeadmCommands  (controlplane.cluster.x-k8s.io/v1beta1/KubeadmControlPlane/org-giantswarm/test)
  + one list entry added:
    - "sed --in-place "s|/etc/kubernetes/super-admin.conf|/etc/kubernetes/admin.conf|g" /etc/kubernetes/manifests/kube-vip.yaml"
    
  

@tinkerers-ci bot commented Oct 27, 2024

cluster-test-suites

Run name pr-cluster-vsphere-299-cluster-test-suitesx6ftp
Commit SHA 5167b26
Result Failed ❌

📋 View full results in Tekton Dashboard

Rerun trigger:
/run cluster-test-suites



@Gacko (Member) commented Oct 27, 2024

Alright! Shout if I can help you with debugging the tests.

@vxav (Contributor) left a comment

LGTMIIW

@Gacko (Member) commented Oct 28, 2024

/run cluster-test-suites TARGET_SUITES=./providers/capv/standard

@Gacko (Member) commented Oct 28, 2024

I'm curious if the tests fail because of the recently released CAPV v28.0.1...

@tinkerers-ci bot commented Oct 28, 2024

cluster-test-suites

Run name pr-cluster-vsphere-299-cluster-test-suitesldrpj
Commit SHA 065cb71
Result Failed ❌

📋 View full results in Tekton Dashboard

Rerun trigger:
/run cluster-test-suites



@glitchcrab (Member, Author)

/run cluster-test-suites TARGET_SUITES=./providers/capv/standard E2E_WC_KEEP=true

@AverageMarcus (Member)

> E2E_WC_KEEP=true

That doesn't work from CI

@glitchcrab (Member, Author)

/run cluster-test-suites TARGET_SUITES=./providers/capv/standard

@glitchcrab (Member, Author) commented Oct 28, 2024

Right, this PR isn't testable using CI because it's trying to deploy a cluster using Kubernetes 1.28.15, and this won't work because the super-admin kubeconfig was only introduced in 1.29. The file exists on disk on a test node:

t-gw82ikvqlskuhcqksj-j2wkp /etc/kubernetes # ls -al super-admin.conf
-rw-r--r--. 1 root root 0 Oct 28 11:27 super-admin.conf

but this is misleading: the file is being created by the kube-vip pod manifest's volume configuration, so it is empty, which causes kube-vip to crash:

t-gw82ikvqlskuhcqksj-j2wkp /etc/kubernetes/manifests # crictl logs c81ba23d742df
time="2024-10-28T11:28:08Z" level=info msg="Starting kube-vip.io [v0.8.3]"
time="2024-10-28T11:28:08Z" level=info msg="namespace [kube-system], Mode: [ARP], Features(s): Control Plane:[true], Services:[false]"
time="2024-10-28T11:28:08Z" level=info msg="prometheus HTTP server started"
time="2024-10-28T11:28:08Z" level=warning msg="Node name is missing from the config, fall back to hostname"
time="2024-10-28T11:28:08Z" level=info msg="Using node name [t-gw82ikvqlskuhcqksj-j2wkp]"
panic: invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable

goroutine 1 [running]:
github.com/kube-vip/kube-vip/pkg/k8s.newClientset({0x203594c?, 0xd?}, 0x78?, {0xc00055c090, 0xf}, 0x1?)
	/src/pkg/k8s/client.go:27 +0xe9
github.com/kube-vip/kube-vip/pkg/k8s.NewClientset(...)
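The 0-byte file is consistent with the kubeconfig being mounted via a hostPath volume that creates the file when it is missing. A minimal sketch, assuming the pod template uses type: FileOrCreate, in which case the kubelet creates an empty file at the path on 1.28 nodes where kubeadm never writes a super-admin.conf:

```yaml
volumes:
  - name: kubeconfig
    hostPath:
      path: /etc/kubernetes/super-admin.conf
      type: FileOrCreate # assumption: this is what produces the empty file seen above
```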

@tinkerers-ci bot commented Oct 28, 2024

cluster-test-suites

Run name pr-cluster-vsphere-299-cluster-test-suites4kmcf
Commit SHA 065cb71
Result Failed ❌

📋 View full results in Tekton Dashboard

Rerun trigger:
/run cluster-test-suites



@glitchcrab added the skip/ci label (Instructs pr-gatekeeper to ignore any required PR checks) on Oct 28, 2024
@Gacko (Member) commented Oct 28, 2024

But how did the upgrade test work then? Or did it even replace the nodes, or did it just go on without? 😬

@Gacko (Member) commented Oct 28, 2024

You can easily test this PR from the open CAPV v29.0.0 PR. 🙂

@glitchcrab (Member, Author)

> But how did the upgrade test work then? Or did it even replace the nodes, or did it just go on without? 😬

I would hazard a guess that it's because the new node is joined to an existing node, so the chicken-and-egg problem doesn't exist. The problem this PR addresses is cluster bootstrapping when the first node in a new cluster comes up: kube-vip needs to talk to the API server to coordinate leader election and bring up the VIP, but because admin.conf no longer has the required permissions, it fails due to the circular dependency.

In an upgrade situation the cluster is already bootstrapped, so the behaviour introduced by this PR will look more like the second situation because the new node won't even have the super-admin.conf file. Upgrading the CP of a single-CP-node cluster is essentially just adding a new CP node and then taking the old one away, so this problem does not manifest.

@glitchcrab (Member, Author)

Tests passed after I abused the v29 release PR: giantswarm/releases#1459 (comment)

@Gacko (Member) commented Oct 28, 2024

Nice! So we can theoretically merge this, get another patch release and carry on releasing CAPV v29.0.0 with it, right?

@glitchcrab (Member, Author)

Yep, I'm going to merge this in and release it so as to unblock the release.

@glitchcrab merged commit 0a17293 into main Oct 28, 2024
13 of 14 checks passed
@glitchcrab deleted the fix/kube-vip-1.29 branch October 28, 2024 20:11