
attempt to work around kube-vip rbac issues #299

Merged: 4 commits into main from fix/kube-vip-1.29 on Oct 28, 2024

Conversation

@glitchcrab (Member) commented Oct 25, 2024

towards https://github.com/giantswarm/giantswarm/issues/31986

Currently, kube-vip uses the admin.conf kubeconfig to interact with the API. As of Kubernetes 1.29, the behaviour of this kubeconfig has changed for security reasons.

What happens now is that the admin.conf credential has no ClusterRoleBinding (and hence no privileges) until after the API server has started and kubeadm has created the binding.

This leaves us with a chicken-and-egg problem: kube-vip uses this kubeconfig to talk to the API, and that kubeconfig has the kube-vip floating IP as the API address (which kube-vip needs to start in order to advertise). Because the kubeconfig no longer has any privileges at this point, kube-vip is unable to start, so the API IP never gets advertised and cluster bootstrapping fails.

The new behaviour does place a super-admin.conf kubeconfig on the first CP node to be bootstrapped, and this kubeconfig has cluster-admin permissions (system:masters). However, it is not advised to use this kubeconfig with any workloads, and it does not exist on the 2nd and 3rd CP nodes anyway.
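For context, this is roughly what the 1.29 arrangement looks like on a control plane node. A sketch based on upstream kubeadm behaviour, not on anything specific to this chart:

```sh
# admin.conf is now issued for the kubeadm:cluster-admins group instead of system:masters;
# its privileges come from a ClusterRoleBinding that kubeadm can only create once the API
# server is reachable, so the credential is useless during first-node bootstrap:
kubectl get clusterrolebinding kubeadm:cluster-admins -o wide

# super-admin.conf keeps system:masters (honoured by the API server without any RBAC
# objects) and is only written to the first control plane node:
ls -l /etc/kubernetes/super-admin.conf
```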

The (somewhat hacky) solution in this PR: we initially create the kube-vip static pod manifest with the super-admin kubeconfig, which allows bootstrapping of the first CP node to complete, and then use the post-kubeadm commands to revert the pod manifest to the admin.conf kubeconfig (see the sketch after the lists below). This method has different outcomes depending on which CP node it runs on.

first CP node

  • static pod is created and uses super-admin to bootstrap
  • sed command reverts the kubeconfig in the static pod manifest
  • pod gets recreated
  • everything continues

other CP nodes

  • static pod manifest is created but the kubelet cannot start the pod because the super-admin kubeconfig doesn't exist on the nodes
  • sed command corrects the kubeconfig to use the admin kubeconfig
  • pod gets created successfully
  • profit
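In practice the change boils down to two pieces, shown here as a simplified sketch (the real values live in the chart's kube-vip pod template and the KubeadmControlPlane spec; the rendered diff further down shows the same two changes):

```yaml
# kube-vip static pod manifest (excerpt): initially mount super-admin.conf
volumes:
  - name: kubeconfig
    hostPath:
      path: /etc/kubernetes/super-admin.conf # reverted to admin.conf by a postKubeadmCommand

# KubeadmControlPlane (excerpt): flip the manifest back once kubeadm has finished
postKubeadmCommands:
  - sed --in-place "s|/etc/kubernetes/super-admin.conf|/etc/kubernetes/admin.conf|g" /etc/kubernetes/manifests/kube-vip.yaml
```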

Yes, this is extremely hacky; however, there currently doesn't seem to be a better option until kube-vip comes up with a better idea.

see kube-vip/kube-vip#684 for discussion

This change successfully created a cluster when tested manually:

➜  KUBECONFIG=/tmp/wealdy.kubeconfig kg no
NAME                        STATUS   ROLES                  AGE     VERSION
wealdy-b6gf6                Ready    control-plane,master   90s     v1.29.10
wealdy-pw2nm                Ready    control-plane,master   30s     v1.29.10
wealdy-worker-xxg4f-2xb9k   Ready    worker                 2m47s   v1.29.10
wealdy-worker-xxg4f-9bvgp   Ready    worker                 2m47s   v1.29.10
wealdy-worker-xxg4f-mcflj   Ready    worker                 2m47s   v1.29.10
wealdy-ztwmh                Ready    control-plane,master   3m30s   v1.29.10

@glitchcrab self-assigned this Oct 25, 2024
@glitchcrab requested a review from a team October 25, 2024 13:25
@tinkerers-ci bot commented Oct 25, 2024

Note

As this is a draft PR no triggers from the PR body will be handled.

If you'd like to trigger them while draft please add them as a PR comment.

@glitchcrab (Member, Author)

/run cluster-test-suites

@tinkerers-ci bot commented Oct 25, 2024

cluster-test-suites

Run name pr-cluster-vsphere-299-cluster-test-suiteskd5gz
Commit SHA 5167b26
Result Failed ❌

📋 View full results in Tekton Dashboard

Rerun trigger:
/run cluster-test-suites


Tip

To only re-run the failed test suites you can provide a TARGET_SUITES parameter with your trigger that points to the directory path of the test suites to run, e.g. /run cluster-test-suites TARGET_SUITES=./providers/capa/standard to re-run the CAPA standard test suite. This supports multiple test suites with each path separated by a comma.

@glitchcrab marked this pull request as ready for review October 25, 2024 19:54
@glitchcrab requested review from njuettner and a team October 25, 2024 20:08
@Gacko (Member) left a comment

LGTM!

I don't think there's anything like yq installed in our images, so sed is probably the best we can do. Could we then at least make sure to only replace this path in the hostPath volume section? Maybe just including path: in the sed would already make it a bit safer. 🙂

@glitchcrab (Member, Author)

> LGTM!
>
> I don't think there's anything like yq installed in our images, so sed is probably the best we can do. Could we then at least make sure to only replace this path in the hostPath volume section? Maybe just including path: in the sed would already make it a bit safer. 🙂

Yep, I can't use yq (I checked). This sed is safe because there is only one occurrence of super-admin.conf in the file.
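That said, a more targeted variant along the lines suggested above could anchor on the path: key so that only the hostPath line can ever match (hypothetical, not what was merged):

```sh
sed --in-place 's|path: /etc/kubernetes/super-admin.conf|path: /etc/kubernetes/admin.conf|' \
  /etc/kubernetes/manifests/kube-vip.yaml
```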

@glitchcrab (Member, Author)

/run cluster-test-suites

Contributor

There were differences in the rendered Helm template, please check! ⚠️

Output
=== Differences when rendered with values file helm/cluster-vsphere/ci/test-wc-values.yaml ===

/stringData/content  (v1/Secret/release-name-kubevip-pod)
  ± value change in multiline text (one insert, one deletion)
    -       path: /etc/kubernetes/admin.conf
    +       path: /etc/kubernetes/super-admin.conf # this is reverted to admin.conf by a postKubeadmCommand
  
  

/spec/kubeadmConfigSpec/postKubeadmCommands  (controlplane.cluster.x-k8s.io/v1beta1/KubeadmControlPlane/org-giantswarm/test)
  + one list entry added:
    - "sed --in-place "s|/etc/kubernetes/super-admin.conf|/etc/kubernetes/admin.conf|g" /etc/kubernetes/manifests/kube-vip.yaml"
    
  

@tinkerers-ci bot commented Oct 27, 2024

cluster-test-suites

Run name pr-cluster-vsphere-299-cluster-test-suitesx6ftp
Commit SHA 5167b26
Result Failed ❌

📋 View full results in Tekton Dashboard

Rerun trigger:
/run cluster-test-suites



@Gacko (Member) commented Oct 27, 2024

Alright! Shout if I can help you with debugging the tests.

@vxav (Contributor) left a comment

LGTMIIW

@Gacko (Member) commented Oct 28, 2024

/run cluster-test-suites TARGET_SUITES=./providers/capv/standard

@Gacko (Member) commented Oct 28, 2024

I'm curious if the tests fail because of the recently released CAPV v28.0.1...

@tinkerers-ci bot commented Oct 28, 2024

cluster-test-suites

Run name pr-cluster-vsphere-299-cluster-test-suitesldrpj
Commit SHA 065cb71
Result Failed ❌

📋 View full results in Tekton Dashboard

Rerun trigger:
/run cluster-test-suites



@glitchcrab (Member, Author)

/run cluster-test-suites TARGET_SUITES=./providers/capv/standard E2E_WC_KEEP=true

@AverageMarcus (Member)

> E2E_WC_KEEP=true

That doesn't work from CI

@glitchcrab (Member, Author)

/run cluster-test-suites TARGET_SUITES=./providers/capv/standard

@glitchcrab (Member, Author) commented Oct 28, 2024

Right, this PR isn't testable using CI because it's trying to deploy a cluster using Kubernetes 1.28.15, and this won't work because the super-admin kubeconfig was only introduced in 1.29. The file exists on disk on a test node:

t-gw82ikvqlskuhcqksj-j2wkp /etc/kubernetes # ls -al super-admin.conf
-rw-r--r--. 1 root root 0 Oct 28 11:27 super-admin.conf

but this is misleading: the file is being created by the kube-vip pod manifest's volume configuration, so it is empty, which causes kube-vip to crash:

t-gw82ikvqlskuhcqksj-j2wkp /etc/kubernetes/manifests # crictl logs c81ba23d742df
time="2024-10-28T11:28:08Z" level=info msg="Starting kube-vip.io [v0.8.3]"
time="2024-10-28T11:28:08Z" level=info msg="namespace [kube-system], Mode: [ARP], Features(s): Control Plane:[true], Services:[false]"
time="2024-10-28T11:28:08Z" level=info msg="prometheus HTTP server started"
time="2024-10-28T11:28:08Z" level=warning msg="Node name is missing from the config, fall back to hostname"
time="2024-10-28T11:28:08Z" level=info msg="Using node name [t-gw82ikvqlskuhcqksj-j2wkp]"
panic: invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable

goroutine 1 [running]:
github.com/kube-vip/kube-vip/pkg/k8s.newClientset({0x203594c?, 0xd?}, 0x78?, {0xc00055c090, 0xf}, 0x1?)
	/src/pkg/k8s/client.go:27 +0xe9
github.com/kube-vip/kube-vip/pkg/k8s.NewClientset(...)
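The 0-byte file is consistent with the kubeconfig being mounted via a hostPath volume that creates the file when it is missing. A minimal sketch, assuming the pod template uses type: FileOrCreate, in which case the kubelet creates an empty file at the path on 1.28 nodes where kubeadm never writes a super-admin.conf:

```yaml
volumes:
  - name: kubeconfig
    hostPath:
      path: /etc/kubernetes/super-admin.conf
      type: FileOrCreate # assumption: this is what produces the empty file seen above
```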

@tinkerers-ci bot commented Oct 28, 2024

cluster-test-suites

Run name pr-cluster-vsphere-299-cluster-test-suites4kmcf
Commit SHA 065cb71
Result Failed ❌

📋 View full results in Tekton Dashboard

Rerun trigger:
/run cluster-test-suites



@glitchcrab added the skip/ci label (Instructs pr-gatekeeper to ignore any required PR checks) on Oct 28, 2024
@Gacko (Member) commented Oct 28, 2024

But how did the upgrade test work then? Or did it even replace the nodes, or did it just go on without? 😬

@Gacko (Member) commented Oct 28, 2024

You can easily test this PR from the open CAPV v29.0.0 PR. 🙂

@glitchcrab (Member, Author)

> But how did the upgrade test work then? Or did it even replace the nodes, or did it just go on without? 😬

I would hazard a guess that it's because the new node is joined to an existing node, so the chicken-and-egg problem doesn't exist. The problem this PR addresses is cluster bootstrapping when the first node in a new cluster comes up: kube-vip needs to talk to the API server to coordinate leader election and bring up the VIP, but because admin.conf no longer has the required permissions, it fails due to the circular dependency.

In an upgrade situation the cluster is already bootstrapped, so the behaviour introduced by this PR will look more like the second situation because the new node won't even have the super-admin.conf file. Upgrading the CP of a single-CP-node cluster is essentially just adding a new CP node and then taking the old one away, so this problem does not manifest.

@glitchcrab (Member, Author)

Tests passed after I abused the v29 release PR: giantswarm/releases#1459 (comment)

@Gacko (Member) commented Oct 28, 2024

Nice! So we can theoretically merge this, get another patch release and carry on releasing CAPV v29.0.0 with it, right?

@glitchcrab (Member, Author)

Yep, I'm going to merge this in and release it so as to unblock the release.

@glitchcrab merged commit 0a17293 into main Oct 28, 2024
13 of 14 checks passed
@glitchcrab deleted the fix/kube-vip-1.29 branch October 28, 2024 20:11