Add CloudInit controller #15

Merged: 1 commit merged into harvester:master from connorkuehl:3902-v2 on Jan 23, 2024

Conversation

@connorkuehl commented Sep 6, 2023

Depends on: #14


Problem:

A Harvester cluster operator can currently modify node configuration by SSH'ing into the node and editing the relevant Elemental cloud-init configuration file under /oem. However, this is incompatible with a GitOps approach to managing the state of the cluster, at least not without the operator writing some code of their own.

Solution:

This PR adds a new CloudInit CRD, a controller, and an fsnotify watcher to harvester-node-manager. This allows cluster operators to layer cloud-init configuration on top of a node's existing configuration and to target specific nodes in the cluster with the matchSelector field in the CloudInit spec.

For example, to add SSH access for the rancher user with my SSH key across every node in the cluster, I can represent this like so:

apiVersion: node.harvesterhci.io/v1beta1
kind: CloudInit
metadata:
  name: ssh-access
spec:
  matchSelector: {}
  filename: 99_ssh.yaml
  contents: |
    stages:
      network:
        - authorized_keys:
            rancher:
              - ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIPCUNsQEnKj0nl1GS07Qr5RDCCbCim4wu06hCzQZDmTk ckuehl@suselaptop

Related Issue: harvester/harvester#3902

Test plan:

Much of the test plan appears in this PR as automated unit tests.

However, to confirm operation on a multi-node cluster, I ran these manual steps:

  1. Apply the CRD (kubectl apply -f ./manifests/crds/node.harvesterhci.io_cloudinits.yaml)
  2. Import a container built from this PR onto all nodes in the cluster (ctr -n=k8s.io image import /home/rancher/harvester-node-manager.tar)
  3. Patch the Harvester ManagedChart (kubectl patch managedchart harvester -n fleet-local -p='{"spec":{"paused":true}}' --type=merge)
  4. Edit the harvester-node-manager daemonset to use the container imported in step 2 (kubectl edit daemonset/harvester-node-manager -n harvester-system)
  5. Rollout the new container (kubectl rollout restart daemonset/harvester-node-manager -n harvester-system)
  6. Wait for that to finish
  7. Apply a CloudInit object (like the one listed above)
  8. SSH in, confirm the file has been created under /oem.

@connorkuehl (Author)

I'm wondering if it's worth writing a follow-up PR to add an admission controller that ensures the document is parsable by yip, as well as protecting certain files like elemental.config, grubenv, harvester.config, and install.
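For illustration, a rough sketch of the kind of check such a webhook could perform; the ValidateCloudInit helper and protectedFiles list are hypothetical, and a real implementation would parse the contents with yip's schema loader rather than a generic YAML unmarshal:

package main

import (
	"fmt"

	"gopkg.in/yaml.v3"
)

// Files under /oem that the hypothetical webhook would refuse to manage.
var protectedFiles = map[string]bool{
	"elemental.config": true,
	"grubenv":          true,
	"harvester.config": true,
	"install":          true,
}

// ValidateCloudInit rejects protected filenames and contents that are not
// even well-formed YAML. (A real admission webhook would go further and
// load the document with yip's parser.)
func ValidateCloudInit(filename, contents string) error {
	if protectedFiles[filename] {
		return fmt.Errorf("filename %q is protected and cannot be managed by a CloudInit resource", filename)
	}
	var doc map[string]interface{}
	if err := yaml.Unmarshal([]byte(contents), &doc); err != nil {
		return fmt.Errorf("contents are not valid YAML: %w", err)
	}
	return nil
}

func main() {
	fmt.Println(ValidateCloudInit("grubenv", "stages: {}"))              // rejected
	fmt.Println(ValidateCloudInit("99_ssh.yaml", "stages:\n  network: []\n")) // accepted
}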

@connorkuehl (Author)

ping

@Vicente-Cheng (Contributor)

Hi @connorkuehl, I will take a look at this PR next week. Sorry for the late update.

@connorkuehl (Author)

ping

@connorkuehl (Author)

Thanks for the review, @Vicente-Cheng! 🙂 For the comments I have left open and haven't responded to, I am working on addressing them in the next revision.

@connorkuehl (Author)

ping

@Vicente-Cheng (Contributor) left a comment

overall lgtm, but we still need to discuss the OnRemove part.

@ibrokethecloud (Contributor)

@Vicente-Cheng @connorkuehl I don't think we should be second-guessing the rollback by performing some operations during OnRemove.

AFAIK, apart from the chroot stage, all commands are applied on top of the base image but not persisted to the image itself; they just get re-applied on each boot. So removing the file should roll back the effect of those changes?

We should advise users that to roll back the effect of earlier changes, they will need to manually submit another CRD which contains steps to reverse the original behavior.

@connorkuehl (Author)

> We should advise users that to roll back the effect of earlier changes, they will need to manually submit another CRD which contains steps to reverse the original behavior.

@ibrokethecloud @Vicente-Cheng The controller doesn't automatically apply the changes; right now it assumes the user will reboot the node after making changes to the CRD, so under that model just removing the CRD and rebooting the node should be sufficient, right?

@w13915984028 (Member) left a comment

Thanks for adding this new feature.

A few questions / suggestions:

(1) Add an EPIC that includes:
- A HEP for the new CRD & controller describing the user story, specification, test plan, etc.
- Updates to the Harvester docs about feature usage, troubleshooting, etc.
- Potential UI enhancements
- A possible webhook to check the file name / format

(2) These internal configuration files are very important for the Harvester cluster to work correctly; malformed files may cause the cluster to fail to start. This feature officially exposes those files for manipulation, so it is important to validate the exposed file path, name, and contents. As a first stage, allowing only specific files would be more secure.

Quoted code under review:

if err != nil {
    return err
}
defer os.RemoveAll(tempFile.Name())
A reviewer (Member) asked:

Should those two still be deferred after the os.Rename? That is, should it be tempFile.Close() and then os.Rename?

@connorkuehl (Author) replied:

It is harmless during the success path and it helps tidy things up if the io.Copy fails.
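For readers following along, a self-contained sketch of this temp-file-then-rename pattern (names and paths are illustrative, not the PR's exact code): on the success path the deferred Close and RemoveAll are harmless no-ops, and they clean up the temporary file if io.Copy or the explicit Close fails.

package main

import (
	"io"
	"os"
	"path/filepath"
	"strings"
)

// writeAtomically writes src to dst via a temporary file plus rename, so
// readers never observe a partially written file.
func writeAtomically(dst string, src io.Reader) error {
	tempFile, err := os.CreateTemp(filepath.Dir(dst), ".cloudinit-*")
	if err != nil {
		return err
	}
	defer os.RemoveAll(tempFile.Name()) // no-op once the rename has succeeded
	defer tempFile.Close()              // harmless after the explicit Close below

	if _, err := io.Copy(tempFile, src); err != nil {
		return err
	}
	if err := tempFile.Close(); err != nil {
		return err
	}
	return os.Rename(tempFile.Name(), dst)
}

func main() {
	_ = writeAtomically("/tmp/example.yaml", strings.NewReader("stages: {}\n"))
}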

@Vicente-Cheng (Contributor)

> @ibrokethecloud @Vicente-Cheng The controller doesn't automatically apply the changes; right now it assumes the user will reboot the node after making changes to the CRD, so under that model just removing the CRD and rebooting the node should be sufficient, right?

Hi @connorkuehl,
Yes, I thought that would be enough. Will this improvement have a UI part? If so, maybe we need to add some tips for when the user deletes the cloud-init config.

@ibrokethecloud (Contributor)

Other than the minor feedback, the changes work as expected. I was able to deploy a custom cloud-init to a 2-node cluster.

sample cloud-init is as follows:

apiVersion: node.harvesterhci.io/v1beta1
kind: CloudInit
metadata:
  name: sample-cloud-init
spec:
  contents: |
    #cloud-config
    write_files:
    - encoding: b64
      content: CiMgVGhpcyBmaWxlIGNvbnRyb2xzIHRoZSBzdGF0ZSBvZiBTRUxpbnV4
      path: /tmp/content
      permissions: "0644"
  matchSelector:
    "kubernetes.io/hostname": "node2"
  filename: "write-sample.yml"

k get nodes
NAME    STATUS   ROLES                       AGE     VERSION
node1   Ready    control-plane,etcd,master   3d20h   v1.26.11+rke2r1
node2   Ready    <none>                      3d20h   v1.26.11+rke2r1

The cloud-init reconciles correctly, and changes can be seen on node2 only:

node2:/oem # ls -lart
total 33
drwx------   2 root root 12288 Dec 14 03:24 lost+found
-rw-rw-rw-   1 root root  9463 Dec 14 03:24 90_custom.yaml
-rw-------   1 root root  1747 Dec 14 03:25 harvester.config
-rwxr-xr-x   1 root root   585 Dec 14 03:25 elemental.config
drwxr-xr-x. 22 root root  4096 Dec 14 03:35 ..
-rw-r--r--   1 root root  1024 Dec 14 03:35 grubenv
drwxr-xr-x   2 root root  1024 Dec 14 03:35 install
-rw-------   1 root root   154 Dec 17 23:51 write-sample.yml
drwxr-xr-x   4 root root  1024 Dec 17 23:51 .
node2:/oem # more write-sample.yml
#cloud-config
write_files:
- encoding: b64
  content: CiMgVGhpcyBmaWxlIGNvbnRyb2xzIHRoZSBzdGF0ZSBvZiBTRUxpbnV4
  path: /tmp/content
  permissions: "0644"
node2:/oem #

no changes are deployed to node1

node1:/oem # ls -alrt
total 119
drwx------   2 root root 12288 Dec 14 02:56 lost+found
-rw-rw-rw-   1 root root 98248 Dec 14 02:57 90_custom.yaml
-rw-------   1 root root  1754 Dec 14 02:57 harvester.config
-rwxr-xr-x   1 root root   586 Dec 14 02:57 elemental.config
drwxr-xr-x. 22 root root  4096 Dec 14 03:06 ..
-rw-r--r--   1 root root  1024 Dec 14 03:06 grubenv
drwxr-xr-x   2 root root  1024 Dec 14 03:06 install
-rw-r--r--   1 root root   238 Dec 14 03:15 99_settings.yaml
drwxr-xr-x   4 root root  1024 Dec 14 03:15 .

cloudinit object reflects the status correctly:

status:
  rollouts:
  - conditions:
    - lastTransitionTime: "2023-12-17T23:51:01Z"
      message: MatchSelector does not match Node labels
      reason: CloudInitNotApplicable
      status: "False"
      type: Applicable
    - lastTransitionTime: "2023-12-17T23:51:01Z"
      message: Local file checksum is different than the CloudInit checksum
      reason: CloudInitChecksumMismatch
      status: "True"
      type: OutOfSync
    - lastTransitionTime: "2023-12-17T23:51:01Z"
      message: write-sample.yml is absent from /oem
      reason: CloudInitAbsentFromDisk
      status: "False"
      type: Present
    nodeName: node1
  - conditions:
    - lastTransitionTime: "2023-12-17T23:51:01Z"
      message: ""
      reason: CloudInitApplicable
      status: "True"
      type: Applicable
    - lastTransitionTime: "2023-12-17T23:51:01Z"
      message: Local file checksum is the same as the CloudInit checksum
      reason: CloudInitChecksumMatch
      status: "False"
      type: OutOfSync
    - lastTransitionTime: "2023-12-17T23:51:01Z"
      message: write-sample.yml is present under /oem
      reason: CloudInitPresentOnDisk
      status: "True"
      type: Present
    nodeName: node2

modifications to the file are detected and reconciled as expected

node2:/oem # > write-sample.yml
node2:/oem # ls -alrt
total 33
drwx------   2 root root 12288 Dec 14 03:24 lost+found
-rw-rw-rw-   1 root root  9463 Dec 14 03:24 90_custom.yaml
-rw-------   1 root root  1747 Dec 14 03:25 harvester.config
-rwxr-xr-x   1 root root   585 Dec 14 03:25 elemental.config
drwxr-xr-x. 22 root root  4096 Dec 14 03:35 ..
-rw-r--r--   1 root root  1024 Dec 14 03:35 grubenv
drwxr-xr-x   2 root root  1024 Dec 14 03:35 install
-rw-------   1 root root   154 Dec 17 23:55 write-sample.yml
drwxr-xr-x   4 root root  1024 Dec 17 23:55 .
node2:/oem # more write-sample.yml
#cloud-config
write_files:
- encoding: b64
  content: CiMgVGhpcyBmaWxlIGNvbnRyb2xzIHRoZSBzdGF0ZSBvZiBTRUxpbnV4
  path: /tmp/content
  permissions: "0644"

when a file is removed and resynced by the controller, should the timestamp on the conditions change to reflect the last action? currently the condition timestamps are not updated.

@connorkuehl (Author) commented Dec 18, 2023

@ibrokethecloud thanks for the review! I will incorporate the feedback after I get the webhook patches merged. 🙂

edit:

> when a file is removed and resynced by the controller, should the timestamp on the conditions change to reflect the last action? currently the condition timestamps are not updated.

Yeah, I can add that behavior as well.
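Purely as an illustration of that behavior, a hypothetical condition helper that refreshes the timestamp on every resync; the real CloudInit status types in this PR may differ.

package conditions

import "time"

// Condition is a trimmed-down, hypothetical stand-in for the rollout
// conditions shown in the CloudInit status above.
type Condition struct {
	Type               string
	Status             string
	Reason             string
	Message            string
	LastTransitionTime time.Time
}

// setCondition upserts a condition and refreshes its timestamp even when
// the status value has not changed, so a resync of the file remains
// visible in the object's status.
func setCondition(conds []Condition, c Condition) []Condition {
	c.LastTransitionTime = time.Now()
	for i := range conds {
		if conds[i].Type == c.Type {
			conds[i] = c
			return conds
		}
	}
	return append(conds, c)
}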

@connorkuehl (Author)

Hey everyone,

Thanks for sticking with me 😅. The HEP has been reviewed, and the webhook that fell out of it has been merged.

The notable changes to this patch since you last saw it mostly address the review feedback directly, but a couple of behavioral changes have also emerged since the last version you reviewed, due to requirements that we drew from the HEP.

The first change is that the controller checks the CloudInit Spec to see if it has Paused: true. If so, it does not attempt to reconcile; it just makes sure the Status is up to date.
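Sketched with trimmed-down, hypothetical types (not the actual pkg/apis definitions), that check looks roughly like this:

package main

import "fmt"

// Trimmed-down stand-ins for the CloudInit spec, used only in this sketch.
type CloudInitSpec struct {
	Paused   bool
	Filename string
	Contents string
}

type CloudInit struct {
	Name string
	Spec CloudInitSpec
}

// reconcile skips writing to /oem when the resource is paused; a real
// controller would still refresh the object's Status here.
func reconcile(ci *CloudInit) error {
	if ci.Spec.Paused {
		fmt.Printf("%s is paused; refreshing status only\n", ci.Name)
		return nil
	}
	fmt.Printf("writing %s to /oem\n", ci.Spec.Filename)
	return nil
}

func main() {
	_ = reconcile(&CloudInit{Name: "ssh-access", Spec: CloudInitSpec{Paused: true, Filename: "99_ssh.yaml"}})
}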

The second change is that the controller now creates Events when it overwrites or removes a file from a Harvester host. It looks a little something like this:

Events:
  Type    Reason                  Age                From                    Message
  ----    ------                  ----               ----                    -------
  Normal  CloudInitFileModified   18m (x2 over 18m)  harvester-node-manager  99_ssh.yaml has been overwritten on harv1
  Normal  CloudInitNotApplicable  2m5s               harvester-node-manager  99_ssh.yaml has been removed from harv1
  Normal  CloudInitNotApplicable  2m5s               harvester-node-manager  99_ssh.yaml has been removed from harv2
  Normal  CloudInitNotApplicable  2m5s               harvester-node-manager  99_ssh.yaml has been removed from harv3
  Normal  CloudInitFileModified   92s                harvester-node-manager  99_ssh.yaml has been overwritten on harv2
  Normal  CloudInitFileModified   92s                harvester-node-manager  99_ssh.yaml has been overwritten on harv3
  Normal  CloudInitFileModified   79s (x2 over 92s)  harvester-node-manager  99_ssh.yaml has been overwritten on harv1
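As a rough sketch (the recorder wiring and object types are assumed here, not taken from this PR), emitting those Events with client-go's EventRecorder could look like this; the reason strings mirror the output above.

package cloudinit

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
)

// recordFileEvent emits an Event describing what happened to the managed
// file on a given node; the caller supplies the EventRecorder and the
// CloudInit object.
func recordFileEvent(recorder record.EventRecorder, cloudInit runtime.Object, filename, nodeName string, removed bool) {
	if removed {
		recorder.Eventf(cloudInit, corev1.EventTypeNormal, "CloudInitNotApplicable",
			"%s has been removed from %s", filename, nodeName)
		return
	}
	recorder.Eventf(cloudInit, corev1.EventTypeNormal, "CloudInitFileModified",
		"%s has been overwritten on %s", filename, nodeName)
}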

Note that if you want to try this out for yourself, it is a little more involved this time around because my PR to enable image builds/pushes of the webhook is not merged yet, and I am actively working on updating node-manager's Helm chart to include the webhook.

That means that if you want to try it out, you'll need to do some setup:

  1. Apply the CloudInit CRD: https://github.com/harvester/node-manager/raw/master/manifests/crds/node.harvesterhci.io_cloudinits.yaml
  2. Apply the updated RBAC in this PR: https://github.com/harvester/node-manager/pull/15/files#diff-a3f9e35eaeb026c24daddfa3ae7fce5b17879ad337771f161e655643f1ba1696
  3. Apply the webhook deployment: https://github.com/harvester/node-manager/raw/master/manifests/deployment.yaml
  4. Apply the webhook service: https://github.com/harvester/node-manager/raw/master/manifests/service.yaml

Then you can check out this branch; run make build, make package, and make package-webhook; docker image save both of those images; scp them to your Harvester node(s); and import them (ctr -n=k8s.io image import ...).

Don't forget to update both the webhook deployment and the node-manager daemonset to use the imported images, and then restart their rollouts.

@w13915984028 (Member) left a comment

LGTM, thanks.

@markhillgit left a comment

lgtm thanks!

@Vicente-Cheng (Contributor)

Generally works well. I left some comments above.

Tested with your node-manager/node-manager-webhook

Test CloudInit CR:

apiVersion: node.harvesterhci.io/v1beta1
kind: CloudInit
metadata:
  name: sample-cloud-init
spec:
  contents: |
    #cloud-config
    write_files:
    - content: "Hello cloud-init!"
      path: /tmp/hello.txt
      permissions: "0644"
  matchSelector:
    "kubernetes.io/hostname": "harvester-node-0"
  filename: "sample.yml"

Check the /oem/ folder:

harvester-node-0:~ # cat /oem/sample.yml
#cloud-config
write_files:
- content: "Hello cloud-init!"
  path: /tmp/hello.txt
  permissions: "0644"

After a reboot, check /tmp/hello.txt:

harvester-node-0:~ # cat /tmp/hello.txt
Hello cloud-init!

Also, I checked that Events work as well.
Thanks for working on that!

Commit message:

The CloudInit controller will reconcile CloudInit resources (introduced with previous patches to add a webhook for the resource).

It also places an inotify watch on `/oem` so that any local
modifications are also subject to reconciliation.

Signed-off-by: Connor Kuehl <[email protected]>
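The commit message mentions the inotify watch on /oem; a minimal sketch (not the PR's actual monitor code) of how an fsnotify watch can feed local modifications back into reconciliation follows. The enqueue callback is a hypothetical stand-in for requeuing the affected CloudInit objects.

package main

import (
	"log"

	"github.com/fsnotify/fsnotify"
)

// watchOEM forwards create/write/remove events under /oem to the enqueue
// callback so the controller can resync the corresponding CloudInit.
func watchOEM(enqueue func(path string)) error {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		return err
	}
	defer watcher.Close()

	if err := watcher.Add("/oem"); err != nil {
		return err
	}

	for {
		select {
		case event, ok := <-watcher.Events:
			if !ok {
				return nil
			}
			// Any local change may mean drift from the CloudInit spec.
			if event.Op&(fsnotify.Create|fsnotify.Write|fsnotify.Remove) != 0 {
				enqueue(event.Name)
			}
		case err, ok := <-watcher.Errors:
			if !ok {
				return nil
			}
			log.Printf("watch error: %v", err)
		}
	}
}

func main() {
	_ = watchOEM(func(path string) { log.Printf("resync needed for %s", path) })
}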
@connorkuehl changed the title from "Add CloudInit CRD and controller" to "Add CloudInit controller" on Jan 22, 2024
@ibrokethecloud (Contributor) left a comment

lgtm. thanks.

@Vicente-Cheng (Contributor) left a comment

nice work! thanks!

@bk201 merged commit f1ebee0 into harvester:master on Jan 23, 2024
4 checks passed
@connorkuehl deleted the 3902-v2 branch on January 23, 2024 at 13:29