Add CloudInit controller #15

Merged: 1 commit merged into harvester:master from connorkuehl:3902-v2 on Jan 23, 2024

Conversation

@connorkuehl commented Sep 6, 2023

Depends on: #14


Problem:

A Harvester cluster operator can currently modify node configuration by SSH'ing into the node and editing the relevant Elemental cloud-init configuration file under /oem. However, this is incompatible with a GitOps approach to managing the state of the cluster, at least not without the operator writing some code of their own.

Solution:

This PR adds a new CloudInit CRD, a controller, and an fsnotify watcher to harvester-node-manager. This allows cluster operators to layer cloud-init configuration on top of a node's existing configuration and to target specific nodes in the cluster with the matchSelector field in the CloudInit spec.

For example, to add SSH access for the rancher user with my SSH key across every node in the cluster, I can represent this like so:

apiVersion: node.harvesterhci.io/v1beta1
kind: CloudInit
metadata:
  name: ssh-access
spec:
  matchSelector: {}
  filename: 99_ssh.yaml
  contents: |
    stages:
      network:
        - authorized_keys:
            rancher:
              - ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIPCUNsQEnKj0nl1GS07Qr5RDCCbCim4wu06hCzQZDmTk ckuehl@suselaptop

Related Issue: harvester/harvester#3902

Test plan:

Much of the test plan appears in this PR as automated unit tests.

However, to confirm operation on a multi-node cluster, I ran these manual steps:

  1. Apply the CRD (kubectl apply -f ./manifests/crds/node.harvesterhci.io_cloudinits.yaml)
  2. Import a container built from this PR onto all nodes in the cluster (ctr -n=k8s.io image import /home/rancher/harvester-node-manager.tar)
  3. Patch the Harvester ManagedChart (kubectl patch managedchart harvester -n fleet-local -p='{"spec":{"paused":true}}' --type=merge)
  4. Edit the harvester-node-manager daemonset to use the container imported in step 2 (kubectl edit daemonset/harvester-node-manager -n harvester-system)
  5. Rollout the new container (kubectl rollout restart daemonset/harvester-node-manager -n harvester-system)
  6. Wait for that to finish
  7. Apply a CloudInit object (like the one listed above)
  8. SSH in, confirm the file has been created under /oem.

@connorkuehl (Author)

I'm wondering if it's worth writing a follow-up PR to add an admission controller that ensures the document is parsable by yip, as well as protecting certain files like elemental.config, grubenv, harvester.config, and install.
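For illustration, a rough sketch of the kind of check such a webhook could perform; the ValidateCloudInit helper and protectedFiles list are hypothetical, and a real implementation would parse the contents with yip's schema loader rather than a generic YAML unmarshal:

package main

import (
	"fmt"

	"gopkg.in/yaml.v3"
)

// Files under /oem that the hypothetical webhook would refuse to manage.
var protectedFiles = map[string]bool{
	"elemental.config": true,
	"grubenv":          true,
	"harvester.config": true,
	"install":          true,
}

// ValidateCloudInit rejects protected filenames and contents that are not
// even well-formed YAML. (A real admission webhook would go further and
// load the document with yip's parser.)
func ValidateCloudInit(filename, contents string) error {
	if protectedFiles[filename] {
		return fmt.Errorf("filename %q is protected and cannot be managed by a CloudInit resource", filename)
	}
	var doc map[string]interface{}
	if err := yaml.Unmarshal([]byte(contents), &doc); err != nil {
		return fmt.Errorf("contents are not valid YAML: %w", err)
	}
	return nil
}

func main() {
	fmt.Println(ValidateCloudInit("grubenv", "stages: {}"))              // rejected
	fmt.Println(ValidateCloudInit("99_ssh.yaml", "stages:\n  network: []\n")) // accepted
}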

@connorkuehl (Author)

ping

@Vicente-Cheng (Contributor)

Hi @connorkuehl, I will take a look at this PR next week. Sorry for the late update.

@connorkuehl (Author)

ping

@connorkuehl (Author)

Thanks for the review, @Vicente-Cheng! 🙂 For the comments I have left open and haven't responded to, I am working on addressing them in the next revision.

@connorkuehl (Author)

ping

@Vicente-Cheng (Contributor) left a comment

overall lgtm, but we still need to discuss the OnRemove part.

@ibrokethecloud (Contributor)

@Vicente-Cheng @connorkuehl I don't think we should be second-guessing the rollback by performing some operations during OnRemove.

AFAIK, apart from the chroot stage, all commands are applied on top of the base image but not persisted to the image itself; they just get re-applied on each boot. So removing the file should roll back the effect of those changes?

We should advise users that to roll back the effect of earlier changes, they will need to manually submit another CRD which contains steps to reverse the original behavior.

@connorkuehl (Author)

> We should advise users that to roll back the effect of earlier changes, they will need to manually submit another CRD which contains steps to reverse the original behavior.

@ibrokethecloud @Vicente-Cheng The controller doesn't automatically apply the changes; right now it assumes the user will reboot the node after making changes to the CRD, so under that model just removing the CRD and rebooting the node should be sufficient, right?

@w13915984028 (Member) left a comment

Thanks for adding this new feature.

A few questions / suggestions:

(1) Add an EPIC that includes:
- A HEP for the new CRD & controller describing the user story, specification, test plan, etc.
- Updates to the Harvester docs about feature usage, troubleshooting, etc.
- Potential UI enhancements
- A possible webhook to check the file name / format

(2) These internal configuration files are very important for the Harvester cluster to work correctly; malformed files may cause the cluster to fail to start. This feature officially exposes those files for manipulation, so it is important to validate the exposed file path, name, and contents. As a first stage, allowing only specific files would be more secure.

Quoted code under review:

if err != nil {
    return err
}
defer os.RemoveAll(tempFile.Name())
A reviewer (Member) asked:

Should those two still be deferred after the os.Rename? That is, should it be tempFile.Close() and then os.Rename?

@connorkuehl (Author) replied:

It is harmless during the success path and it helps tidy things up if the io.Copy fails.
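For readers following along, a self-contained sketch of this temp-file-then-rename pattern (names and paths are illustrative, not the PR's exact code): on the success path the deferred Close and RemoveAll are harmless no-ops, and they clean up the temporary file if io.Copy or the explicit Close fails.

package main

import (
	"io"
	"os"
	"path/filepath"
	"strings"
)

// writeAtomically writes src to dst via a temporary file plus rename, so
// readers never observe a partially written file.
func writeAtomically(dst string, src io.Reader) error {
	tempFile, err := os.CreateTemp(filepath.Dir(dst), ".cloudinit-*")
	if err != nil {
		return err
	}
	defer os.RemoveAll(tempFile.Name()) // no-op once the rename has succeeded
	defer tempFile.Close()              // harmless after the explicit Close below

	if _, err := io.Copy(tempFile, src); err != nil {
		return err
	}
	if err := tempFile.Close(); err != nil {
		return err
	}
	return os.Rename(tempFile.Name(), dst)
}

func main() {
	_ = writeAtomically("/tmp/example.yaml", strings.NewReader("stages: {}\n"))
}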

@Vicente-Cheng (Contributor)

> @ibrokethecloud @Vicente-Cheng The controller doesn't automatically apply the changes; right now it assumes the user will reboot the node after making changes to the CRD, so under that model just removing the CRD and rebooting the node should be sufficient, right?

Hi @connorkuehl,
Yes, I thought that would be enough. Will this improvement have a UI part? If so, maybe we need to add some tips for when the user deletes the cloud-init config.

@ibrokethecloud (Contributor)

Other than the minor feedback, the changes work as expected. I was able to deploy a custom cloud-init to a 2-node cluster.

sample cloud-init is as follows:

apiVersion: node.harvesterhci.io/v1beta1
kind: CloudInit
metadata:
  name: sample-cloud-init
spec:
  contents: |
    #cloud-config
    write_files:
    - encoding: b64
      content: CiMgVGhpcyBmaWxlIGNvbnRyb2xzIHRoZSBzdGF0ZSBvZiBTRUxpbnV4
      path: /tmp/content
      permissions: "0644"
  matchSelector:
    "kubernetes.io/hostname": "node2"
  filename: "write-sample.yml"

k get nodes
NAME    STATUS   ROLES                       AGE     VERSION
node1   Ready    control-plane,etcd,master   3d20h   v1.26.11+rke2r1
node2   Ready    <none>                      3d20h   v1.26.11+rke2r1

The cloud-init reconciles correctly, and changes can be seen on node2 only:

node2:/oem # ls -lart
total 33
drwx------   2 root root 12288 Dec 14 03:24 lost+found
-rw-rw-rw-   1 root root  9463 Dec 14 03:24 90_custom.yaml
-rw-------   1 root root  1747 Dec 14 03:25 harvester.config
-rwxr-xr-x   1 root root   585 Dec 14 03:25 elemental.config
drwxr-xr-x. 22 root root  4096 Dec 14 03:35 ..
-rw-r--r--   1 root root  1024 Dec 14 03:35 grubenv
drwxr-xr-x   2 root root  1024 Dec 14 03:35 install
-rw-------   1 root root   154 Dec 17 23:51 write-sample.yml
drwxr-xr-x   4 root root  1024 Dec 17 23:51 .
node2:/oem # more write-sample.yml
#cloud-config
write_files:
- encoding: b64
  content: CiMgVGhpcyBmaWxlIGNvbnRyb2xzIHRoZSBzdGF0ZSBvZiBTRUxpbnV4
  path: /tmp/content
  permissions: "0644"
node2:/oem #

no changes are deployed to node1

node1:/oem # ls -alrt
total 119
drwx------   2 root root 12288 Dec 14 02:56 lost+found
-rw-rw-rw-   1 root root 98248 Dec 14 02:57 90_custom.yaml
-rw-------   1 root root  1754 Dec 14 02:57 harvester.config
-rwxr-xr-x   1 root root   586 Dec 14 02:57 elemental.config
drwxr-xr-x. 22 root root  4096 Dec 14 03:06 ..
-rw-r--r--   1 root root  1024 Dec 14 03:06 grubenv
drwxr-xr-x   2 root root  1024 Dec 14 03:06 install
-rw-r--r--   1 root root   238 Dec 14 03:15 99_settings.yaml
drwxr-xr-x   4 root root  1024 Dec 14 03:15 .

cloudinit object reflects the status correctly:

status:
  rollouts:
  - conditions:
    - lastTransitionTime: "2023-12-17T23:51:01Z"
      message: MatchSelector does not match Node labels
      reason: CloudInitNotApplicable
      status: "False"
      type: Applicable
    - lastTransitionTime: "2023-12-17T23:51:01Z"
      message: Local file checksum is different than the CloudInit checksum
      reason: CloudInitChecksumMismatch
      status: "True"
      type: OutOfSync
    - lastTransitionTime: "2023-12-17T23:51:01Z"
      message: write-sample.yml is absent from /oem
      reason: CloudInitAbsentFromDisk
      status: "False"
      type: Present
    nodeName: node1
  - conditions:
    - lastTransitionTime: "2023-12-17T23:51:01Z"
      message: ""
      reason: CloudInitApplicable
      status: "True"
      type: Applicable
    - lastTransitionTime: "2023-12-17T23:51:01Z"
      message: Local file checksum is the same as the CloudInit checksum
      reason: CloudInitChecksumMatch
      status: "False"
      type: OutOfSync
    - lastTransitionTime: "2023-12-17T23:51:01Z"
      message: write-sample.yml is present under /oem
      reason: CloudInitPresentOnDisk
      status: "True"
      type: Present
    nodeName: node2

modifications to the file are detected and reconciled as expected

node2:/oem # > write-sample.yml
node2:/oem # ls -alrt
total 33
drwx------   2 root root 12288 Dec 14 03:24 lost+found
-rw-rw-rw-   1 root root  9463 Dec 14 03:24 90_custom.yaml
-rw-------   1 root root  1747 Dec 14 03:25 harvester.config
-rwxr-xr-x   1 root root   585 Dec 14 03:25 elemental.config
drwxr-xr-x. 22 root root  4096 Dec 14 03:35 ..
-rw-r--r--   1 root root  1024 Dec 14 03:35 grubenv
drwxr-xr-x   2 root root  1024 Dec 14 03:35 install
-rw-------   1 root root   154 Dec 17 23:55 write-sample.yml
drwxr-xr-x   4 root root  1024 Dec 17 23:55 .
node2:/oem # more write-sample.yml
#cloud-config
write_files:
- encoding: b64
  content: CiMgVGhpcyBmaWxlIGNvbnRyb2xzIHRoZSBzdGF0ZSBvZiBTRUxpbnV4
  path: /tmp/content
  permissions: "0644"

when a file is removed and resynced by the controller, should the timestamp on the conditions change to reflect the last action? currently the condition timestamps are not updated.

@connorkuehl (Author) commented Dec 18, 2023

@ibrokethecloud thanks for the review! I will incorporate the feedback after I get the webhook patches merged. 🙂

edit:

> when a file is removed and resynced by the controller, should the timestamp on the conditions change to reflect the last action? currently the condition timestamps are not updated.

Yeah, I can add that behavior as well.
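Purely as an illustration of that behavior, a hypothetical condition helper that refreshes the timestamp on every resync; the real CloudInit status types in this PR may differ.

package conditions

import "time"

// Condition is a trimmed-down, hypothetical stand-in for the rollout
// conditions shown in the CloudInit status above.
type Condition struct {
	Type               string
	Status             string
	Reason             string
	Message            string
	LastTransitionTime time.Time
}

// setCondition upserts a condition and refreshes its timestamp even when
// the status value has not changed, so a resync of the file remains
// visible in the object's status.
func setCondition(conds []Condition, c Condition) []Condition {
	c.LastTransitionTime = time.Now()
	for i := range conds {
		if conds[i].Type == c.Type {
			conds[i] = c
			return conds
		}
	}
	return append(conds, c)
}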

@connorkuehl (Author)

Hey everyone,

Thanks for sticking with me 😅. The HEP has been reviewed, and the webhook that fell out of it has been merged.

The notable changes to this patch since you last saw it mostly address the review feedback directly, but a couple of behavioral changes have also emerged since the last version you reviewed, due to requirements that we drew from the HEP.

The first change is that the controller checks the CloudInit Spec to see if it has Paused: true. If so, it does not attempt to reconcile; it just makes sure the Status is up to date.
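Sketched with trimmed-down, hypothetical types (not the actual pkg/apis definitions), that check looks roughly like this:

package main

import "fmt"

// Trimmed-down stand-ins for the CloudInit spec, used only in this sketch.
type CloudInitSpec struct {
	Paused   bool
	Filename string
	Contents string
}

type CloudInit struct {
	Name string
	Spec CloudInitSpec
}

// reconcile skips writing to /oem when the resource is paused; a real
// controller would still refresh the object's Status here.
func reconcile(ci *CloudInit) error {
	if ci.Spec.Paused {
		fmt.Printf("%s is paused; refreshing status only\n", ci.Name)
		return nil
	}
	fmt.Printf("writing %s to /oem\n", ci.Spec.Filename)
	return nil
}

func main() {
	_ = reconcile(&CloudInit{Name: "ssh-access", Spec: CloudInitSpec{Paused: true, Filename: "99_ssh.yaml"}})
}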

The second change is that the controller now creates Events when it overwrites or removes a file from a Harvester host. It looks a little something like this:

Events:
  Type    Reason                  Age                From                    Message
  ----    ------                  ----               ----                    -------
  Normal  CloudInitFileModified   18m (x2 over 18m)  harvester-node-manager  99_ssh.yaml has been overwritten on harv1
  Normal  CloudInitNotApplicable  2m5s               harvester-node-manager  99_ssh.yaml has been removed from harv1
  Normal  CloudInitNotApplicable  2m5s               harvester-node-manager  99_ssh.yaml has been removed from harv2
  Normal  CloudInitNotApplicable  2m5s               harvester-node-manager  99_ssh.yaml has been removed from harv3
  Normal  CloudInitFileModified   92s                harvester-node-manager  99_ssh.yaml has been overwritten on harv2
  Normal  CloudInitFileModified   92s                harvester-node-manager  99_ssh.yaml has been overwritten on harv3
  Normal  CloudInitFileModified   79s (x2 over 92s)  harvester-node-manager  99_ssh.yaml has been overwritten on harv1
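As a rough sketch (the recorder wiring and object types are assumed here, not taken from this PR), emitting those Events with client-go's EventRecorder could look like this; the reason strings mirror the output above.

package cloudinit

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
)

// recordFileEvent emits an Event describing what happened to the managed
// file on a given node; the caller supplies the EventRecorder and the
// CloudInit object.
func recordFileEvent(recorder record.EventRecorder, cloudInit runtime.Object, filename, nodeName string, removed bool) {
	if removed {
		recorder.Eventf(cloudInit, corev1.EventTypeNormal, "CloudInitNotApplicable",
			"%s has been removed from %s", filename, nodeName)
		return
	}
	recorder.Eventf(cloudInit, corev1.EventTypeNormal, "CloudInitFileModified",
		"%s has been overwritten on %s", filename, nodeName)
}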

Note that if you want to try this out for yourself, it is a little more involved this time around because my PR to enable image builds/pushes of the webhook is not merged yet, and I am actively working on updating node-manager's Helm chart to include the webhook.

That means that if you want to try it out, you'll need to do some setup:

  1. Apply the CloudInit CRD: https://github.com/harvester/node-manager/raw/master/manifests/crds/node.harvesterhci.io_cloudinits.yaml
  2. Apply the updated RBAC in this PR: https://github.com/harvester/node-manager/pull/15/files#diff-a3f9e35eaeb026c24daddfa3ae7fce5b17879ad337771f161e655643f1ba1696
  3. Apply the webhook deployment: https://github.com/harvester/node-manager/raw/master/manifests/deployment.yaml
  4. Apply the webhook service: https://github.com/harvester/node-manager/raw/master/manifests/service.yaml

Then you can check out this branch; run make build, make package, and make package-webhook; docker image save both of those images; scp them to your Harvester node(s); and import them (ctr -n=k8s.io image import ...).

Don't forget to update both the webhook deployment and the node-manager daemonset to use the imported images, and then restart their rollouts.

@w13915984028 (Member) left a comment

LGTM, thanks.

@markhillgit left a comment

lgtm thanks!

@Vicente-Cheng (Contributor)

Generally works well. I left some comments above.

Tested with your node-manager/node-manager-webhook

Test CloudInit CR:

apiVersion: node.harvesterhci.io/v1beta1
kind: CloudInit
metadata:
  name: sample-cloud-init
spec:
  contents: |
    #cloud-config
    write_files:
    - content: "Hello cloud-init!"
      path: /tmp/hello.txt
      permissions: "0644"
  matchSelector:
    "kubernetes.io/hostname": "harvester-node-0"
  filename: "sample.yml"

Check the /oem/ folder:

harvester-node-0:~ # cat /oem/sample.yml
#cloud-config
write_files:
- content: "Hello cloud-init!"
  path: /tmp/hello.txt
  permissions: "0644"

After a reboot, check /tmp/hello.txt:

harvester-node-0:~ # cat /tmp/hello.txt
Hello cloud-init!

Also, I checked that Events work as well.
Thanks for working on that!

Commit message:

The CloudInit controller will reconcile CloudInit resources (introduced with previous patches to add a webhook for the resource).

It also places an inotify watch on `/oem` so that any local
modifications are also subject to reconciliation.

Signed-off-by: Connor Kuehl <[email protected]>
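The commit message mentions the inotify watch on /oem; a minimal sketch (not the PR's actual monitor code) of how an fsnotify watch can feed local modifications back into reconciliation follows. The enqueue callback is a hypothetical stand-in for requeuing the affected CloudInit objects.

package main

import (
	"log"

	"github.com/fsnotify/fsnotify"
)

// watchOEM forwards create/write/remove events under /oem to the enqueue
// callback so the controller can resync the corresponding CloudInit.
func watchOEM(enqueue func(path string)) error {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		return err
	}
	defer watcher.Close()

	if err := watcher.Add("/oem"); err != nil {
		return err
	}

	for {
		select {
		case event, ok := <-watcher.Events:
			if !ok {
				return nil
			}
			// Any local change may mean drift from the CloudInit spec.
			if event.Op&(fsnotify.Create|fsnotify.Write|fsnotify.Remove) != 0 {
				enqueue(event.Name)
			}
		case err, ok := <-watcher.Errors:
			if !ok {
				return nil
			}
			log.Printf("watch error: %v", err)
		}
	}
}

func main() {
	_ = watchOEM(func(path string) { log.Printf("resync needed for %s", path) })
}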
@connorkuehl changed the title from "Add CloudInit CRD and controller" to "Add CloudInit controller" on Jan 22, 2024
@ibrokethecloud (Contributor) left a comment

lgtm. thanks.

@Vicente-Cheng (Contributor) left a comment

nice work! thanks!

@bk201 merged commit f1ebee0 into harvester:master on Jan 23, 2024
4 checks passed
@connorkuehl deleted the 3902-v2 branch on January 23, 2024 at 13:29