Add CloudInit controller #15
Conversation
I'm wondering if it's worth writing a follow-up PR to add an admission controller that ensures the document is parsable.

ping

Hi @connorkuehl, I will take a look at this PR next week. Sorry for the late update.

ping

Thanks for the review, @Vicente-Cheng! 🙂 As for the comments I have left open and haven't responded to, I am working on addressing them in the next revision.

ping
Overall LGTM, but we still need to discuss the OnRemove part.
@Vicente-Cheng @connorkuehl I don't think we should be second-guessing the rollback by performing extra operations during OnRemove. AFAIK, apart from the chroot stage, all commands are applied on top of the base image but not persisted to the image itself; they just get re-applied on each boot. So removing the file should roll back the effect of those changes. We should advise users that to roll back the effect of earlier changes, they will need to manually submit another CRD which contains steps to reverse the original behavior.
@ibrokethecloud @Vicente-Cheng The controller doesn't automatically apply the changes; right now it assumes the user will reboot the node after making changes to the CRD. So under that model, just removing the CRD and rebooting the node should be sufficient, right?
Thanks for adding this new feature.
A few questions / suggestions:
(1) Add an EPIC to include:
- A new CRD & controller deserve a HEP to describe the user story, specification, test plan, ...
- Update the Harvester docs about feature usage, troubleshooting, ...
- Potential UI enhancement
- Possible webhook to check file name / format
(2) Those internal configuration files are very important for the Harvester cluster to work correctly; malformed files may cause the cluster to fail to start. This feature officially exposes those files for manipulation, so it is important to vet the exposed file paths, names, and contents. At the first stage, allowing only specific files will be more secure.
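To make point (2) concrete, here is a minimal sketch of the kind of checks such a webhook might perform, which also ties back to the earlier suggestion of validating parsability. Everything here is an assumption for illustration: the allowedDir value, the function shape, and the use of gopkg.in/yaml.v3 are not taken from the PR.

```go
package validation

import (
	"fmt"
	"path/filepath"

	"gopkg.in/yaml.v3"
)

// allowedDir is an assumption for illustration: only files directly
// under /oem may be managed by a CloudInit resource.
const allowedDir = "/oem"

// validateCloudInit sketches the two suggested checks: constrain the
// file name/path, and confirm the contents parse as YAML.
func validateCloudInit(filename, contents string) error {
	// Reject names that escape the allowed directory (e.g. "../etc/passwd").
	clean := filepath.Clean(filepath.Join(allowedDir, filename))
	if filepath.Dir(clean) != allowedDir {
		return fmt.Errorf("file %q must live directly under %s", filename, allowedDir)
	}
	// Reject documents that are not parsable YAML.
	var doc map[string]interface{}
	if err := yaml.Unmarshal([]byte(contents), &doc); err != nil {
		return fmt.Errorf("contents are not valid YAML: %w", err)
	}
	return nil
}
```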
```go
if err != nil {
	return err
}
defer os.RemoveAll(tempFile.Name())
```
Should those 2 still be deferred after os.Rename? I.e., tempFile.Close() and then os.Rename?
It is harmless on the success path, and it helps tidy things up if the io.Copy fails.
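For context, the snippet under review follows the common write-to-temp-then-rename idiom. Below is a hedged reconstruction of that shape; the writeFileAtomic name and signature are invented for illustration and are not the PR's actual code.

```go
package fileutil

import (
	"io"
	"os"
	"path/filepath"
)

// writeFileAtomic is a hypothetical helper illustrating the idiom:
// write to a temp file, then atomically rename it into place.
func writeFileAtomic(dst string, contents io.Reader) error {
	tempFile, err := os.CreateTemp(filepath.Dir(dst), ".tmp-*")
	if err != nil {
		return err
	}
	// After a successful os.Rename the temp path no longer exists, so
	// this deferred RemoveAll is a harmless no-op; if io.Copy fails, it
	// cleans up the partial file.
	defer os.RemoveAll(tempFile.Name())
	defer tempFile.Close()

	if _, err := io.Copy(tempFile, contents); err != nil {
		return err
	}
	if err := tempFile.Close(); err != nil {
		return err
	}
	// Rename is atomic within a filesystem, so readers never observe a
	// partially written destination file.
	return os.Rename(tempFile.Name(), dst)
}
```

On that reading, keeping both defers is the simplest correct arrangement: they only do real work on the failure paths.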
Hi @connorkuehl,
Other than the minor feedback, the changes work as expected. I was able to deploy a custom cloud-init to a 2-node cluster with a sample CloudInit targeting node2 only:
- cloud-init reconciles correctly, and the changes are visible on node2 only
- no changes are deployed to node1
- the cloudinit object reflects the status correctly
- modifications to the file are detected and reconciled as expected

When a file is removed and resynced by the controller, should the timestamp on the conditions change to reflect the last action? Currently the condition timestamps are not updated.
@ibrokethecloud thanks for the review! I will incorporate the suggestions after I get the webhook patches merged. 🙂
Edit: yeah, I can add that behavior as well.
Hey everyone, thanks for sticking with me 😅. The HEP has been reviewed, and the webhook that fell out of it has been merged. The notable changes to this patch since you last saw it directly address feedback, but a couple of behavioral changes have also emerged from requirements we drew from the HEP. The first is that the controller checks the CloudInit spec to see if it has Paused: true; if so, it does not attempt to reconcile, it just makes sure the Status is up to date. The second is that the controller now creates Events when it overwrites or removes a file from a Harvester host.
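As a hedged sketch of those two behaviors (the type shapes, callbacks, and reason string below are stand-ins, not the PR's actual code):

```go
package controller

import "fmt"

// CloudInit is a stand-in for the CRD's Go type; field names are guesses
// based on the description above.
type CloudInit struct {
	Name string
	Spec struct {
		Paused bool
	}
}

// Handler sketches the reconcile gate: skip host writes when paused,
// and record an Event when a file is overwritten or removed.
type Handler struct {
	recordEvent func(reason, message string) // stand-in for a Kubernetes EventRecorder
	writeFile   func(ci *CloudInit) error
	syncStatus  func(ci *CloudInit) error
}

func (h *Handler) OnChange(ci *CloudInit) error {
	if ci == nil {
		return nil
	}
	if ci.Spec.Paused {
		// Paused: leave the host alone; only keep Status up to date.
		return h.syncStatus(ci)
	}
	if err := h.writeFile(ci); err != nil {
		return err
	}
	// Surface the overwrite as an Event, as described above.
	h.recordEvent("FileUpdated", fmt.Sprintf("wrote cloud-init file for %s", ci.Name))
	return h.syncStatus(ci)
}
```

In the real controller, the stand-ins would be the generated CRD type and a Kubernetes EventRecorder.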
Note that if you want to try this out for yourself, it is a little more involved this time around because my PR to enable image builds/pushes of the webhook is not merged yet, and I am actively working on updating node-manager's Helm chart to include the webhook. That means some setup: check out this branch, build and import the images, update the webhook deployment to use the imported image as well as the node-manager daemonset, and then restart their rollouts.
LGTM, thanks.
lgtm thanks!
Generally works well; I left some comments above. Tested with your node-manager/node-manager-webhook images and a test-cloudinit CR:
- checked the CloudInit CR status
- after reboot, checked /tmp/hello.txt
- also checked that Events work as well
The CloudInit controller reconciles CloudInit resources (introduced with previous patches that add a webhook for the resource). It also places an inotify watch on `/oem` so that any local modifications are also subject to reconciliation. Signed-off-by: Connor Kuehl <[email protected]>
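A minimal sketch of how such a watch might be wired up, assuming the fsnotify library (github.com/fsnotify/fsnotify) that commonly backs inotify watches in Go; the enqueue callback is hypothetical:

```go
package watcher

import (
	"log"

	"github.com/fsnotify/fsnotify"
)

// WatchOEM places a watch on /oem and calls enqueue for every change,
// so locally modified files get re-queued for reconciliation. The
// enqueue callback is a stand-in for the controller's requeue logic.
func WatchOEM(enqueue func(path string)) error {
	w, err := fsnotify.NewWatcher()
	if err != nil {
		return err
	}
	defer w.Close()

	if err := w.Add("/oem"); err != nil {
		return err
	}
	for {
		select {
		case event, ok := <-w.Events:
			if !ok {
				return nil
			}
			// Any create/write/remove/rename under /oem triggers a resync.
			if event.Op&(fsnotify.Create|fsnotify.Write|fsnotify.Remove|fsnotify.Rename) != 0 {
				enqueue(event.Name)
			}
		case err, ok := <-w.Errors:
			if !ok {
				return nil
			}
			log.Printf("fsnotify error: %v", err)
		}
	}
}
```

Whether the real watcher filters by Op or resyncs on every event is a detail of the PR; the sketch just shows the shape of the loop.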
lgtm. thanks.
nice work! thanks!
Depends on: #14

Problem:

A Harvester cluster operator can currently modify node configuration by SSH'ing into a node and modifying the relevant Elemental cloud-init configuration file under `/oem`. However, this is incompatible with a GitOps approach to managing the state of the cluster, at least not without writing some code of their own.

Solution:

This PR adds a new CloudInit CRD, a controller, and an fsnotify watcher to harvester-node-manager. This allows cluster operators to layer a cloud-init configuration on top of a node's existing configuration and to target specific nodes in the cluster with the `matchSelector` field in the cloud-init spec. For example, to add SSH access for the Rancher user with my SSH key across every node in the cluster, I can apply a CloudInit resource whose `matchSelector` matches every node.
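As a hedged illustration of how that targeting could be evaluated per node, standard Kubernetes label selection would suffice; the helper and field shape below are assumptions, not the PR's actual code:

```go
package controller

import (
	"k8s.io/apimachinery/pkg/labels"
)

// nodeMatches reports whether a node's labels satisfy a CloudInit's
// matchSelector. An empty selector matches every node, which is how the
// "every node in the cluster" example above would be expressed.
func nodeMatches(nodeLabels, matchSelector map[string]string) bool {
	return labels.SelectorFromSet(labels.Set(matchSelector)).Matches(labels.Set(nodeLabels))
}
```

Treating an empty or missing `matchSelector` as match-all mirrors the usual Kubernetes label-selector convention, though the PR's actual semantics are defined by its webhook and controller.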
Related Issue: harvester/harvester#3902
Test plan:
Much of the test plan appears in this PR as automated unit tests.
However, to confirm operation on a multi-node cluster, I ran these manual steps:

1. Apply the new CRD (`kubectl apply -f ./manifests/crds/node.harvesterhci.io_cloudinits.yaml`)
2. Import the locally built image on the node (`ctr -n=k8s.io image import /home/rancher/harvester-node-manager.tar`)
3. Pause the Harvester managed chart so the daemonset change isn't reverted (`kubectl patch managedchart harvester -n fleet-local -p='{"spec":{"paused":true}}' --type=merge`)
4. Point the daemonset at the imported image (`kubectl edit daemonset/harvester-node-manager -n harvester-system`)
5. Restart the rollout (`kubectl rollout restart daemonset/harvester-node-manager -n harvester-system`)
6. Apply CloudInit resources and verify the corresponding files under `/oem`.