Redesign idea: "Central nervous system for Kured" #359
WDYT of this?
Sounds good to me. I ended up writing part of something similar to this here: https://github.com/pnnl-miscscripts/miscscripts/tree/master/charts/charts/kubeupdater It is just the node side of things: it annotates the node with a label when the node needs an upgrade, and accepts a label to upgrade/reboot the node after the node has been drained. It unlabels the node when it's upgraded/rebooted OK. It's very particular to our node image though. I intended to write something like the brain eventually but haven't gotten around to it; we're just doing that part manually for now. So, yeah, I really like the idea of separating the components out like you describe.
I didn't determine the interface for announcing the reboot to the brain. The easiest is to annotate a node if you already know where the kubeconfig is. I didn't like writing to etcd for such a small thing, so I wanted an alternative way to write status (cm/CR/node annotation). I realised I could RPC to the brain, but then it gets flooded regularly at large scale. I thought about exposing a metric in the daemonset pod and scraping it with a prom client, but this is not easily done with a service on a ds. So the best alternative I found is for the brain to know the ds, fetch all its pods, and query those one by one. Which led me to 'why don't we ask the user to provide the kubeconfig', it's so much easier... (or guess it from an init) 🙂
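For illustration, here is a minimal sketch of the "annotate the node" option discussed above, assuming in-cluster credentials. The annotation key `kured.dev/reboot-required` and the `NODE_NAME` env var (downward API) are made up for this example and are not part of kured today.

```go
// Hypothetical detector-side sketch: patch this node with a "reboot required"
// annotation that the brain could watch. Names and keys are illustrative only.
package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	nodeName := os.Getenv("NODE_NAME") // injected via the downward API

	// Merge-patch the node with an annotation instead of RPC-ing to the brain.
	patch := []byte(`{"metadata":{"annotations":{"kured.dev/reboot-required":"true"}}}`)
	_, err = client.CoreV1().Nodes().Patch(context.TODO(), nodeName,
		types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("annotated", nodeName)
}
```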
For some edge cases, we might even think that the node reboot procedure should have a flag to allow the brain to NOT drain the node, for emergencies.
Current problem with the brain in this form: we now need the kubelet to always be available in order to schedule pods onto the node for a reboot. Kured can't be used for emergency reboots anymore.
Broadly I like this. I don't like to increase the number of moving parts, but some features, e.g. prioritising nodes or rebooting N in parallel, are too hard to implement with all nodes autonomous. One thing though: hasn't someone done this already? I'm sure I have discussed Kured in comparison to some other tool that had a central operator to manage reboots.
https://github.com/jamiehannaford/coreos-reboot-operator is archived, and so is the tool superseding it ... Maybe there are others? I like the idea of reuse. Any ideas?
That does look very much like the one I was thinking of, thanks. Did anyone ask in the Kubernetes Slack channels? Might get some more prior art there.
Awesome, thanks @bboreham
We have no news from the SIG node front. I didn't ask for more details there.
In this proposal the brain would launch jobs to reboot nodes. Along the lines of #416, if there is a significant redesign, to me (as a security-conscious cluster operator) it makes sense to separate the rebooting logic (brain/detector) from the reboot executor (which actually does the reboot) as much as possible. The logic entails code (which carries the risk of bugs, supply chain attacks, and a generally larger attack surface) but should ideally require zero privileges to run. The reboot executor, on the other hand, entails elevated privileges, but should be as brainless as possible, asking only "should I reboot? yes/no". Ideally it would have enough privilege to do one and only one thing, gracefully reboot the node with systemd, but that is challenging to enforce in k8s. One idea for this would be to have a simple drop-in cron job on the nodes which just queries a kured component (brain/detector) (possibly via curling a port) to see if a node should be rebooted. I realize some people may like an all-in-one helm deployment, but from a cluster admin's perspective it is very appealing to have a very simple cron job to reboot a node if/when kured says it should, because then all elevated privileges would be contained in a very simple bash one-liner that is fully under my control, and kured could be (possibly aside from reboot detection) unprivileged.
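Sketched below, in Go rather than the bash one-liner mentioned, purely for illustration, is what that "brainless executor" could look like. It assumes a kured component listens on a node-local port (e.g. a hostPort on the detector pod) and answers a hypothetical `/should-reboot` endpoint with "yes" or "no"; neither the port nor the endpoint exists in kured today.

```go
// Hypothetical minimal executor: ask kured whether this node should reboot,
// and only then perform the single privileged action (a graceful systemd reboot).
package main

import (
	"io"
	"log"
	"net/http"
	"os/exec"
	"strings"
)

func main() {
	// Ask the (hypothetical) kured component whether this node should reboot now.
	resp, err := http.Get("http://127.0.0.1:8080/should-reboot")
	if err != nil {
		log.Fatalf("could not reach kured: %v", err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	if strings.TrimSpace(string(body)) != "yes" {
		return // nothing to do
	}

	// The only privileged action in this design: a graceful reboot via systemd.
	if err := exec.Command("systemctl", "reboot").Run(); err != nil {
		log.Fatalf("reboot failed: %v", err)
	}
}
```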
I don't agree with the verbiage above. Practically it makes sense to separate what runs to detect, what triggers a reboot, and what is effectively rebooting. Your "executor" should not ask whether a reboot is required, because that can mostly be done without privileges, or with a different set of privileges than the actual rebooting. What you are proposing with your cron job is not too far away from what I am proposing, with the disadvantage that you have to go through all nodes to adapt your cron job. You're right, we are paying the price here of having no configuration management that works efficiently both outside and inside kubernetes and that the whole community has agreed upon ;) As a reminder, I am proposing the separation described in this issue: a detector, a brain, and a configurable reboot procedure.
This would give far more flexibility to kured. It would still work for your case where you want to run everything with no privileges from a k8s perspective: the maintenance would simply drop a script onto the host by mounting a certain folder, and that script (chmod-ed 555?) could then be executed by your cron job.
This would make the "next gen" basically a node maintenance tool, hence the conversation with the kubernetes node SIG. It wouldn't be too far away from the tools above or from https://github.com/rancher/system-upgrade-controller
This PR implements the ability to schedule a reboot (one time, for a single node) from the config yaml.
Potentially also relevant: kubernetes/enhancements#4212
Hello everyone.
I guess this was supposed to be a blog post, but I will make it an issue instead. We can discuss it better here.
Intro
Our lovely kured works well for many. It's simple, and I would like to keep it that way.
Yet we have recently had more and more special cases, which require building a bit more complexity around our code.
Slowly, we'll increase our complexity and reduce our readability.
Next to this, we have seen PRs and issues in the past mentioning shortcomings of our current architecture, for flexibility 1 2 or security reasons 3.
I would like us to think about what "another kured" (for more complex cases) could be.
Here are my thoughts...
Separate the kured daemonset into components
Instead of having a single daemonset doing everything, I would separate it into multiple components:
The Brain
The brain would be based on the current code from `rebootAsRequired`. Instead of locking on the kured daemonset, it would lock on a separate object, like a config map.
The rest would be like the current model: checking for any reboot blockers, draining the node (which requires patch node permissions), and triggering a reboot of one or more nodes (depending on the "Nodes Maintenance Window"). Because the reboot wouldn't be local anymore, the brain would create a job to trigger a reboot, scheduled on the necessary node; that job can be described in the "Reboot Procedure".
Configuration of the behaviour of the brain would be done by editing custom resources and/or config maps.
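A minimal sketch (not kured's actual code) of the brain creating such a one-shot reboot Job pinned to the chosen node, using client-go. The namespace, image and command here are placeholders; in this proposal they would come from the "Reboot Procedure" configuration (config map / CR).

```go
// Hypothetical brain-side sketch: create a Job that runs the reboot procedure
// on the node that was just drained.
package main

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func createRebootJob(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	privileged := true
	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{
			GenerateName: "kured-reboot-",
			Namespace:    "kube-system", // placeholder namespace
		},
		Spec: batchv1.JobSpec{
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					NodeName:      nodeName, // pin the job to the node to be rebooted
					RestartPolicy: corev1.RestartPolicyNever,
					HostPID:       true,
					Containers: []corev1.Container{{
						Name:  "reboot",
						Image: "reboot-procedure-image", // placeholder: the real "Reboot Procedure" image
						// Placeholder command: enter the host's namespaces and ask systemd to reboot.
						Command:         []string{"nsenter", "-t", "1", "-m", "--", "systemctl", "reboot"},
						SecurityContext: &corev1.SecurityContext{Privileged: &privileged},
					}},
				},
			},
		},
	}
	_, err := client.BatchV1().Jobs(job.Namespace).Create(ctx, job, metav1.CreateOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	if err := createRebootJob(context.TODO(), client, "node-1"); err != nil {
		panic(err)
	}
}
```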
The "Node reboot detector"
This daemonset will detect if the node needs rebooting, and simultaneously report it back to the "brain" and optionally expose that metric to prometheus. The latter is similar to our `maintainRebootRequiredMetric`. We can keep the check period consistent between the brain and the detector by making sure both of them read `period` from the same config map/CR.
As an alternative to "report it back to the brain", the brain could fetch the node information exposed with the `rebootRequiredMetric`, or the reboot detector could `PATCH` a node with the relevant information (maybe by relying on the Node admission controller, by mounting the kubelet's credentials). TBD.
The daemonset is likely to be a single container, which will check for the presence of a file at an expected location, `/kured/reboot-required`. This only requires read access to that volume. No need for `privileged` or `hostPID`.
For the basic Ubuntu workloads, this is the same as having our current daemonset, with a volume mapping `/var/run` to `/kured/` in read-only mode.
For custom workloads, a user could define their own daemonset containers in addition to our "standard" one. The standard container would still check for the presence of the file `/kured/reboot-required`. However, in the custom case, that file could be mounted from a volume that's reachable from another container in the daemonset. That other container would implement its own logic to detect whether a reboot is required.
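A minimal sketch of that "standard" detector container, assuming the Prometheus Go client: the whole detection is a read-only stat of the sentinel file, and the gauge name, port and hard-coded period are simplifications (in the proposal, `period` would be read from the shared config map/CR).

```go
// Hypothetical detector sketch: watch /kured/reboot-required and expose a gauge.
package main

import (
	"net/http"
	"os"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var rebootRequired = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "kured_reboot_required",
	Help: "1 if the node needs a reboot, 0 otherwise.",
})

func main() {
	prometheus.MustRegister(rebootRequired)

	// Expose the gauge so the brain (or Prometheus) can fetch it.
	http.Handle("/metrics", promhttp.Handler())
	go http.ListenAndServe(":8080", nil)

	period := time.Minute // in the proposal, read from the shared config map/CR
	for {
		// The whole "detection" is a read-only check of the sentinel file.
		if _, err := os.Stat("/kured/reboot-required"); err == nil {
			rebootRequired.Set(1)
		} else {
			rebootRequired.Set(0)
		}
		time.Sleep(period)
	}
}
```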
What do we gain?
Flexibility
The node detector structure allows users to implement their own detection systems without any code change in kured.
The brain architecture allows users to implement their own reboot procedures without any code change in kured, and would allow more flexible maintenance windows.
Simplicity
The detector is now simpler, as it is just watching over a single file, on a read-only file system, with no extra privileges.
The brain simplifies some code, as coordination features like throttling will be easier to implement.
However, we will have a more complex architecture with an operator.
Security
We can isolate components better, so we can tighten the access controls.
The `detector` daemonset would only need access to:
- the `brain` config map, in read-only mode, to know the check period.
The `brain` would need access to:
- the lock object (e.g. a config map), instead of the kured daemonset;
- nodes, to cordon and drain them (patch node permissions);
- jobs, to create the reboot jobs on the relevant nodes;
- its own config maps/CRs, to read its configuration.
Note: With the reboot procedure outside the kured code, we can think of using less privileged access: using an API to reboot a node, or using different reboot methods. This would also allow us to use seccomp profiles in multiple places!
Scaling operations
By adding a NodesMaintenanceWindow CRD, a user can define multiple maintenance windows, including exceptional ones, by adding a new "Maintenance Window". This will be read by the `brain` when necessary. Creating a Custom Resource will also allow us to create new fields, like "Serialization", which would define, for example, how many nodes are to be rebooted at the same time.
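A hypothetical sketch of what such a NodesMaintenanceWindow resource could look like as Go API types (kubebuilder style). Every field name here is an assumption for discussion, not an agreed-upon schema.

```go
// Hypothetical API types for the NodesMaintenanceWindow CRD described above.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// NodesMaintenanceWindowSpec defines when and how aggressively nodes may be rebooted.
type NodesMaintenanceWindowSpec struct {
	// Start and End of the window, e.g. cron-like or RFC3339 strings.
	Start string `json:"start"`
	End   string `json:"end"`

	// Serialization defines how many nodes may be rebooted at the same time.
	Serialization int `json:"serialization,omitempty"`

	// NodeSelector restricts the window to a subset of nodes.
	NodeSelector map[string]string `json:"nodeSelector,omitempty"`
}

// NodesMaintenanceWindow is a recurring or exceptional maintenance window read by the brain.
type NodesMaintenanceWindow struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec NodesMaintenanceWindowSpec `json:"spec"`
}
```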