add KEP: Node Maintenance Lease #1411
Conversation
Welcome @michaelgugino!
Hi @michaelgugino. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Some notes from a recent sig-node meeting: sig-node members suggested that ownership of this KEP belongs to sig-cluster-lifecycle. I think that's a good approach. Also, TODO: add some notes about not impacting the kubelet or scheduling. The proposed lease has no direct effect on the kubelet; it only serves as information to other components that someone/something is requesting exclusive access to disrupt the kubelet.
@michaelgugino here's some feedback. It's only nits.
/ok-to-test
One more nit. Not a biggie.
@sftim Your new suggestions didn't come through here on GitHub.
D'oh! Thanks for explaining.
The Lease object should be created automatically, and the ownerRef should be the corresponding node so it is removed when the node is deleted.
### User Stories [optional]
Suggested change: `### User Stories [optional]` → `### User Stories`
was what I meant @michaelgugino
Thanks for your thoughts here @michaelgugino :)
I added some high-level questions and thoughts.
Utilize the existing `Lease` built-in API in the API group `coordination.k8s.io`. Create a new Lease object per-node with Name equal to Node name in a newly created dedicated namespace. That namespace should be created automatically (similarly to "default" and
I'm reading the above as we're proposing the addition of a third "automatically" created namespace, which we will name "kube-maintenance-leases" (or something similar). Am I understanding correctly?
Out of curiosity, do you know if there's policy or precedent around when we add new default namespaces? I'm not sure if this concern is justified, but the thought of each new feature creating its own default namespace kind of concerns me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think creating a new default namespace is probably the cleanest implementation. I'm unsure if there is a formal namespace review process. I'm relying on the reasoning here: https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/0009-node-heartbeat.md#proposal
I don't really have a hard preference for which namespace it lives in; putting it in the same namespace as the heartbeat would cause a name collision, and that's undesirable from a UX position (if node names are already at max length, then we need some kind of hashing or other mechanism which sacrifices readability).
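For illustration only, here is a minimal sketch of what acquiring such a per-node maintenance lease could look like against the `coordination.k8s.io` API via client-go. The namespace name (`kube-node-maintenance`), the function name, and the one-hour duration are assumptions made for this example, not part of the KEP.

```go
// Illustrative sketch only; the namespace name and duration are assumptions.
package nodemaintenance

import (
	"context"
	"fmt"
	"time"

	coordinationv1 "k8s.io/api/coordination/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// acquireMaintenanceLease takes the per-node maintenance lease, creating the
// Lease object if it does not exist yet (Get-or-Create).
func acquireMaintenanceLease(ctx context.Context, cs kubernetes.Interface, nodeName, holder string) error {
	const ns = "kube-node-maintenance" // assumed namespace

	leases := cs.CoordinationV1().Leases(ns)
	lease, err := leases.Get(ctx, nodeName, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		lease, err = leases.Create(ctx, &coordinationv1.Lease{
			ObjectMeta: metav1.ObjectMeta{Name: nodeName, Namespace: ns},
		}, metav1.CreateOptions{})
	}
	if err != nil {
		return err
	}

	// Refuse the lease if another holder currently has it. A real client would
	// also treat an expired lease (RenewTime + LeaseDurationSeconds in the past)
	// as free to take.
	if h := lease.Spec.HolderIdentity; h != nil && *h != "" && *h != holder {
		return fmt.Errorf("node %s is under maintenance by %s", nodeName, *h)
	}

	now := metav1.NewMicroTime(time.Now())
	duration := int32(3600)
	lease.Spec.HolderIdentity = &holder
	lease.Spec.AcquireTime = &now
	lease.Spec.RenewTime = &now
	lease.Spec.LeaseDurationSeconds = &duration

	// The update fails with a 409 conflict if another client raced us; that
	// conflict handling is what gives the lease its exclusivity.
	_, err = leases.Update(ctx, lease, metav1.UpdateOptions{})
	return err
}
```

A real client would additionally renew the lease while maintenance is in progress and retry on update conflicts; this sketch only shows the shape of the API interaction.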
I'm also unsure if there's a formal namespace review process :) Hopefully some other folks will be able to weigh in with insights :)
An end-to-end flow would help improve understanding.
## Motivation
Currently, there is no centralized way to inform a controller or user that |
Out of curiosity - do you know how folks are addressing this issue now? I.e. these days, if a cluster administrator needs to perform a disruptive operation against a node, what will they do?
It depends on what components they have running. Those components might use some annotation or other mechanism to pause operations, but I don't know of any specific component doing that today.
Some non-trivial conflicting components that exist today:
https://github.com/kubevirt/node-maintenance-operator
NMO is for assisting in scheduling maintenance of bare metal hardware.
https://github.com/openshift/machine-config-operator/
MCO is for upgrading node configurations.
You don't want those two components to compete.
In addition, say you've powered off a bare metal machine that was provisioned via cluster-api. The Machine Health Checker kubernetes-sigs/cluster-api#1684 would delete that node today without some kind of intervention.
Use a designated annotation on the node object, or possibly
elsewhere. Drawbacks to this include excess updates being applied to the node
object directly, which might have performance implications. Having such an
annotation natively seems like an anti-pattern and could easily be disrupted |
I'm reading this statement as saying that we can't use annotations as a locking mechanism, as that is not what they were designed for. Am I understanding correctly?
We could use annotations for this. I'm not sure this type of operation fits the mold of annotations, and annotations don't seem to be meant for this type of coordination; it doesn't seem like there should be a system-wide annotation, and I don't think I've seen one.
In any case, if other people decide an annotation is the best way/place to do this, that's perfectly fine with me, but the drawback of many things reading/writing to the node makes it less than ideal IMO.
If you used annotations for this, what would that look like, a key with a value which is the lease end time? What prevents someone from overwriting it? (Is that misbehaving?)
Annotations would look pretty ugly, probably. I haven't sat down and actually drawn out how it would work, but it would probably take more than one annotation, possibly three or more: one for acquire time, one for duration/release time, and one for ownership. These values could be coerced into a CSV string, but that's probably even worse.
I think if anyone has a solid implementation alternative they want specified here, I can include it, but I don't feel inclined to specify in such detail something that's not really being considered.
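To make the comparison concrete, here is a hypothetical sketch of what the annotation-based alternative might look like. The annotation keys and helper name are invented for illustration; they are not proposed by this KEP.

```go
// Hypothetical sketch of the annotation-based alternative; annotation keys are illustrative.
package nodemaintenance

import (
	"context"
	"encoding/json"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

const (
	acquireTimeAnnotation = "node-maintenance-lease.k8s.io/acquire-time" // illustrative key
	expireTimeAnnotation  = "node-maintenance-lease.k8s.io/expire-time"  // illustrative key
	holderAnnotation      = "node-maintenance-lease.k8s.io/holder"       // illustrative key
)

// annotateNodeForMaintenance writes the three annotations directly on the Node
// object. Every acquire/renew is another write to the Node, which is the main
// drawback discussed above.
func annotateNodeForMaintenance(ctx context.Context, cs kubernetes.Interface, nodeName, holder string, d time.Duration) error {
	now := time.Now().UTC()
	patch := map[string]interface{}{
		"metadata": map[string]interface{}{
			"annotations": map[string]string{
				acquireTimeAnnotation: now.Format(time.RFC3339),
				expireTimeAnnotation:  now.Add(d).Format(time.RFC3339),
				holderAnnotation:      holder,
			},
		},
	}
	data, err := json.Marshal(patch)
	if err != nil {
		return fmt.Errorf("marshal patch: %w", err)
	}
	// Note: a blind patch like this has no optimistic-concurrency guard, which
	// is one reason annotations are a weaker fit than the Lease API for
	// exclusive access.
	_, err = cs.CoreV1().Nodes().Patch(ctx, nodeName, types.StrategicMergePatchType, data, metav1.PatchOptions{})
	return err
}
```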
with a specific name. Clients could use a Get-or-Create method to check the
lease. This will make installing components that support this
functionality more difficult as RBAC will not be deterministic at install time
of those components. |
Another alternative to consider... (I think - I apologize if it doesn't actually meet the use case...) :)
Could we implement this entire workflow outside of core Kubernetes using a CRD and custom controllers? I.e. for those who wished to use `MaintenanceLease`, they could deploy the custom controller and create the CRD. I think that within the controller implementation, the proper locking behavior could be implemented.
Certainly it's an alternative. It presents more work for everyone. If you want to use the CRD-based method, you still need to agree on a namespace (or make it cluster-scoped), RBAC, and then introduce a hard dependency on your project (this CRD needs to be installed first).
I think utilizing the built-in coordination API would be better than a CRD. We could do things like add an option to kubectl to check for a lease before drain/cordon.
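As a sketch of that kubectl idea, a pre-drain check might look roughly like the following; the namespace name and helper function are assumptions, not anything kubectl actually provides today.

```go
// Hypothetical pre-drain check; namespace name is assumed.
package nodemaintenance

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// maintenanceLeaseHolder returns the current holder of the node's maintenance
// lease, or "" if the lease is absent, unheld, or expired. A drain/cordon
// helper could warn or refuse when a non-empty holder is returned.
func maintenanceLeaseHolder(ctx context.Context, cs kubernetes.Interface, nodeName string) (string, error) {
	const ns = "kube-node-maintenance" // assumed namespace

	lease, err := cs.CoordinationV1().Leases(ns).Get(ctx, nodeName, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return "", nil
	}
	if err != nil {
		return "", err
	}
	if lease.Spec.HolderIdentity == nil || *lease.Spec.HolderIdentity == "" {
		return "", nil
	}
	// Treat the lease as released once RenewTime + LeaseDurationSeconds has passed.
	if lease.Spec.RenewTime != nil && lease.Spec.LeaseDurationSeconds != nil {
		expiry := lease.Spec.RenewTime.Add(time.Duration(*lease.Spec.LeaseDurationSeconds) * time.Second)
		if time.Now().After(expiry) {
			return "", nil
		}
	}
	return *lease.Spec.HolderIdentity, nil
}
```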
I like the idea of an add-on, whether that's using a CRD or an annotation on Node resources. If the core of Kubernetes doesn't provide enough to let the add-on hook in the way this KEP intends, how about providing just enough extension to make the add-on possible?
kubectl could be aware of the add-on and react differently depending on whether a maintenance lease was set.
I'm against using a CRD for this. There is already the coordination API, which seems specifically designed for this type of thing. I would be okay with an annotation on the node.
I'm unsure what you mean by 'add-on' in this context.
I'm suggesting that this node maintenance lease not become part of the Kubernetes core, and instead it becomes a component that a cluster operator can add if they like it: an add-on.
Another KEP, #1308, covers a SIG Cluster Lifecycle mechanism for a cluster operator to configure which add-ons are added into a new cluster, and covers selection of the default configuration as well.
I'm suggesting that this node maintenance lease not become part of the Kubernetes core
Why are you suggesting this? Why would an add-on be preferential here over making it part of the core?
Some good feedback questions, thanks.
For me, I think the primary discussion points are:
- Do we want to introduce a new namespace for this?
- Do we want said namespace to be a system-namespace automatically created?
- If we introduce a new system namespace, do we want its scope to be single-purpose, or do we need room for a kube-system-2 type namespace with a more general purpose?
- Should we just declare a Kubernetes-wide annotation that can be applied to the node, and skip a bunch of hassle? E.g. node-maintenance-lease.k8s.io: "{'acquired-time': xxx, 'expire-time': xxx}" or similar.
Is there any interest in providing a more generic mechanism? For example, maybe I can mark maintenance against a StatefulSet, Service, or PersistentVolume. Maintenance doesn't just happen to Nodes.
Great summary @michaelgugino !
Yeah, I think at the end of the day my main question (which I don't know the answer to) is: what is the cost of adding a new system-wide namespace? My high-level fear is that, if the trend of creating new namespaces continued and we fast-forward a year, an engineer new to k8s would launch a cluster and see a ton of new namespaces they didn't understand. Whether that's an actual problem is definitely up for debate, though.
One thought - is there a way we could create the system-wide namespace "on demand"? I.e. it does not exist until the first time someone tries to take a maintenance lease? With ad hoc creation, the only folks experiencing "namespace pollution" are those who have knowingly opted into this feature.
I agree with you @michaelgugino that the `Lease` mechanism does feel like the right mechanism for the implementation, so I agree with your hesitancy to create something else from scratch.
BTW, add-ons are things like https://kubernetes.io/docs/concepts/cluster-administration/addons/
Seems valid to call that out (it's a concern for me too).
We could skip adding a new namespace and use an existing one instead. I'm open to suggestions here. I went with a new one due to the reasoning here: https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/0009-node-heartbeat.md#proposal
We could definitely use the kube-system namespace, or we could create some new, more generic namespace that has a variety of uses (which sounds like kube-system).
@wojtek-t I disagree; there is not much overlap. That KEP is about a node that goes down, this one is about maintenance. The proposed component(s) in the shutdown KEP could respond to the proposed lease in some way, or not. Ideally, if you have the maintenance lease, you drain the node before doing things to it, so the shutdown KEP wouldn't really take any meaningful effect.
It's pretty onerous to require existing open KEPs to use some new format. I'm fine with new KEPs using the new template, but we shouldn't require it for all of them prior to merging.
@michaelgugino as this currently stands, this KEP can't be tracked by the enhancements team as it doesn't have an associated issue in the repo. This is not a new requirement. You don't need to fill out a new template; I'm just asking you to rename your file and split out the metadata.
The Lease object will be created automatically by the NodeLifecycleController,
and the ownerRef will be the corresponding node so it is removed when the node
is deleted. If the Lease object is removed before the node is deleted, the
NodeLifecycleController will recreate it. |
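A minimal sketch, not actual NodeLifecycleController code, of what creating the Lease with an ownerRef pointing at the Node could look like; the namespace name is an assumption.

```go
// Sketch of creating a per-node maintenance Lease owned by the Node, so the
// garbage collector removes the Lease when the Node is deleted.
package nodemaintenance

import (
	"context"

	coordinationv1 "k8s.io/api/coordination/v1"
	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func ensureMaintenanceLease(ctx context.Context, cs kubernetes.Interface, node *corev1.Node) error {
	const ns = "kube-node-maintenance" // assumed namespace

	lease := &coordinationv1.Lease{
		ObjectMeta: metav1.ObjectMeta{
			Name:      node.Name,
			Namespace: ns,
			// The ownerRef ties the Lease's lifetime to the Node.
			OwnerReferences: []metav1.OwnerReference{{
				APIVersion: "v1",
				Kind:       "Node",
				Name:       node.Name,
				UID:        node.UID,
			}},
		},
	}
	_, err := cs.CoordinationV1().Leases(ns).Create(ctx, lease, metav1.CreateOptions{})
	if apierrors.IsAlreadyExists(err) {
		// Already recreated or never deleted; nothing to do.
		return nil
	}
	return err
}
```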
Is this the typical pattern for leases? Would it make sense for the lease to be created by the first thing trying to grab it, and then release it when they are finished by removing it? IIRC the lease libraries normally support this? Or is the advantage here that the NodeLifecycleController can make sure the owner references are set correctly?
### Use existing system namespace for lease object
We could utilize an existing system namespace for the lease object. The primary |
The original client-go locking used an annotation on a resource (e.g. a ConfigMap); I think this is what Tim refers to. The implementation underneath would attempt to patch the annotation value with the information about the lease (holder, duration, etc.) and rely on etcd to ensure conflicts were handled appropriately (i.e. if you tried to acquire after someone else had updated, you'd get a 409).
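For comparison, this is roughly how client-go's leader-election machinery is typically used with a `Lease`-backed lock today; the older ConfigMap/Endpoints annotation locks plug into the same `resourcelock` interface and rely on the same conflict handling. The lock name, namespace, and timings below are illustrative.

```go
// Sketch of client-go leader election using a Lease-backed resource lock.
package nodemaintenance

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func runWithLeaderElection(ctx context.Context, cs kubernetes.Interface, id string, work func(ctx context.Context)) {
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Name:      "node-maintenance-demo", // illustrative lock name
			Namespace: "kube-system",           // illustrative namespace
		},
		Client: cs.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{
			Identity: id,
		},
	}

	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true,
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: work,
			OnStoppedLeading: func() {
				// Lost or released the lock; stop doing exclusive work.
			},
		},
	})
}
```

The library handles renewal and conflict retries, which is the part the hand-rolled sketches above gloss over.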
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with `/close`. Send feedback to sig-contributor-experience at kubernetes/community.
/remove-lifecycle rotten
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with `/close`. Send feedback to sig-contributor-experience at kubernetes/community.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with `/close`. Send feedback to sig-contributor-experience at kubernetes/community.
/remove-lifecycle rotten
owning-sig: sig-cluster-lifecycle
participating-sigs:
  - sig-node
Reiterating comments from before: IMO this should be reversed.
KEP folder / code owner (NodeLifecycleController) -> KEP owner
other SIGs -> participants
participating-sigs:
  - sig-node
reviewers:
  - "@neolit123"
You can update the list with participants in this KEP review.
reviewers:
  - "@neolit123"
approvers:
  - TBD
IMO, ideally, someone who maintains NodeLifecycleController.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its standard inactivity rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its standard inactivity rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its standard inactivity rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.