Node Maintenance Controller KEP #1080
Conversation
Welcome @yanirq!
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA. It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Hi @yanirq. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
## Motivation
Move kubectl drain into the server to enable execution of node drain through custom resource calls to the API.
This would also reduce the proliferation of drain behaviours in other projects that re-implement the drain logic to suit their needs.
It would be better to highlight the gap left by the existing taint/unschedulable mechanisms.
### Goals
- A new controller that implements the drain logic that is currently [implemented](https://github.com/kubernetes/kubernetes/blob/v1.14.0/pkg/kubectl/cmd/drain/drain.go#L307) in [kubectl](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#drain). |
Why not handle this in the node lifecycle controller?
If it could be incorporated as part of the node lifecycle controller, even better.
This is the initial description, following a project that has already started (https://github.com/kubevirt/node-maintenance-operator), but it makes sense for it to be part of an existing controller.
The doc could be amended to: either a new controller or the node lifecycle controller.
### Goals
- A new controller that implements the drain logic that is currently [implemented](https://github.com/kubernetes/kubernetes/blob/v1.14.0/pkg/kubectl/cmd/drain/drain.go#L307) in [kubectl](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#drain).
- The controller watches and acts upon maintenance CRs to place a node into, or remove it from, maintenance.
The CR seems like a command to the k8s cluster; is that only for maintenance?
The CR can be interpreted as a different way to trigger drain using the k8s API instead of the CLI, opening new options, including adding additional info on top of the CR such as status, reason, custom taints/labels, etc.
Oh, I mean introducing a common CR, named `Command`, for this kind of feature :)
Ah, yes. Here is an example of such a common CR:
https://github.com/kubevirt/node-maintenance-operator/blob/master/deploy/crds/nodemaintenance_crd.yaml
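For illustration, a rough sketch of the Go API types behind such a `NodeMaintenance` CR; the field names below (`NodeName`, `Reason`, `Phase`, `PendingPods`) are assumptions modelled on the linked CRD, not an API defined by this KEP:

```go
// Illustrative only: field names are assumptions, not a final API.
package v1beta1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// NodeMaintenanceSpec names the node to drain and records why.
type NodeMaintenanceSpec struct {
	NodeName string `json:"nodeName"`
	Reason   string `json:"reason,omitempty"`
}

// NodeMaintenanceStatus reports progress of the maintenance operation.
type NodeMaintenanceStatus struct {
	Phase       string   `json:"phase,omitempty"`       // e.g. Running, Succeeded
	PendingPods []string `json:"pendingPods,omitempty"` // pods still awaiting eviction
}

// NodeMaintenance is the CR the controller would watch.
type NodeMaintenance struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   NodeMaintenanceSpec   `json:"spec,omitempty"`
	Status NodeMaintenanceStatus `json:"status,omitempty"`
}
```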
I find the idea of Node Maintenance useful, and I'd like to suggest an optional field for the spec: timeClass/timeEstimate (name TBD). I think that, when taking a machine into maintenance, it's possible to know approximately how long it's going to be in maintenance, and it would be useful if that info could be passed to the API so other entities that care about node state (like those that replicate on local storage) can decide whether to perform an expensive data resizing operation (recreating replicas on another node) or simply wait for the node to come back up.
An initial goal similar to your request was mentioned here. This can certainly be added to the design part if needed.
I signed it
/ok-to-test
## Summary
A node maintenance controller that performs node drain server side, using custom resource calls to the API to perform the drain (cordon + evictions).
Any reason this proposes a CRD and not a core API?
## Motivation
Move kubectl drain into the server to enable execution of node drain through custom resource calls to the API.
This line is a description of the solution, not the motivation. Maybe it can be incorporated into summary or goals?
(https://github.com/kubernetes/kubernetes/issues/25625)
### Goals |
These goals mostly describe the proposed solution. I would prefer to see the goals section describe the new capabilities or problems being solved.
#### Story 1
As a third-party developer/operator, I want to evacuate my workloads appropriately when a node goes into maintenance mode, but I do not want to evacuate them when the node is cordoned.
- Today we cannot differentiate between cordon and drain using cluster information.
I myself don't understand the difference between cordon and drain. Why can we not tell the difference today? Please consider adding a description of the terms somewhere in this proposal.
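For reference: cordon only marks a node unschedulable, while drain additionally evicts the pods already running on it. A minimal client-go sketch of the cordon half, assuming an already-configured `kubernetes.Interface` client (names here are illustrative, not from the KEP):

```go
package drainutil

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// cordon marks the node unschedulable. New pods stop landing on it, but pods
// already running there are left alone; drain is cordon plus eviction of those pods.
func cordon(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	node.Spec.Unschedulable = true
	_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```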
#### Story 4
As an operator, I need to put a number of nodes, i.e. a rack of nodes, into node maintenance.
- As a cluster admin, I would like to shut down a rack (due to an overheating problem); to do that I would need to move its hosts into maintenance.
I think this should explain how the operator's experience changes with this proposal. We already have `kubectl drain` and things like `kubectl get nodes --selector=rack=abc`. How will this proposal improve upon that?
- As a cluster admin, I would like to shut down a rack (due to an overheating problem); to do that I would need to move its hosts into maintenance.
#### Story 5
- As a cluster admin, I would like to perform a firmware update on the nodes.
Is this different from Story 4? Are you proposing any new feature/automation around firmware/bios updates?
### Implementation Details/Notes/Constraints
- Custom taints added to the node by the controller might be overrun by other controllers, such as the node lifecycle controller for the `TaintNodesByCondition` feature or the node controller for the `TaintBasedEvictions` feature.
By "overrun" do you mean "overwritten"? If you are using a custom taint key, I would not expect other controllers to modify your taint.
## Proposal
The node maintenance controller watches for new or deleted custom resources, e.g. `NodeMaintenance`, which indicate that a node in the cluster should either:
I think one key question is whether this should be implemented as a CRD or as part of the `NodeSpec`.
Would this controller be considered a "core" addon? Or an optional addon? Are there any plans to work with sig-cluster-lifecycle to have this controller deployed by default with `kubeadm`?
This is something that we would like to consume within the Cluster API project, and the more integrated it is with the default deployment scenarios the easier it will be for us to adopt.
The proposal here aims for core integration as much as possible, but I guess this could be discussed in later phases.
I would, however, add as many use cases and needs as possible to this KEP to strengthen the case, either in this initial PR or in a follow-up if it gets merged.
### Goals
- A new controller that implements the drain logic that is currently [implemented](https://github.com/kubernetes/kubernetes/blob/v1.14.0/pkg/kubectl/cmd/drain/drain.go#L307) in [kubectl](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#drain).
Do you mind providing a high-level description of the drain logic you reference?
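At a high level, the referenced kubectl drain logic cordons the node and then evicts the pods running on it (skipping mirror pods and DaemonSet-managed pods, and honouring PodDisruptionBudgets). A simplified client-go sketch, reusing the `cordon` helper from the earlier sketch and omitting kubectl's filtering and retry handling:

```go
package drainutil

import (
	"context"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// drain is a simplified version of what kubectl drain does: cordon the node,
// list its pods, and request an eviction for each one. Real drain also skips
// mirror and DaemonSet pods, honours grace periods, and retries evictions
// rejected by PodDisruptionBudgets.
func drain(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	if err := cordon(ctx, client, nodeName); err != nil {
		return err
	}
	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return err
	}
	for i := range pods.Items {
		pod := &pods.Items[i]
		eviction := &policyv1.Eviction{
			ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
		}
		// The Eviction subresource respects PodDisruptionBudgets, unlike a plain delete.
		if err := client.CoreV1().Pods(pod.Namespace).EvictV1(ctx, eviction); err != nil {
			return err
		}
	}
	return nil
}
```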
## Proposal
The node maintenance controller watches for new or deleted custom resources, e.g. `NodeMaintenance`, which indicate that a node in the cluster should either:
- `NodeMaintenance` CR created: move the node into maintenance, cordon the node (set it as unschedulable), and evict the pods (which can be evicted) from that node.
It seems a bit odd to use create/delete as the "spec" for the CRD. Are there any additional options in the CRD spec? What info gets written to the CRD status?
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: yanirq. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
@misterikkit thank you for the review.
There are probably several good reasons for node maintenance intent to be a separate object from the node spec. I am not by any means an API expert. As I understand it, modifying node labels is meant to be pretty restricted for security reasons. Having a separate object allows simpler RBAC to control who is allowed to drain nodes, without granting extra privileges. Also, we can find precedent for adding new core types as
/assign @derekwaynecarr
I think we've identified a need here, but I'm not set on the implementation. What we need is a mechanism to say "I'm working on this node" and to inform other well-behaved parties "Don't mess with this node, it's not your turn." I would call this idea the 'maintenance lock'. This could be an annotation or part of the spec.
This allows a distributed model. Some controller might want to drain (Node Maintenance Operator/Controller), some controller might want to reboot the node (cluster-api), some controller might want to upgrade the node (OpenShift's machine-config-operator), some component might want to schedule a debug pod and then cordon but not drain; the possibilities are many. This being an informative piece, it allows building integrations with monitoring and remediation systems. As far as RBAC, if you have permission to request some "Node Maintenance Object" that directly impacts the node, you probably have admin-level permissions anyway. In any case, there's nothing preventing an operator from deploying something like a "Node Maintenance Controller" that interacts with the node directly, thus achieving the same separation (e.g., if the controller has permission to drain the node, then you have permission to update the drain object).
@michaelgugino your comment about the "maintenance lock" got me thinking about how locks in distributed systems are just leases without liveness, and we already have a NodeLease object. I might suggest we leverage the existing NodeLease API instead of adding a lock - you "take ownership of the node maintenance lease" in order to do maintenance, and we could define various fields for "why does this lease exist". That's slightly orthogonal to this KEP, but if you could link any discussion you have about the maintenance lock here (or we can move that discussion to the other spot) that would help.
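To make the lease idea concrete, a sketch of a per-node `coordination.k8s.io/v1` Lease used as a maintenance lock; the names, namespace, and "reason" annotation below are purely illustrative and not something agreed in the thread:

```go
package drainutil

import (
	"context"

	coordinationv1 "k8s.io/api/coordination/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// acquireMaintenanceLease creates a per-node Lease acting as a "maintenance lock":
// the holder identity records who is working on the node, and the lease duration
// bounds how long other actors should wait before treating the work as abandoned.
// Namespace, naming scheme, and the reason annotation are illustrative only.
func acquireMaintenanceLease(ctx context.Context, client kubernetes.Interface, nodeName, holder, reason string) (*coordinationv1.Lease, error) {
	durationSeconds := int32(3600) // rough estimate of the maintenance window
	now := metav1.NowMicro()
	lease := &coordinationv1.Lease{
		ObjectMeta: metav1.ObjectMeta{
			Name:        "maintenance-" + nodeName,
			Namespace:   "kube-node-lease",
			Annotations: map[string]string{"maintenance.example.io/reason": reason},
		},
		Spec: coordinationv1.LeaseSpec{
			HolderIdentity:       &holder,
			LeaseDurationSeconds: &durationSeconds,
			RenewTime:            &now,
		},
	}
	return client.CoordinationV1().Leases(lease.Namespace).Create(ctx, lease, metav1.CreateOptions{})
}
```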
@smarterclayton that sounds reasonable. There's not really any other fleshed-out discussion, just mentions in a couple of places. I can put together an enhancement issue to track.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with `/close`. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
Hi @yanirq, This PR has not been updated or discussed in over a year. As such, it is not mergeable in its current state, as it does not follow the standard KEP template, which has been changed significantly since this was opened. I am going to close this PR, but when you have a chance to work on this again, please open an issue, update this to follow the KEP process, and please bring this up again in a SIG Node meeting. Thanks!
Proposal for Node Maintenance Controller (server-side drain). As discussed in the sig-node meeting (2019/04/30)
and on kubernetes/kubernetes#25625 (see comment).