Create KEP for instance host labelling #997
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED.
This pull-request has been approved by: misterikkit
If they are not already assigned, you can assign the PR to them by writing `/assign` in a comment. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
<!-- tocstop -->

## Release Signoff Checklist
I don't think you need to keep this section in the doc.
host
- Any decision-making that is based on the instance host label

## Proposal
looks like the rest of this KEP is not written yet.
It's better to only have the context of this KEP :)
Yep, per the KEP template, merging in this state means agreement on the goals and summary, with design details to follow.
I should add that the proposed approach is better than the alternative, IMO.
Would the "instance host" topology be optional for a cloud provider? If not, what is the expected outcome for a bare metal deployment or a cloud provider that does not provide "instance host" information via its APIs?
### Goals

- Introduce a well known label for instance host
- Introduce a controller that keeps these labels updated
Would this warrant a new controller or can we add this functionality to the existing node controller that applies labels for zones and other existing well known labels?
I think we need a separate component (admission webhook, etc) for this feature. The component will be different for various platforms and cloud providers.
This should be part of cloud-controller-manager, as that is the official interface where we get provider-specific info, including zone and region. As far as I can tell, this does not handle the case of a VM migrating to a different failure domain, but that seems like something that should be fixed in cloud-controller-manager.
Good question. Yes, the topology label will be optional, and the scheduler will work as before (spreading among nodes) if the physical host label does not exist.
I think omitting the label if the cloud provider does not support it is reasonable. There is no plan to hard-code this label into scheduler logic. Rather, the "even pod spreading" KEP would have users put this label in their pods' `topologyKey`. Also, if you used the hostname as the physical hostname for bare metal, that would still be semantically correct. 😉
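To make that concrete, here is a minimal sketch (not part of the KEP text) of how a workload could opt into spreading across physical hosts, assuming the `TopologySpreadConstraint` API from the even pod spreading KEP and the label name proposed further down; the container image and the `app: web` selector are made up for illustration:

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// examplePodSpec asks the scheduler to spread matching pods evenly across
// the proposed physical-host topology domain. ScheduleAnyway keeps the
// constraint soft, so pods still schedule on clusters where the cloud
// provider never sets the label.
func examplePodSpec() corev1.PodSpec {
	return corev1.PodSpec{
		Containers: []corev1.Container{{Name: "web", Image: "example.com/web:latest"}},
		TopologySpreadConstraints: []corev1.TopologySpreadConstraint{{
			MaxSkew:           1,
			TopologyKey:       "failure-domain.kubernetes.io/physical-host", // label proposed by this KEP
			WhenUnsatisfiable: corev1.ScheduleAnyway,
			LabelSelector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"app": "web"}, // illustrative selector
			},
		}},
	}
}
```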
[zone-struct]: https://github.com/kubernetes/cloud-provider/blob/0a4f4cbb5a664deb4639d7d9bf5bbde3bb3603c1/cloud.go#L208-L211

When the `PhysicalMachineID` is supplied, cloud-controller-manager will use it
as the value for the `failure-domain.kubernetes.io/physical-host` label.
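As a rough sketch of what that could look like (an assumption, not the settled design), the cloud-provider `Zone` struct linked above, which today holds only `FailureDomain` and `Region`, would grow the proposed field:

```go
package cloudprovider

// Zone describes the placement of an instance as reported by the cloud
// provider. FailureDomain and Region are the existing fields; the
// PhysicalMachineID field sketched here is the addition this KEP proposes:
// an opaque, provider-specific identifier of the physical machine hosting
// the VM. Providers that cannot determine it would leave it empty, and
// cloud-controller-manager would then omit the physical-host label.
type Zone struct {
	FailureDomain     string
	Region            string
	PhysicalMachineID string // proposed addition; empty when unsupported
}
```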
Why "failure-domain" when the others have moved to "topology" ?
To be fair, we haven't moved to failure-domain yet :P but we should use topology.kubernetes.io, assuming #839 is 👍
Intention is to match the others. It's a moving target. (:
Given what we know today this seems pretty unambiguous, and I am fine to codify this label. I deeply dislike the CloudProvider lib at this point, but that's not your fault and this doesn't make it much worse. :)
I expect one day we'll find physical hosts that don't fit this mold, and we'll cross that bridge then.
Related discussion in the mailing list, in case other folks missed it: https://groups.google.com/forum/#!topic/kubernetes-sig-cloud-provider/32N59IYXogY
I've cleaned up the wording a bit. If there is agreement on the overall approach, then we can merge this KEP and start iterating on the design/implementation. (I'll flatten commits now)
/hold
I think this warrants a discussion in the SIG meeting to gauge what other providers think. Can you add it to this week's agenda, please?
I might have mentioned this in the SIG Cloud Provider mailing list thread, but I feel that physical host topology is not any more common than something like rack topology. Should we codify both while we're at it then? If not, can we just go with provider-specific labels? Echoing Tim's sentiment around the cloud provider interface, it has organically become a mess. If we can avoid adding methods like
### Implementation Details/Notes/Constraints [optional]

What are the caveats to the implementation?
Here we should list the challenges of what I used to call "label drift" (see [1] for details), i.e. what happens in clusters where VM placement is dynamic.
Take for example VMware vSphere (this applies to other hypervisors as well, incl. Google Cloud with Live Migration, I suppose), where various events (manual intervention, host maintenance, failure) can lead to a VM being relocated. In the HA case (reboot), the node would possibly re-label itself after boot, but usually administrators (or cluster rebalancing) strive to maintain the desired state (e.g. vSphere DRS affinity/anti-affinity rules), leading to live migration (vMotion) of a VM back to a compliant host. The behavior when/if this happens is non-deterministic, which complicates the matter.
tl;dr: advanced hypervisor features like vSphere DRS/vMotion et al., used in almost every on-prem customer environment for improved cluster utilization, fairness and availability, lead to topology.kubernetes.io/physical-host label drift. According to a discussion on SIG Scheduling, the scheduler code might work fine here (cache invalidation/update with new labels; some kind of relabeler would be needed to change this node label), but that could invalidate former scheduling decisions (e.g. affinity/anti-affinity).
[1]: vSphere Cloud Provider should support implement Zones() interface #64021
Agreed - this area needs careful consideration. Do you think that should be mentioned in the summary/motivation section, or in the follow-up PR where we start fleshing out the design?
I think this warrants discussion now because it may determine the validity of the topology label in the first place
+1 on what @andrewsykim said
Generally speaking, Kubernetes scheduling decisions are valid at the moment they are made. There is no guarantee that they remain valid in the future. For example, Pod B is scheduled on the same node that runs Pod A when Pod B has inter-pod affinity to Pod A, but a second later Pod A may terminate. Similarly, node labels may be updated/deleted. These changes may invalidate NodeAffinity and NodeSelector constraints of pods.
We have accepted the fact that clusters are dynamic in nature and scheduling decisions may be invalidated in the future. This is an assumption that other new features should make too.
We may one day add "Descheduler" as a standard component to find such invalidations and deschedule pods. Whether we add a descheduler or not, IMO the fact that dynamic cluster changes may invalidate a scheduling decision should not be a blocker for new features.
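To illustrate that caveat with the label under discussion (a sketch in Go using the core/v1 API types; the `app=pod-a` selector and function name are made up): a pod admitted with the affinity below stays put even if the node's physical-host label later changes after a migration, because the constraint is only evaluated at scheduling time.

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// physicalHostAffinity co-locates a pod with pods labelled app=pod-a on the
// same physical host. The scheduler checks this once at admission; if the
// node's host label drifts afterwards (e.g. vMotion), the already-running
// pod is not rescheduled unless something like a descheduler evicts it.
func physicalHostAffinity() *corev1.Affinity {
	return &corev1.Affinity{
		PodAffinity: &corev1.PodAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{{
				LabelSelector: &metav1.LabelSelector{
					MatchLabels: map[string]string{"app": "pod-a"}, // illustrative selector
				},
				TopologyKey: "failure-domain.kubernetes.io/physical-host", // label proposed by this KEP
			}},
		},
	}
}
```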
/close
Closing in favor of #1127
@andrewsykim: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This KEP proposes a new well-defined label for VM instance host as a
topology type. In this commit I have only written the summary,
motivation, and alternatives section so that we can get agreement on the
overall approach before discussing design.
/sig cloud-provider
/sig scheduling
/ref kubernetes/kubernetes#75274