
Create KEP for instance host labelling #997

Closed
wants to merge 1 commit into from

Conversation

misterikkit

This KEP proposes a new well-defined label for VM instance host as a
topology type. In this commit I have only written the summary,
motivation, and alternatives section so that we can get agreement on the
overall approach before discussing design.

/sig cloud-provider
/sig scheduling

/ref kubernetes/kubernetes#75274

@k8s-ci-robot k8s-ci-robot added sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 24, 2019
@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Apr 24, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: misterikkit
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: hogepodge

If they are not already assigned, you can assign the PR to them by writing /assign @hogepodge in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


<!-- tocstop -->

## Release Signoff Checklist
Member

I don't think you need to keep this section in the doc.

host
- Any decision-making that is based on the instance host label

## Proposal
Member

looks like the rest of this KEP is not written yet.

Member

It's better to only have the context of this KEP :)

Author

Yep, per the KEP template, merging in this state means agreement on the goals and summary, with design details to follow.

@bsalamat
Member

I should add that the proposed approach is better than the alternative IMO.

Member

@andrewsykim left a comment

Would the "instance host" topology be optional for a cloud provider? If not, what is the expected outcome for bare metal deployments or a cloud provider that does not provide "instance host" information via its APIs?

### Goals

- Introduce a well known label for instance host
- Introduce a controller that keeps these labels updated
Member

Would this warrant a new controller or can we add this functionality to the existing node controller that applies labels for zones and other existing well known labels?

Member

I think we need a separate component (admission webhook, etc) for this feature. The component will be different for various platforms and cloud providers.

Author

This should be part of cloud-controller-manager, as that is the official interface where we get provider-specific info, including zone and region. As far as I can tell, this does not handle the case of a VM migrating to a different failure domain, but that seems like something that should be fixed in cloud-controller-manager.
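
For concreteness, here is a minimal sketch (mine, not part of the KEP) of what that labeling step inside cloud-controller-manager might look like. The helper name, the way the provider reports the host, and the final label key are all assumptions; reconciling on every sync is what would also pick up a VM that has migrated to a different host.

```go
// Hypothetical sketch of a physical-host label sync step in a cloud
// controller. Nothing here is an agreed API; the label key may end up under
// topology.kubernetes.io/ depending on #839.
package hostlabel

import (
	"context"
	"encoding/json"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

const physicalHostLabel = "failure-domain.kubernetes.io/physical-host" // proposed key

// syncPhysicalHostLabel patches the node when the cloud-reported host differs
// from the current label value, covering both initial labeling and the case
// where a VM has moved to a different physical host.
func syncPhysicalHostLabel(ctx context.Context, cs kubernetes.Interface, nodeName, reportedHost string) error {
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if node.Labels[physicalHostLabel] == reportedHost {
		return nil // already up to date
	}
	patch, err := json.Marshal(map[string]interface{}{
		"metadata": map[string]interface{}{
			"labels": map[string]string{physicalHostLabel: reportedHost},
		},
	})
	if err != nil {
		return err
	}
	if _, err := cs.CoreV1().Nodes().Patch(ctx, nodeName, types.StrategicMergePatchType, patch, metav1.PatchOptions{}); err != nil {
		return fmt.Errorf("labeling node %s: %w", nodeName, err)
	}
	return nil
}
```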

@bsalamat
Member

> Would the "instance host" topology be optional for a cloud provider? If not, what is the expected outcome for bare metal deployments or a cloud provider that does not provide "instance host" information via its APIs?

Good question. Yes, the topology label will be optional and the scheduler works as before (spreads among nodes) if the physical host label does not exist.

@misterikkit
Author

> Would the "instance host" topology be optional for a cloud provider? If not, what is the expected outcome for bare metal deployments or a cloud provider that does not provide "instance host" information via its APIs?

> Good question. Yes, the topology label will be optional and the scheduler works as before (spreads among nodes) if the physical host label does not exist.

I think omitting the label if the cloud provider does not support it is reasonable*. There is no plan to hard-code this label into scheduler logic. Rather, the "even pod spreading" KEP would have users put the topologyKey into a new struct of the pod spec.

Also, if you used hostname as physical hostname for bare metal, that would still be semantically correct. 😉

  \* We currently have inter-pod affinity in the scheduler, where users are warned not to use a topology key that isn't supported on the nodes they are using. I see this as an incentive for providers to label their nodes well.
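
To make the intended usage concrete, here is a sketch of how the proposed key would plug into the existing inter-pod anti-affinity API; the only assumption beyond standard core/v1 types is the label key itself, which is still subject to #839.

```go
// Illustrative only: spreading replicas of an "app=db" workload across
// physical hosts using the standard core/v1 anti-affinity types and the
// label key proposed in this KEP.
package example

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// spreadAcrossPhysicalHosts keeps new "db" replicas off nodes that share a
// physical host with an existing "db" replica.
func spreadAcrossPhysicalHosts() *corev1.Affinity {
	return &corev1.Affinity{
		PodAntiAffinity: &corev1.PodAntiAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{{
				TopologyKey: "failure-domain.kubernetes.io/physical-host", // proposed label
				LabelSelector: &metav1.LabelSelector{
					MatchLabels: map[string]string{"app": "db"},
				},
			}},
		},
	}
}
```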

[zone-struct]: https://github.com/kubernetes/cloud-provider/blob/0a4f4cbb5a664deb4639d7d9bf5bbde3bb3603c1/cloud.go#L208-L211

When the `PhysicalMachineID` is supplied, cloud-controller-manager will use it
as the value for the `failure-domain.kubernetes.io/physical-host` label.
Member

Why "failure-domain" when the others have moved to "topology" ?

Member

To be fair, we haven't moved to failure-domain yet :P but we should use topology.kubernetes.io assuming #839 is 👍

Author

Intention is to match the others. It's a moving target. (:
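
To make the mapping in the excerpt above concrete, here is a minimal sketch assuming a hypothetical `PhysicalMachineID` field on the cloud provider's `Zone` struct; the zone/region keys shown are today's beta labels, and the physical-host key may change depending on #839.

```go
// Sketch of the shape the KEP text implies. The PhysicalMachineID field is
// hypothetical; the real cloud-provider Zone struct currently carries only
// FailureDomain and Region.
package example

// Zone mirrors the existing cloud-provider Zone struct, extended with the
// proposed optional field.
type Zone struct {
	FailureDomain     string
	Region            string
	PhysicalMachineID string // optional; empty when the provider cannot report it
}

// topologyLabels converts a Zone into the node labels that
// cloud-controller-manager would apply; the physical-host label is simply
// omitted when the provider does not supply a value.
func topologyLabels(z Zone) map[string]string {
	labels := map[string]string{
		"failure-domain.beta.kubernetes.io/zone":   z.FailureDomain,
		"failure-domain.beta.kubernetes.io/region": z.Region,
	}
	if z.PhysicalMachineID != "" {
		labels["failure-domain.kubernetes.io/physical-host"] = z.PhysicalMachineID
	}
	return labels
}
```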

Member

@thockin left a comment

Given what we know today this seems pretty unambiguous, and I am fine to codify this label. I deeply dislike the CloudProvider lib at this point, but that's not your fault and this doesn't make it much worse. :)

I expect one day we'll find physical hosts that don't fit this mold, and we'll cross that bridge then.

@andrewsykim
Member

Related discussion in the mailing list, in case other folks missed it: https://groups.google.com/forum/#!topic/kubernetes-sig-cloud-provider/32N59IYXogY

@misterikkit
Author

I've cleaned up the wording a bit. If there is agreement on the overall approach, then we can merge this KEP and start iterating on the design/implementation. (I'll flatten commits now)

@andrewsykim
Member

/hold

I think this warrants a discussion in the SIG meeting to gauge what other providers think. Can you add it to this week's agenda please?

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 29, 2019
@andrewsykim
Member

andrewsykim commented Apr 29, 2019

I might have mentioned this in the SIG Cloud Provider mailing list thread, but I feel that physical host topology is not any more common than something like rack topology. Should we codify both while we're at it, then? If not, can we just go with provider-specific labels?

Echoing Tim's sentiment around the cloud provider interface: it has organically become a mess. If we can avoid adding methods like `GetPhysicalHostByProviderID` I would prefer that; the alternative would be folding the existing `Get*ByProviderID` methods into a generic method like `GetTopologyLabelsForNode` that only allows a pre-defined set of well-known label keys (zone, region, physical-host, rack, etc.). Maybe this is an implementation detail at this point, but given the current state of the interface I would like to put more thought into how the interface will change before moving forward.
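
For illustration, a rough sketch (not an agreed design) of what the generic method described above could look like; the interface name, the signature, and the allowed key set are all assumptions.

```go
// Sketch of folding the per-key Get*ByProviderID methods into one call that
// only returns a pre-defined set of well-known topology keys.
package example

import (
	"context"

	v1 "k8s.io/api/core/v1"
)

// allowedTopologyKeys is the pre-defined set a provider would be allowed to
// return; the rack key is purely hypothetical.
var allowedTopologyKeys = map[string]bool{
	"topology.kubernetes.io/zone":          true,
	"topology.kubernetes.io/region":        true,
	"topology.kubernetes.io/physical-host": true, // proposed in this KEP
	"topology.kubernetes.io/rack":          true, // hypothetical
}

// TopologyProvider is the hypothetical generic interface.
type TopologyProvider interface {
	// GetTopologyLabelsForNode returns topology labels for the node, limited
	// to keys in allowedTopologyKeys.
	GetTopologyLabelsForNode(ctx context.Context, node *v1.Node) (map[string]string, error)
}

// filterTopologyLabels drops anything a provider returns outside the
// well-known set, so arbitrary labels cannot sneak in through this path.
func filterTopologyLabels(in map[string]string) map[string]string {
	out := make(map[string]string, len(in))
	for k, v := range in {
		if allowedTopologyKeys[k] {
			out[k] = v
		}
	}
	return out
}
```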


### Implementation Details/Notes/Constraints [optional]

What are the caveats to the implementation?
Member

Here we should list the challenges of what I used to call "label drift" (see [1] for details), i.e. what happens in clusters where VM placement is dynamic.

Take for example VMware vSphere (this applies to other hypervisors as well, incl. Google Cloud with Live Migration, I suppose), where various events (manual intervention, host maintenance, failure) can lead to a VM being relocated. In the case of HA (reboot), the node would possibly re-label itself after boot, but administrators (or cluster rebalancing) usually strive to maintain the desired state (e.g. vSphere DRS affinity/anti-affinity rules), leading to live migration (vMotion) of a VM back to a compliant host. When or whether this happens is non-deterministic, which complicates the matter.

tl;dr: advanced hypervisor features like vSphere DRS/vMotion et al., used in almost every on-prem customer environment for improved cluster utilization, fairness, and availability, lead to `topology.kubernetes.io/physical-host` label drift. According to a discussion in SIG Scheduling, the scheduler code might work fine here (cache invalidation/update with new labels; some kind of relabeler would be needed to change this node label), but the drift could invalidate earlier scheduling decisions (e.g. affinity/anti-affinity).

[1]: vSphere Cloud Provider should support implement Zones() interface #64021

Author

Agreed - this area needs careful consideration. Do you think that should be mentioned in the summary/motivation section, or in the follow up PR where we start fleshing out the design?

Member

I think this warrants discussion now because it may determine the validity of the topology label in the first place

Member

+1 on what @andrewsykim said

Member

@bsalamat May 2, 2019

Generally speaking, Kubernetes scheduling decisions are valid at the moment they are made. There is no guarantee that they remain valid in the future. For example, Pod B is scheduled on the same node that runs Pod A when Pod B has inter-pod affinity to Pod A, but a second later Pod A may terminate. Similarly, node labels may be updated/deleted. These changes may invalidate NodeAffinity and NodeSelector constraints of pods.
We have accepted the fact that clusters are dynamic in nature and scheduling decisions may be invalidated in the future. This is an assumption that other new features should make too.
We may one day add "Descheduler" as a standard component to find such invalidations and deschedule pods. Whether we add a descheduler or not, IMO the fact that dynamic cluster changes may invalidate a scheduling decision should not be a blocker for new features.

@andrewsykim
Member

/close

Closing in favor of #1127

@k8s-ci-robot
Contributor

@andrewsykim: Closed this PR.

In response to this:

> /close
>
> Closing in favor of #1127

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
