KEP: EndpointSlice API #1086

Merged · 1 commit merged into kubernetes:master on Jul 26, 2019

Conversation

@freehan (Contributor) commented Jun 4, 2019

The proposal doc has been circulating for quite a while. https://docs.google.com/document/d/1sLJfolOeEVzK5oOviRmtHOHmke8qtteljQPaDUEukxY/edit

Formally converting the proposal into a KEP

@k8s-ci-robot (Contributor):

Welcome @freehan!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/network Categorizes an issue or PR as relevant to SIG Network. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 4, 2019
@k8s-ci-robot k8s-ci-robot requested review from dcbw and thockin June 4, 2019 17:02
@freehan (Contributor Author) commented Jun 4, 2019

/assign thockin

@freehan (Contributor Author) commented Jun 5, 2019

cc: @wojtek-t @bowei @johnbelamaric

type EndpointSlice struct {
metav1.TypeMeta
metav1.ObjectMeta
Spec EndpointSliceSpec
Member:

no status?

Contributor Author:

No status for now. There is no consumer/provider for EndpointSlice status. Unless we want to allow all kube-proxy to feedback into it.

Contributor:

I feel like it's more "all status" than it is "all spec"...

Contributor Author:

After a second look, it actually feels like "all status". But I guess it depends on the perspective:

  1. From the service point of view, endpoints are like its status.
  2. From the consumer (kube-proxy) point of view, endpoints are like a spec they need to implement. The EndpointSlice object status would be more of a placeholder for consumer feedback (similar to how kubelet implements podSpec and reports back podStatus).

Contributor Author:

FYI @apelisse
We discussed this in sig-network and decided to keep it dumb by making it an object without spec and status, focusing only on carrying data.

}

type EndpointPort struct {
// Optional
Member:

Should this be:

optional if there is only one Port.

Name string
// Required
Protocol Protocol
// Optional: If unspecified, port remapping is not implemented
Member:

what does this mean? I see that this is a required field in the current EndpointPort

Contributor Author:

will add a section describing this.

Ports []EndpointPort
}

type EndpointPort struct {
Member:

isn't EndpointPort name already taken?

Member:

also, should we make sure the names are less clashing with v1.Endpoint structs...

Contributor Author:

I thought we were moving to the networking API group. Does it matter?

}

type Endpoint struct {
// Required: must contain at least one IP.
Member:

Probably want to copy over the comment from the dual stack PR about the semantics of multiple IPs?

// Required: must contain at least one IP.
IPs []net.IP
// Optional
Hostname string
Member:

If Hostname is optional, why is it not a pointer field?

Contributor Author:

This derives from the hostname field in the pod spec. That field is optional but not a pointer either. Not sure of the reasoning at the time.

Member:

It should be a pointer. Pod should probably be changed.

Contributor Author:

Okay. Changed it to a pointer.

Member:

If this is included in backends, why would it appear separately here?

TargetRef *ObjectReference
}

type EndpointCondition struct {
Member:

EndpointConditions?

}

```

Member:

Might be worthwhile to give an example section with the YAML output etc

Contributor Author:

Ack will add

@thockin (Member) left a comment:

Let's talk about topology and Services.

When we talked about this before, we sort of ruled out having topology information in Endpoints because of bloat. If we are really trying to make EndpointSlice more general-purpose, I think we need to reconsider this. I'd love for systems like Istio to be able to glean topology information from kubernetes and "go watch all nodes" seems unfortunate. For other arbitrary consumers of endpoints it seems even worse.

So what would change if we wanted to include topology information (basically the labels of the node a pod is running on) inside EndpointSlice somewhere?

Would we encode it as a topology per slice? That stinks because of hostname. Maybe we special-case hostname? Any other unique-per-node label would kill us, too.

Would we encode it per Endpoint[] element? It would be bloaty and somewhat redundant, but correct.

Would we pre-cook node labels into a new resource (in the same apigroup as EndpointSlice) and simply reference that from each Endpoint[] ?

@wojtek-t @johnbelamaric @m1093782566
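
For illustration, option 2 above (topology recorded per Endpoint element) could look roughly like the sketch below, borrowing the per-Endpoint Topology map that shows up later in this KEP draft; this is an illustration, not a final API shape.

```go
package discovery

// Simplified stand-in type, for illustration only.
type Endpoint struct {
	// Targets of the endpoint (IP addresses in the common case).
	Targets []string
	// Topology carries the labels of the node the backing pod runs on,
	// e.g. kubernetes.io/hostname or topology.kubernetes.io/zone.
	// Unique-per-node keys such as hostname are what make per-slice
	// topology impractical, hence the per-endpoint map in this sketch.
	Topology map[string]string
}
```

It is bloaty and somewhat redundant across endpoints on the same node, which is exactly the trade-off weighed above against a separate pre-cooked node-labels resource.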

- Leave room for foreseeable extension:
  - Support multiple IPs per pod
  - More endpoint states than Ready/NotReady
  - Dynamic endpoint subsetting
Member:

Can we add a secondary goal of moving the API towards being more general-purpose and less locked into Kubernetes patterns?

Contributor Author:

Ack


## Proposal

### EndpointSlice API
Member:

Let's argue about names. Slice? Set? Something else more descriptive?

Contributor Author:

Using Slice is mostly to avoid confusion with Subset.
Slice is sort of neutral to subsetting or overlapping.

Member:

Slice implies a piece of a whole, rather than a whole (as in Set). It's not horrible, but I wanted an anchor point for any debate :)

Member:

I'm curious how this specific set is selected? I'm not sure I understand what this means or how this works.

Member:

It's just a chunk. A semi-random subset of a whole.

Member:

You mentioned a cross-correlation between Endpoints and Ports. How are these separated in slices? I think the API is sort of broken if objects can't live independently. If we do accept that for trade-off reasons, then it needs to be very explicit. I like chunk more than slice because it conveys the fact that a chunk doesn't make sense on its own (all chunks are needed in order to use it).

Member:

For a given slice the whole Ports[] block applies to all Endpoints[]. So pathologically, if every endpoint has a different port number, they would all get slices of length 1.
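
To make the cross product concrete, here is a small sketch using simplified stand-ins for the draft types (field shapes changed during review, so this is illustrative only):

```go
package main

// Simplified stand-ins for the draft KEP types, for illustration only.
type EndpointPort struct {
	Name     string
	Protocol string
	Port     *int32
}

type Endpoint struct {
	Targets []string
}

type EndpointSliceSpec struct {
	Endpoints []Endpoint
	Ports     []EndpointPort
}

func main() {
	port80, port8080 := int32(80), int32(8080)

	// One slice suffices when every endpoint serves the same port:
	// the whole Ports block applies to every entry in Endpoints.
	_ = EndpointSliceSpec{
		Ports:     []EndpointPort{{Name: "http", Protocol: "TCP", Port: &port80}},
		Endpoints: []Endpoint{{Targets: []string{"10.0.0.1"}}, {Targets: []string{"10.0.0.2"}}},
	}

	// An endpoint serving a different port number cannot share the slice,
	// so in the pathological case each endpoint lands in its own slice of length 1.
	_ = EndpointSliceSpec{
		Ports:     []EndpointPort{{Name: "http", Protocol: "TCP", Port: &port8080}},
		Endpoints: []Endpoint{{Targets: []string{"10.0.0.3"}}},
	}
}
```

Keeping one Ports block per slice avoids repeating port data on every endpoint, at the cost of forcing endpoints with different ports into different slices.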

keps/sig-network/20190603-EndpointSlice-API.md (outdated)
Port *int32
}

type Endpoint struct {
Member:

Comments. Something like:

Endpoint describes a single backend instance (e.g. Pod) of an abstract front-end (e.g. Service).

Contributor Author:

Ack


type Endpoint struct {
// Required: must contain at least one IP.
IPs []net.IP
Member:

I think we need to consider non-IP endpoints, if we want this API to be more general-purpose (and I think we do). Specifically, allowing hostnames here and maybe even URLs.

Granted, kube-proxy could not easily set up iptables or ipvs for k8s services that had hostnames as backends, but a) Services should have Endpoints generated by controllers and b) other consumers of endpoints could be smarter.

What if we named this "address" or "backend" or similar?

Member:

Also, we don't generally use net.IP in API structs; use string.

Member:

Also, we need to spec that all addresses here are assumed to be fungible.

Contributor Author:

Ack. Changed it to Backends []string

For self managed EndpointSlice objects, this label is not required.

## Estimation
This section provides comparisons between Endpoints API and EndpointSlice API under 3 scenarios:
Member:

I actually think we should have 2 tables - one for 20K endpoints (edge case) and one for 20 endpoints (average case).

We know that MOST services have less than 100 endpoints, so while 20K dominates the scalability considerations, <100 dominates the practicality considerations.

All this leads to answering "why not 1 per slice?" issues.

The point being to make it very clear

Member:

+1

Contributor Author:

Added

On Service Create/Update/Delete:
- `syncService(svc)`

On Pod Create/Update/Delete:
Member:

Also should describe the reconciliation (because edge cases) - e.g. how do we detect / handle finding EndpointSlices that have had their Service removed (where we might not have gotten notified, e.g. we were down) ? I suspect you'll need a label that just says "is managed by service-endpoint controller" and then compare against known services and find leftovers. But be careful not to get this wrong and nuke real services!

Also how we handle labels changing

etc, how do we make sure it all stays in sync

Member:

Can't we use OwnerReferences for this purpose?

  1. If we create an EndpointSlice based on the service, we add an owner-ref, and when the service is deleted, GC will delete the EndpointSlice objects for us.
  2. If a user creates an EndpointSlice, they don't set the ownerRef, and then they fully own its lifecycle.

Contributor:

Sounds feasible
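
For concreteness, a minimal sketch of the owner-reference approach, assuming the standard apimachinery types; the controller wiring is omitted and this is not the controller's actual code:

```go
package main

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

// newEndpointSliceOwnerRefs returns the ownerReferences a controller-managed
// EndpointSlice would carry so that garbage collection cleans it up when the
// owning Service is deleted. Self-managed EndpointSlices simply omit this.
func newEndpointSliceOwnerRefs(serviceName string, serviceUID types.UID) []metav1.OwnerReference {
	controller := true
	blockOwnerDeletion := true
	return []metav1.OwnerReference{{
		APIVersion:         "v1",
		Kind:               "Service",
		Name:               serviceName,
		UID:                serviceUID,
		Controller:         &controller,
		BlockOwnerDeletion: &blockOwnerDeletion,
	}}
}

func main() {
	_ = newEndpointSliceOwnerRefs("my-service", "uid-from-the-service-object")
}
```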

- If Pod needs to be removed:
  - Remove the endpoint from EndpointSlice.
  - If the # of endpoints in EndpointSlice is less than ¼ of the threshold:
    - Try to pack the endpoints into a smaller number of EndpointSlice objects.
Member:

Specifically, we should be really clear that we will not simply rewrite ALL of the EndpointSlices on a removal. The goal is stability, so maybe spelling out a heuristic, eg.

    if num-endpoints in slice S1 < min-endpoints (1/4 of max, by flag) and num-slices > 1 {
        try to disperse endpoints from S1 into other slices
        if total number of writes needed > min(num-slices/4, 4), don't bother  // all values pulled from thin air
    }

You get the point - the idea is to balance compaction with not rewriting too many slices. I'd actually argue the values should be very lax to start, e.g. allow as few as 5% or 10% of a slice before forcing compaction, which obviously affects the other variables.

Also be careful not to let users get into unsolvable situations :) Not all of these variables are independent, I think.

Might be worth writing a simulation to see if there's a natural pattern - I feel like this is, but I can't quite place my finger on it right now.

Contributor Author:

Will write down the detailed algorithm in this section. Added a requirements section above under implementation.

Member:

Consider also attempting to pack the noisiest pods together, so their updates can be combined.

Member:

> Consider also attempting to pack the noisiest pods together, so their updates can be combined.

That is an implementation detail and we can also change the packing policy later.

In order to ensure backward compatibility for external consumers of the core/v1 Endpoints API, the existing K8s endpoint controller will keep running until the API is EOL. The following limitations will apply:

- Starting from EndpointSlice beta: If # of endpoints in one Endpoints object exceed 100, generate a warning event to the object.
- Starting from EndpointSlice GA: Only include up to 500 endpoints in one Endpoints Object.
Member:

500 seems low, I'd argue for 1000 unless the math is bad?


Based on the data collected from user clusters, the vast majority (> 99%) of k8s services have fewer than 100 endpoints. For small services, the EndpointSlice API will make no difference. If the MaxEndpointThreshold is too small (e.g. 1 endpoint per EndpointSlice), the controller loses the capability to batch updates, causing worse write amplification on service creation/deletion and scale up/down. Etcd write RPS is a significant limiting factor.

- Why do we have a status struct for each endpoint? Why not boolean state for readiness?
Member:

we don't any more?

type EndpointPort struct {
// The name of this port (corresponds to ServicePort.Name).
// Must be a DNS_LABEL.
// Optional only if one port is defined.
Member:

This conditional "requiredness" is a little confusing. Can we group the fields in a different way?

Member:

Sadly, this is how the legacy API is written. But we should do better.

Contributor Author:

Yeah, I agree it is confusing. But it is currently everywhere, from pod spec, service spec all the way to endpoints.

Member:

Since this is a new resource we can change the rules. We should think of this from the POV of the api machinery and the use-cases. Do we want to allow people to patch these lists or do we see them as atomic? I think addresses certainly wants to be patched, but ports I am less 100% sure.

To be patchable we need a good merge key, so optional values are not great for that.

We could argue that the merge key is "port number + protocol", and name is just a decoration. Even then, sometimes optional is blech. Also, protocol is optional-with-default, which makes the machinery ugly.

We could argue that the merge key is name, and that "" is a valid name. We'd have to look at places that do named-port lookups against endpoints and make sure that "" is allowed.

That still leaves a bit of ugly in that we really should be able to use the same name across protocols (e.g. DNS has them as different fields in SRV) but that, again, is made ugly by protocol-defaulting.

So I think my position is that name is the (required) merge key and "" is a valid name (not omitempty). Looking at @apelisse for more formal guidance. It seems to me that sometimes-optional should be described somewhere as an anti-pattern.

Member:

> To be patchable we need a good merge key, so optional values are not great for that.

Yes, merge key(s) should not be optional.
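
For illustration, a hedged sketch of declaring name as the required merge key, assuming the standard strategic-merge-patch struct tags and server-side-apply list markers; names and placement are not final:

```go
package discovery

// Simplified stand-in for the draft EndpointPort type, for illustration only.
type EndpointPort struct {
	Name     string `json:"name"`
	Protocol string `json:"protocol,omitempty"`
	Port     *int32 `json:"port,omitempty"`
}

// Ports is keyed by name for patching; "" is treated as a valid, required key
// rather than an optional field, per the discussion above.
type EndpointMeta struct {
	// +listType=map
	// +listMapKey=name
	Ports []EndpointPort `json:"ports" patchStrategy:"merge" patchMergeKey:"name"`
}
```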


type EndpointSliceSpec struct {
Endpoints []Endpoint
Ports []EndpointPort
Member:

Are these two lists independent from each other, or are there some hidden implicit connections between their items?

Contributor Author:

The hidden implicit connection is that each endpoint in []Endpoint exposes the ports in []EndpointPort.

This is mostly to avoid duplicating information per Endpoint.

The new EndpointSlice API aims to address existing problems as well as leaving room for future extension.


## Motivation
Contributor:

Can you explain the use case for having multiple megabytes of endpoints for a single service in the first place?

Member:

This is how people are still using k8s, and I've heard different use cases.
The last one I heard is that they have a single-threaded app and they need thousands of them to serve the traffic.

- increase the etcd size limits
- endpoints controller batches / rate limits changes
- apiserver batches / rate-limits watch notifications
- apimachinery to support object level pagination
Contributor:

I think more discussion of the alternatives, and why you think EndpointSlice is the best one, would be useful. Are any other parts of kubernetes running into scaling issues that might be fixable with one of these more-generic fixes?

Another alternative would be to have kube-proxy just ignore Endpoints, and instead watch Pods and compute the endpoints itself. (This has similar properties to "1 Endpoint per EndpointSlice" in some ways but not in others.)

Contributor:

^ Agreed, this is a brand new pattern, and would set a substantial precedent.

Member:

None of the generic solutions would really solve the problem - they just help to some extent with what I've heard users asking for.

Watching all pods by all nodes would kill us :)

Contributor:

> Watching all pods by all nodes would kill us :)

Well, note that some (many?) network plugins have to do that to implement NetworkPolicy currently, so a more generic fix for the Endpoints problem would help with NetworkPolicy scaling too.

Contributor Author (@freehan, Jun 25, 2019):

We have briefly explored the alternatives. The batch/rate-limiting mechanisms essentially trade supporting a large # of endpoints against something like endpoint update latency (we have heard complaints about some specific cases already). In order to achieve an optimal result (low latency + large scale), it can get quite complex when more dimensions are considered, like # of nodes etc. Hence, it is equal or more work for apimachinery to implement these features, with less performance gain.

In addition, slicing up endpoints aligns better with the established pattern for load balancer scalability: endpoint subsetting. Thus, it opens the door for future extension in this direction.

Contributor Author:

I feel like scalability for network policy is very different from endpoints scalability. It heavily depends on the underlying implementation.

The worst case is "Watching all pods by all nodes".
The ideal case is like "Only watch pods on its own node".

So I do not think this particular proposal would help in that sense.

However, if one day we decide to go with service policy, then this proposal will automatically benefit the implementation.

Contributor:

So we almost discussed this in the SIG meeting last night but were then running out of time. Anyway, I found EndpointSlice to be a fairly surprising proposal, and not how I would have guessed that people would have wanted to address this issue, so I wanted to see more discussion (in the KEP, documented for posterity) about why you ended up going with this approach rather than the alternatives.

Contributor Author:

Added "Why not pursue the alternatives?" to the FAQ section.



### Kube-Proxy

Kube-proxy will be modified to consume EndpointSlice instances in addition to the Endpoints resource. A flag will be added to kube-proxy to toggle the mode.
Member:

Is there a way to autodetect this? eg start with Endpoints but the moment we see an EndpointSlice switch to watching those? Seems like a shame to add another flag.

Contributor Author:

How about this?
The controller creates a config map with a specific name.
kube-proxy tries to get the config map on start-up. If the config map exists, it uses EndpointSlice; if not, it falls back to Endpoints.
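
A rough sketch of that detection flow, assuming the modern client-go method signatures and a hypothetical ConfigMap name (neither is specified by this KEP):

```go
package proxyconfig

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// useEndpointSlice reports whether kube-proxy should consume EndpointSlice.
// The ConfigMap name "endpointslice-enabled" is hypothetical.
func useEndpointSlice(ctx context.Context, client kubernetes.Interface) bool {
	_, err := client.CoreV1().ConfigMaps("kube-system").Get(ctx, "endpointslice-enabled", metav1.GetOptions{})
	if err != nil {
		// ConfigMap missing or unreadable: fall back to the v1 Endpoints API.
		fmt.Printf("falling back to Endpoints: %v\n", err)
		return false
	}
	return true
}
```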

Contributor (@imroc, Jul 21, 2019):

If all goes well, I think EndpointSlice and TopologyAwareServiceRouting go Alpha together in 1.16, where EndpointSlice implements the controller only and TopologyAwareServiceRouting watches nodes for topology information. EndpointSlice and TopologyAwareServiceRouting then go Beta together in 1.17, where kube-proxy will consume EndpointSlice, and TopologyAwareServiceRouting also relies on this. So what about consuming EndpointSlice when TopologyAwareServiceRouting is enabled? I can help to implement this.

// Required: must contain at least one backend.
// This can be an IP, hostname or URL.
// The consumer (e.g. kube-proxy) decides how to handle the backends.
Backends []string
Member:

This needs some explanation around how e.g. kube-proxy would handle Backends of type URL/hostname. E.g. if a given Service Proxy can't handle one of the types, what is the expected behavior?

Contributor Author:

We currently state that the consumer determines how it handles different types of backends.

Member:

What is the use case for URL or hostname?

What is the use case for more than one entry?

If more than one, how are consumers supposed to choose which to use?

What if some of the backends are ready but not all?

Contributor Author (@freehan, Jul 2, 2019):

Hostname already exists in the v1 Endpoints API. It is used by KubeDNS to resolve pods by hostname.

Allowing more than one entry was originally aimed at supporting dual stack, where each pod has one v4 and one v6 IP.

I think the Dual Stack KEP states that this effectively kicks it back to the consumer. For instance, kube-proxy can pick the 1st v4 IP and the 1st v6 IP to program.

As one pod might have multiple IPs, we only consider pod readiness. So if the pod is ready, then all of its backends are ready.

For URL, there is no consumer yet. But the goal is to evolve the K8s Endpoints API towards a generic service discovery API so that it can be consumed by 3rd-party solutions.
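
A small consumer-side sketch of that dual-stack choice: pick the first v4 and the first v6 entry from a mixed backend list. This is an illustration of the behavior described above, not kube-proxy code.

```go
package main

import (
	"fmt"
	"net"
)

// pickDualStack returns the first IPv4 and first IPv6 backend from a list that
// may also contain hostname or URL entries, which this consumer skips.
func pickDualStack(backends []string) (v4, v6 string) {
	for _, b := range backends {
		ip := net.ParseIP(b)
		if ip == nil {
			continue // hostname or URL backend: not handled by this consumer
		}
		if ip.To4() != nil {
			if v4 == "" {
				v4 = b
			}
		} else if v6 == "" {
			v6 = b
		}
	}
	return v4, v6
}

func main() {
	v4, v6 := pickDualStack([]string{"10.0.0.5", "fd00::5", "10.0.0.6"})
	fmt.Println(v4, v6) // 10.0.0.5 fd00::5
}
```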

Member:

I'd strongly consider splitting these into a more helpful system. E.g.:

Hostnames []string
IPAddresses []string
URLs []string
FQDNS []string
...

Alternatively, you could do a list with type and value fields. The way it is now, you can't really validate an IP address (maybe it's a funny looking DNS name with an in-house TLD).

As I understand it, these are different ways of reaching the exact same backend. I think "Aliases" or "Names" would be much less confusing names than "Backends".

Contributor Author (@freehan, Jul 2, 2019):

Yes. I am considering adding a field to describe the backend type, so that all Endpoints in the EndpointSlice will have the same type of backend, and API server validation can enforce that.

It will also be easier for consumers to determine whether they can handle the backend type.
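
For illustration, the per-slice backend-type idea could look roughly like the sketch below; the names and values are assumptions, not the final API:

```go
package discovery

// AddressType declares what kind of targets every Endpoint in a slice carries,
// so consumers can check one field before interpreting targets. Values here
// are illustrative only.
type AddressType string

const (
	AddressTypeIP       AddressType = "IP"
	AddressTypeHostname AddressType = "Hostname"
	AddressTypeURL      AddressType = "URL"
)

type EndpointSlice struct {
	// AddressType applies to every Endpoint in this slice; apiserver
	// validation could enforce that all targets match it.
	AddressType AddressType
	Endpoints   []Endpoint
}

type Endpoint struct {
	// Targets must all be of the slice's AddressType.
	Targets []string
}
```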

@freehan freehan force-pushed the master branch 3 times, most recently from d937a95 to ee7a871 Compare June 28, 2019 20:26



```

### Mapping
Member:

How are endpoints mapped to slices?

Contributor Author:

It depends on the controller algorithm described below.

| # of total watch event | O(N) | O(N) | O(N) |
| | 5000 | 5000 | 5000 |
| Total Bytes Transmitted | O(PN) | O(BN) | O(N) |
| | ~2.0MB * 5000 = 10GB | ~10k * 5000 = 50MB | ~1KB * 5000 = ~5MB |
Member:

Not sure a single endpoint update is a good thing to think about, since in practice we should expect multiple endpoints to change within the time period in a huge service.

Member:

I don't agree with that. We want to keep the "batching period" fairly small (less than 1s). Given that, (obviously assuming there isn't a rollout or sth happening now), it's very rare to have changes of more than 1 pod within that period. So I actually think that it is a metric that we should be looking at.

Member:

Well, then it's not comparing apples to apples-- the current mechanism is like 10s worth of batching IIRC. So when the rate of pod changes is above .1 per second, the rate of updates is a constant .1 / second, and never goes higher.

There's some implicit assumption here about how frequently state-changing things happen to pods. I am not sure what a good assumption is. If we assume the "mean time to state change" is like 1 day (~86k seconds), then we'd expect a service with 20k pods to have about 1 change every 5 seconds, which implies two (small) updates in the new system vs one (big) update in the old system.

But the calculation is pretty sensitive to that time-- if my nodes are worse or my pods are buggier, and the MTtST is only 1 hour, then it's ~5.5 changes per second, and the old mechanism gives 1 big update every 10s (10GB total) and the new system gives ~55 (small) updates (50MB*55=2.7GB) in the same amount of time.

So, the new method is still much better on bandwidth, but it is by a factor of 4, not a factor of 200.

With watch events in that latter scenario, the new method causes 5000*55 = ~250k events, compared to 5k with the old method. So it is clearly much worse by this metric. I think at these numbers bandwidth is more important than events, but I'm not 100% sure.
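
The arithmetic behind that comparison, under the stated assumptions (20k endpoints, 5000 watching nodes, ~2.0MB Endpoints object vs ~10KB slice, mean time to state change of 1 hour), works out as in this small sketch:

```go
package main

import "fmt"

// Worked numbers for the scenario above; all inputs are the assumptions stated
// in the comment, not measurements.
func main() {
	const (
		endpoints   = 20000.0
		watchers    = 5000.0
		mttscSec    = 3600.0 // mean time to state change per pod: 1 hour
		windowSec   = 10.0   // comparison window
		oldObjBytes = 2.0e6  // whole Endpoints object
		newObjBytes = 10.0e3 // single EndpointSlice
	)

	changesPerSec := endpoints / mttscSec // ~5.5 changes/s
	updates := changesPerSec * windowSec  // ~55 slice updates in 10s

	oldBytes := oldObjBytes * watchers           // one big update: ~10GB
	newBytes := newObjBytes * watchers * updates // ~55 * 50MB ≈ 2.7GB
	fmt.Printf("changes/s=%.1f old=%.1fGB new=%.1fGB events(new)=%.0fk\n",
		changesPerSec, oldBytes/1e9, newBytes/1e9, watchers*updates/1e3)
}
```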

Contributor Author:

I think I understand what @lavalamp is saying.

Here is a worst case scenario: 10k endpoints, 1 big Endpoints object vs. 100 EndpointSlice Objects.
Within "batching period", let us say 100 endpoints got removed and they happen to distribute across all 100 EndpointSlice objects. The old API and the new API will perform the same in this case as all endpoints have to be redistributed.

For removal, we can not do much as the endpoints may be scattered.
For addition, the controller can try to pack the new endpoints together and add to one slice.
For noisy pods, the controller can try to concentrate them into a few EndpointSlice. (This seems complex)

Member:

@lavalamp - I think there is one false assumption in your thing which is: "the current mechanism is like 10s worth of batching".
This is what isn't true today. In EndpointsController (which we're talking about here) we batch "everything that happened until now" - so if pods are changing very fast we will batch some stuff, but it's more like 0.1s rather than 10s.

To clarify a bit what is happening in EndpointsController is we have a queue from which workers are taking services to process, and when processing a given service we list pods from cache and compute the Endpoints object. There is no explicit batching - batching is only a result of lag in controller.
The 10s batching you might have been thinking of is, I guess, the batching on the kube-proxy side to update iptables.

And I think that explains our disagreement about this point - because we have different assumptions.
Because what I'm saying is that it is rare (if we exclude creation or deletion of the whole service) to have things batched by the endpoints controller.

Member:

I must be misremembering something--I thought endpoints controller deliberately batched. IMO it should do that, it would be a massive bandwidth savings for high-churn services.

If it doesn't deliberately batch, it'll still be limited by how long it takes to write the endpoints object, but I agree that should be < 1 second even in pretty bad conditions.

Member:

> IMO it should do that, it would be a massive bandwidth savings for high-churn services.

There is a trade-off with how fast changes are reflected in the iptables (or whatever else is used) [e.g. it is on the path for autoscaling, where speed is something that customers are complaining about].

So in general, I agree we can be much smarter, but it's also not that obvious.

Conditions EndpointConditions
// Optional: Reference to object providing the endpoint.
TargetRef *v1.ObjectReference
}
Member:

I think an endpoint needs a counter or some other mechanism to enable moving it between slices (e.g., if it appears in two slices, the entry with the higher counter is more correct--the other one hasn't propagated yet).

Contributor Author:

I do not quite understand. If there are duplicates, the consumer will merge them, right?

Member:

If there are duplicates which disagree on some point, how does the consumer know which is correct?

Contributor Author:

I see. So we would need to add a generation to each endpoint. Correct?

Member:

Yeah, that's the first thing that comes to mind, but there might be other ways of solving the problem.

Member:

I guess there are other ways, but generation is a frequent enough concept in k8s that I would suggest using it rather than coming up with something new.
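
A consumer-side sketch of that idea follows; the per-endpoint Generation field is hypothetical (it is not part of the draft API in this KEP), and the point is only how a consumer could resolve an endpoint that appears in two slices:

```go
package main

import "fmt"

// endpointEntry is a simplified stand-in; Generation is a hypothetical
// per-endpoint counter used only to decide which duplicate is newer.
type endpointEntry struct {
	Target     string
	Ready      bool
	Generation int64
}

// mergeEndpoints de-dups endpoints across slices, keeping the entry with the
// higher generation when the same target appears more than once.
func mergeEndpoints(slices [][]endpointEntry) map[string]endpointEntry {
	merged := map[string]endpointEntry{}
	for _, slice := range slices {
		for _, ep := range slice {
			if cur, ok := merged[ep.Target]; !ok || ep.Generation > cur.Generation {
				merged[ep.Target] = ep
			}
		}
	}
	return merged
}

func main() {
	stale := []endpointEntry{{Target: "10.0.0.7", Ready: true, Generation: 3}}
	fresh := []endpointEntry{{Target: "10.0.0.7", Ready: false, Generation: 4}}
	fmt.Println(mergeEndpoints([][]endpointEntry{stale, fresh})["10.0.0.7"].Ready) // false
}
```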

Member:

Do we have resolution to this point? At the very least, we need to document what happens if 2 slices have the same endpoint. What should consumers do? De-dup? Consider it as "weight"? If those endpoints are in conflict (e.g. ready = true and false), what should they assume?

- Skip
- If Pod needs to be added:
  - If all EndpointSlice objects have reached the MaxEndpointThreshold, create a new EndpointSlice object.
  - Find the EndpointSlice with the lowest number of endpoints, and add it into it.
Member:

"lowest number of endpoints" gives you a redistribution problem below. You want to completely fill up some and let others empty out as much as possible, which will make them cheaper to redistribute. So, I think you want to add to the one with the highest number of endpoints less than the maximum.

Member:

+1

Contributor Author:

I see. Okay. I will adjust the algorithm here.
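
A minimal sketch of the suggested placement rule (add to the fullest slice that still has room; maxEndpoints stands in for MaxEndpointThreshold; illustrative only):

```go
package main

import "fmt"

// pickSlice returns the index of the fullest slice with spare capacity,
// or -1 if every slice is full and a new one should be created.
func pickSlice(sliceSizes []int, maxEndpoints int) int {
	best := -1
	for i, size := range sliceSizes {
		if size >= maxEndpoints {
			continue // full
		}
		if best == -1 || size > sliceSizes[best] {
			best = i
		}
	}
	return best
}

func main() {
	fmt.Println(pickSlice([]int{100, 37, 99}, 100)) // 2: fullest slice with room
	fmt.Println(pickSlice([]int{100, 100}, 100))    // -1: create a new slice
}
```

Filling the fullest slice with room lets sparsely populated slices drain, which keeps the compaction step discussed above cheap.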


| Total Bytes Transmitted | O(PN) | O(PN) | O(PN) |
| | 2.0MB * 5000 = 10GB | 10KB * 5000 * 200 = 10GB | ~10GB |

### Single Endpoint Update
Member:

See my last comment-- I think it would be better to compare not a single update, but rather a rate of updates, since that is what will happen in practice. In practice the rate will likely depend on the quality of the pods, multiplied by the size of the service--as services get bigger, holding the quality of the pods constant, the rate will go up.

Considering this, I think some entries that are currently labeled O(N) actually in practice are going to represent O(N * rate) = O(N^2) total load on the system; granted, the constant factor will be favorable.

Contributor Author (@freehan, Jul 2, 2019):

Under the same scenario (e.g. size of the service), can we assume the rate of change for both APIs is the same? Since we cannot determine the rate of changes on pods, both APIs will respond to them. The main difference is that the cost of each change is smaller.

Contributor Author:

Just read your response down below. I think I understand your point now. I will respond.

@vllry (Contributor) commented Jul 12, 2019

/cc

@k8s-ci-robot k8s-ci-robot requested a review from vllry July 12, 2019 04:56
@m1093782566 commented:

After a quick look, I think the benefit of this proposal to topology-aware service routing is that kube-proxy may avoid watching nodes and can watch EndpointSlice instead to get topology/label information. cc @imroc for a deeper look.

@freehan (Contributor Author) commented Jul 17, 2019

> After a quick look, I think the benefit of this proposal to topology-aware service routing is that kube-proxy may avoid watching nodes and can watch EndpointSlice instead to get topology/label information. cc @imroc for a deeper look.

Yes. That is the goal. It would be great if you could check in the API change on Service first. Then the EndpointSlice controller implementation can consume it once it is checked in.

@imroc (Contributor) commented Jul 22, 2019

> Would we pre-cook node labels into a new resource (in the same apigroup as EndpointSlice) and simply reference that from each Endpoint[] ?

Agree with this, I think this is a good idea: EndpointSlice holds an ObjectRef for each endpoint, pointing to a duck-typed "decorator" which is non-namespaced and auto-derived from Node.metadata (and updated when Nodes change).

cc @thockin @freehan

@freehan (Contributor Author) commented Jul 23, 2019

> Agree with this, I think this is a good idea: EndpointSlice holds an ObjectRef for each endpoint, pointing to a duck-typed "decorator" which is non-namespaced and auto-derived from Node.metadata (and updated when Nodes change).

The ObjectRef on each endpoint currently points to Pods if the EndpointSlice is derived from a Service. With this approach, kube-proxy still needs to watch extra objects.

@imroc (Contributor) commented Jul 24, 2019

> The ObjectRef on each endpoint currently points to Pods if the EndpointSlice is derived from a Service. With this approach, kube-proxy still needs to watch extra objects.

Yeah, it seems we should go back to the PodLocator API, or implement the metadata-only watch; then kube-proxy would just watch Node metadata for topology information.

@wojtek-t (Member) commented:

> Yeah, it seems we should go back to the PodLocator API, or implement the metadata-only watch; then kube-proxy would just watch Node metadata for topology information.

meta-only watch has been implemented in 1.15: kubernetes/kubernetes#71548
Basically what you need to do is watch nodes as PartialObjectMetadata

That said, I would also like to see an analysis of the performance impact of this - as nodes (even with the node lease feature) are still updated every 1m, so every node being watched by every node is a lot of data.
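
For reference, a rough sketch of such a metadata-only node watch using the modern client-go metadata client (the exact API differed slightly in the 1.15 timeframe); events carry only PartialObjectMetadata, i.e. labels and other metadata rather than full Node objects:

```go
package nodewatch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/metadata"
	"k8s.io/client-go/rest"
)

// watchNodeMetadata watches Nodes as PartialObjectMetadata and prints label
// changes; error handling and resync logic are omitted for brevity.
func watchNodeMetadata(ctx context.Context, cfg *rest.Config) error {
	client, err := metadata.NewForConfig(cfg)
	if err != nil {
		return err
	}
	nodes := schema.GroupVersionResource{Version: "v1", Resource: "nodes"}
	w, err := client.Resource(nodes).Watch(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	defer w.Stop()
	for ev := range w.ResultChan() {
		if m, ok := ev.Object.(*metav1.PartialObjectMetadata); ok {
			fmt.Println(ev.Type, m.Name, m.Labels["topology.kubernetes.io/zone"])
		}
	}
	return nil
}
```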

@m1093782566 commented Jul 24, 2019

Cool! Glad to know that!

@imroc (Contributor) commented Jul 24, 2019

> meta-only watch has been implemented in 1.15: kubernetes/kubernetes#71548
> Basically what you need to do is watch nodes as PartialObjectMetadata
>
> That said, I would also like to see an analysis of the performance impact of this - as nodes (even with the node lease feature) are still updated every 1m, so every node being watched by every node is a lot of data.

Cool! I'll take a deeper look at this feature.

@freehan freehan force-pushed the master branch 4 times, most recently from e29ffc5 to 5c8612b Compare July 26, 2019 00:36
@k8s-ci-robot k8s-ci-robot added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Jul 26, 2019
@thockin (Member) left a comment:

A bunch of API nits. This is not a full API review, and we can argue about names when that PR comes thru.

Merge and iterate

Thanks!

/lgtm
/approve

// +optional
metav1.ObjectMeta `json:"metadata,omitempty" protobuf:"bytes,1,opt,name=metadata"`
// Metadata of the Endpoints in the EndpointSlice.
EndpointMeta `json:",inline"`
Member:

Why is this an embedded sub-struct rather than inlined? AFAICT it is not reused anywhere?

The following new EndpointSlice API will be added to the `Discovery` API group.

```
type EndpointSlice struct {
Member:

I still wonder about names - is there a reason EndpointSet is less good?

Member:

EndpointShard?


type Endpoint struct {
// Required: Targets of the endpoint. Must contain at least one target.
// Different consumers (e.g. kube-proxy) handle different types of
Member:

Yesterday (IRL) we nixed the "type" field. Given that this is alpha, that's probably fine, but I was thinking about this more. If we really want to be future-compatible and we really think there's utility for other data types here, maybe we should bring back the type field as optional with a single value "IP" and a default value of "IP" when not specified.

Then we can say something like:

Consumers should always check the type field before making assumptions about the contents of this field.

// targets in the context of its own capabilities.
// +listType=set
Targets []string `json:"targets,omitempty" protobuf:"bytes,1,opt,name=targets"`
// The Hostname of this endpoint.
Member:

maybe more detail, e.g. "The stated hostname of this endpoint. This field may be used by consumers of endpoint to distinguish endpoints from each other (e.g. in DNS names). Multiple endpoints which use the same hostname should be considered fungible (e.g. multiple A values in DNS)."

@johnbelamaric I don't know if CoreDNS gets this aspect right (we probably did not spec it well). I know KubeDNS gets it wrong.

// endpoint.
// Key/value pairs contained in topology must conform with the label format.
// https://kubernetes.io/docs/concepts/overview/working-with-objects/labels
// Topology may contain the following well known keys:
Member:

"may include, but is not limited to"


// +optional
Hostname *string `json:"hostname,omitempty" protobuf:"bytes,2,opt,name=hostname"`
// Required: the conditions of the endpoint.
Conditions EndpointConditions `json:"conditions,omitempty" protobuf:"bytes,3,opt,name=conditions"`
Member:

We may want a better name, since this name has meaning. We can debate when the API review comes.

Topology map[string]string `json:"topology,omitempty" protobuf:"bytes,5,opt,name=topology"`
}

type EndpointConditions struct {
Member:

We should document something about the polarity or meaning of values here. E.g. are these always "good" things? I know we hypothesized other conditions - what do they have in common?

}

type EndpointMeta struct {
// This field specifies the list of ports associated with each endpoint in the EndpointSlice
Member:

...If this list is empty or unspecified, no ports or protocols are defined and the endpoints are effectively unreachable.

Do we want to allow that, or demand that at least one ports block, even if {}, be provided?

- "@bowei"
- "@thockin"
creation-date: 2019-06-01
last-updated: 2019-06-01
Member:

stale

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 26, 2019
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: freehan, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 26, 2019
@k8s-ci-robot k8s-ci-robot merged commit abdc87e into kubernetes:master Jul 26, 2019