diff --git a/keps/sig-network/20190603-EndpointSlice-API.md b/keps/sig-network/20190603-EndpointSlice-API.md
new file mode 100644
index 000000000000..ce1a694fa709
--- /dev/null
+++ b/keps/sig-network/20190603-EndpointSlice-API.md
@@ -0,0 +1,262 @@
+---
+title: EndpointSlice API
+authors:
+  - "@freehan"
+owning-sig: sig-network
+reviewers:
+  - "@bowei"
+  - "@thockin"
+  - "@wojtek-t"
+  - "@johnbelamaric"
+approvers:
+  - "@bowei"
+  - "@thockin"
+creation-date: 2019-06-01
+last-updated: 2019-06-01
+status: implementable
+see-also:
+  - "https://docs.google.com/document/d/1sLJfolOeEVzK5oOviRmtHOHmke8qtteljQPaDUEukxY/edit#"
+---
+# EndpointSlice API
+
+## Summary
+
+This KEP was converted from the [original proposal doc][original-doc]. The current [Core/V1 Endpoints API][v1-endpoints-api] comes with severe performance/scalability drawbacks affecting multiple components in the control plane (apiserver, etcd, endpoints-controller, kube-proxy).
+This doc proposes a new EndpointSlice API that aims to replace the Core/V1 Endpoints API for most internal consumers, including kube-proxy.
+The new EndpointSlice API aims to address existing problems and leave room for future extension.
+
+
+## Motivation
+
+In the current Endpoints API, one object instance contains all the individual endpoints of a service. Whenever a single pod in a service is added/updated/deleted, the whole Endpoints object is re-computed, written to storage (etcd) and sent to all watchers (e.g. kube-proxy), even when the other endpoints didn't change. This leads to two major problems:
+
+- Storing multiple megabytes of endpoints puts strain on multiple parts of the system, because there is no paging and the watch/storage design is monolithic. [The max number of endpoints is bounded by the K8s storage layer (etcd)][max-object-size], which has a hard limit on the size of a single object (1.5MB by default). That means attempts to write an object larger than the limit will be rejected. Additionally, there is a similar limitation in the watch path in the Kubernetes apiserver. For a K8s service, if its Endpoints object is too large, endpoint updates will not be propagated to the kube-proxy instances, and thus iptables/ipvs will not be reprogrammed.
+- [Performance degradation in large k8s deployments.][perf-degrade] Not being able to efficiently read/update individual endpoint changes can lead to endpoints operations that are quadratic in the number of endpoints (e.g. during a rolling upgrade of a service). If one also considers watches (there's one from each kube-proxy), the situation becomes even worse, as the quadratic traffic gets multiplied further by the number of watches (usually equal to the number of nodes in the cluster).
+
+The new EndpointSlice API aims to address existing problems and leave room for future extension.
+
+
+### Goal
+
+- Support tens of thousands of backend endpoints in a single service on a cluster with thousands of nodes.
+- Leave room for foreseeable extension:
+  - Support multiple IPs per pod
+  - More endpoint states than Ready/NotReady
+  - Dynamic endpoint subsetting
+
+### Non-Goal
+- Change functionality provided by the K8s V1 Service API.
+- Provide better load balancing for K8s service backends.
+
+## Proposal
+
+### EndpointSlice API
+The following new EndpointSlice API will be added to the networking API group.
+
+```
+type EndpointSlice struct {
+	metav1.TypeMeta
+	metav1.ObjectMeta
+	Spec EndpointSliceSpec
+}
+
+type EndpointSliceSpec struct {
+	Endpoints []Endpoint
+	Ports     []EndpointPort
+}
+
+type EndpointPort struct {
+	// Optional
+	Name string
+	// Required
+	Protocol Protocol
+	// Optional: If unspecified, port remapping is not implemented
+	Port *int32
+}
+
+type Endpoint struct {
+	// Required: must contain at least one IP.
+	IPs []net.IP
+	// Optional
+	Hostname string
+	// Optional
+	NodeName *string
+	// Optional
+	Condition EndpointCondition
+	// Optional
+	TargetRef *ObjectReference
+}
+
+type EndpointCondition struct {
+	// Matches the Ready condition on pod
+	Ready bool
+	// Matches ContainersReady condition on pod
+	ContainersReady bool
+}
+
+```
+
+### Mapping
+- 1 Service maps to N EndpointSlice objects.
+- Each EndpointSlice contains at most 100 endpoints by default (MaxEndpointThreshold: configurable via controller flag).
+- For backend pods with non-uniform named ports (e.g. a service port targets a named port, and backend pods map that port name to different port numbers), the number of EndpointSlice objects grows with the number of backend groups that share the same ports.
+- EndpointSlice will be covered by resource quota. This limits the max number of EndpointSlice objects in one namespace and protects the k8s apiserver. For instance, a malicious user would not be able to DoS the k8s API by creating services that select all pods.
+
+### EndpointSlice Naming
+Use generateName with the service name as prefix:
+```
+${service name}-${random}
+```
+
+### Label
+For all EndpointSlice objects managed by the EndpointSlice controller, the following label is added to identify the corresponding service:
+
+- Key: k8s.io/service
+- Value: ${service name}
+
+For self-managed EndpointSlice objects, this label is not required.
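The Mapping, EndpointSlice Naming and Label rules above can be illustrated with a minimal Go sketch. It is not the controller implementation: the `Endpoint` and `EndpointSlice` types are simplified, hypothetical stand-ins for the proposed API, and `sliceEndpoints` and `serviceLabelKey` are names invented for this example. It only shows how endpoints of one service might be packed into objects of at most `MaxEndpointThreshold` entries, each carrying the `k8s.io/service` label and a `${service name}-` name prefix.

```go
package main

import "fmt"

// Endpoint and EndpointSlice are hypothetical, simplified stand-ins for the
// proposed API types; the real types carry more fields.
type Endpoint struct {
	IPs []string
}

type EndpointSlice struct {
	GenerateName string            // name prefix "<service name>-"; a random suffix is appended on creation
	Labels       map[string]string // carries the k8s.io/service label
	Endpoints    []Endpoint
}

// serviceLabelKey is the label key described in the Label section.
const serviceLabelKey = "k8s.io/service"

// sliceEndpoints packs the endpoints of one service into EndpointSlice
// objects holding at most maxPerSlice endpoints each.
func sliceEndpoints(service string, endpoints []Endpoint, maxPerSlice int) []EndpointSlice {
	if maxPerSlice <= 0 {
		return nil
	}
	var slices []EndpointSlice
	for start := 0; start < len(endpoints); start += maxPerSlice {
		end := start + maxPerSlice
		if end > len(endpoints) {
			end = len(endpoints)
		}
		slices = append(slices, EndpointSlice{
			GenerateName: service + "-",
			Labels:       map[string]string{serviceLabelKey: service},
			Endpoints:    endpoints[start:end],
		})
	}
	return slices
}

func main() {
	// 250 endpoints with made-up IPs, packed with the default threshold of 100.
	endpoints := make([]Endpoint, 250)
	for i := range endpoints {
		endpoints[i] = Endpoint{IPs: []string{fmt.Sprintf("10.0.%d.%d", i/256, i%256)}}
	}
	for _, s := range sliceEndpoints("web", endpoints, 100) {
		fmt.Printf("%s<random> labels=%v endpoints=%d\n", s.GenerateName, s.Labels, len(s.Endpoints))
	}
}
```

With 250 endpoints and a threshold of 100, the sketch yields three objects holding 100, 100 and 50 endpoints. A consumer could then find all slices of a service by selecting on the `k8s.io/service` label.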
+
+## Estimation
+This section compares the Endpoints API and the EndpointSlice API under three scenarios:
+- Service Creation/Deletion
+- Single Endpoint Update
+- Rolling Update
+
+
+```
+Number of Backend Pods: P
+Number of Nodes: N
+Number of Endpoints Per EndpointSlice: B
+Sample Case: 20,000 endpoints, 5,000 nodes
+```
+
+### Service Creation/Deletion
+
+
+|                          | Endpoints             | 100 Endpoints per EndpointSlice | 1 Endpoint per EndpointSlice |
+|--------------------------|-----------------------|---------------------------------|------------------------------|
+| # of writes              | O(1)                  | O(P/B)                          | O(P)                         |
+|                          | 1                     | 200                             | 20000                        |
+| Size of API object       | O(P)                  | O(B)                            | O(1)                         |
+|                          | 20k * const = ~2.0 MB | 100 * const = ~10 KB            | < ~1KB                       |
+| # of watchers per object | O(N)                  | O(N)                            | O(N)                         |
+|                          | 5000                  | 5000                            | 5000                         |
+| # of total watch events  | O(N)                  | O(NP/B)                         | O(NP)                        |
+|                          | 5000                  | 5000 * 200 = 1,000,000          | 5000 * 20000 = 100,000,000   |
+| Total Bytes Transmitted  | O(PN)                 | O(PN)                           | O(PN)                        |
+|                          | 2.0MB * 5000 = 10GB   | 10KB * 5000 * 200 = 10GB        | ~10GB                        |
+
+### Single Endpoint Update
+
+|                          | Endpoints             | 100 Endpoints per EndpointSlice | 1 Endpoint per EndpointSlice |
+|--------------------------|-----------------------|---------------------------------|------------------------------|
+| # of writes              | O(1)                  | O(1)                            | O(1)                         |
+|                          | 1                     | 1                               | 1                            |
+| Size of API object       | O(P)                  | O(B)                            | O(1)                         |
+|                          | 20k * const = ~2.0 MB | 100 * const = ~10 KB            | < ~1KB                       |
+| # of watchers per object | O(N)                  | O(N)                            | O(N)                         |
+|                          | 5000                  | 5000                            | 5000                         |
+| # of total watch events  | O(N)                  | O(N)                            | O(N)                         |
+|                          | 5000                  | 5000                            | 5000                         |
+| Total Bytes Transmitted  | O(PN)                 | O(BN)                           | O(N)                         |
+|                          | ~2.0MB * 5000 = 10GB  | ~10KB * 5000 = 50MB             | ~1KB * 5000 = ~5MB           |
+
+
+### Rolling Update
+
+|                          | Endpoints                   | 100 Endpoints per EndpointSlice | 1 Endpoint per EndpointSlice |
+|--------------------------|-----------------------------|---------------------------------|------------------------------|
+| # of writes              | O(P)                        | O(P)                            | O(P)                         |
+|                          | 20k                         | 20k                             | 20k                          |
+| Size of API object       | O(P)                        | O(B)                            | O(1)                         |
+|                          | 20k * const = ~2.0 MB       | 100 * const = ~10 KB            | < ~1KB                       |
+| # of watchers per object | O(N)                        | O(N)                            | O(N)                         |
+|                          | 5000                        | 5000                            | 5000                         |
+| # of total watch events  | O(NP)                       | O(NP)                           | O(NP)                        |
+|                          | 5000 * 20k                  | 5000 * 20k                      | 5000 * 20k                   |
+| Total Bytes Transmitted  | O(P^2N)                     | O(NPB)                          | O(NP)                        |
+|                          | 2.0MB * 5000 * 20k = 200 TB | 10KB * 5000 * 20k = 1 TB        | ~1KB * 5000 * 20k = ~100 GB  |
+
+
+## Implementation
+### EndpointSlice Controller
+
+Watch: Service, Pod ==> Manage: EndpointSlice
+
+On Service Create/Update/Delete:
+- `syncService(svc)`
+
+On Pod Create/Update/Delete:
+- Reverse lookup relevant services
+- For each relevant service,
+  - `syncService(svc)`
+
+
+`syncService(svc)`:
+- Look up selected backend pods
+- For each pod
+  - If the pod is already in an EndpointSlice and its status is correct:
+    - Skip
+  - If the pod needs to be added:
+    - If all EndpointSlice objects have reached the MaxEndpointThreshold, create a new EndpointSlice object.
+    - Find the EndpointSlice with the fewest endpoints and add the pod's endpoint to it.
+  - If the pod needs to be updated:
+    - Update the corresponding EndpointSlice
+  - If the pod needs to be removed:
+    - Remove the endpoint from its EndpointSlice.
+    - If the number of endpoints in the EndpointSlice is less than ¼ of the threshold:
+      - Try to pack the endpoints into a smaller number of EndpointSlice objects.
+
+### Kube-Proxy
+
+Watch: Service, EndpointSlice ==> Manage: iptables, ipvs, etc
+
+- Merge multiple EndpointSlice objects into an aggregated list.
+- Reuse the existing processing logic
+
+### Endpoint Controller (classic)
+In order to ensure backward compatibility for external consumers of the core/v1 Endpoints API, the existing K8s endpoint controller will keep running until the API is EOL. The following limitations will apply:
+
+- Starting from EndpointSlice beta: If the number of endpoints in one Endpoints object exceeds 100, generate a warning event on the object.
+- Starting from EndpointSlice GA: Only include up to 500 endpoints in one Endpoints object.
+
+## Roll Out Plan
+
+| K8s Version | State | OSS Controllers                                                       | Internal Consumer (Kube-proxy) |
+|-------------|-------|-----------------------------------------------------------------------|--------------------------------|
+| 1.16        | Alpha | EndpointSliceController (Alpha), EndpointController (GA)              | Endpoints                      |
+| 1.17        | Beta  | EndpointSliceController (Beta), EndpointController (GA with warning)  | EndpointSlice                  |
+| 1.18        | GA    | EndpointSliceController (GA), EndpointController (GA with limitation) | EndpointSlice                  |
+
+
+
+## FAQ
+
+- Why only include up to 100 endpoints in one EndpointSlice object? Why not 1 endpoint? Why not 1000 endpoints?
+
+Based on the data collected from user clusters, the vast majority (> 99%) of k8s services have fewer than 100 endpoints. For small services, the EndpointSlice API will make no difference. If the MaxEndpointThreshold is too small (e.g. 1 endpoint per EndpointSlice), the controller loses the ability to batch updates, which causes worse write amplification on service creation/deletion and scale up/down. Etcd write RPS is a significant limiting factor.
+
+- Why do we have a status struct for each endpoint? Why not a boolean state for readiness?
+
+The current Endpoints API only includes a boolean state (Ready vs. NotReady) on each individual endpoint. However, according to the pod life cycle, there are more states (e.g. Graceful Termination, ContainersReady). In order to represent additional states other than Ready/NotReady, a status structure is included for each endpoint. More condition types can be added in the future without compatibility disruptions. As more conditions are added, different consumers (e.g. different kube-proxy implementations) will have the option to evaluate the additional conditions.
+
+
+## Graduation Criteria
+
+In order to graduate to beta, we need:
+
+- Kube-proxy switches to consuming the EndpointSlice API.
+- Verify performance/scalability via testing.
+
+## Alternatives
+
+- increase the etcd size limits
+- endpoints controller batches / rate limits changes
+- apiserver batches / rate-limits watch notifications
+- apimachinery to support object level pagination
+
+
+[original-doc]: https://docs.google.com/document/d/1sLJfolOeEVzK5oOviRmtHOHmke8qtteljQPaDUEukxY/edit#
+[v1-endpoints-api]: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.14/#endpoints-v1-core
+[max-object-size]: https://github.com/kubernetes/kubernetes/issues/73324
+[perf-degrade]: https://github.com/kubernetes/community/blob/master/sig-scalability/blogs/k8s-services-scalability-issues.md#endpoints-traffic-is-quadratic-in-the-number-of-endpoints
\ No newline at end of file