
Commit

address KEP comments
freehan committed Jun 28, 2019
1 parent 9d10f8a commit ee7a871
Showing 1 changed file with 115 additions and 33 deletions.
148 changes: 115 additions & 33 deletions keps/sig-network/20190603-EndpointSlice-API.md
@@ -39,6 +39,7 @@ The new EndpointSlice API aims to address existing problems as well as leaving r
### Goal

- Support tens of thousands of backend endpoints in a single service on a cluster with thousands of nodes.
- Move the API towards a general-purpose backend discovery API.
- Leave room for foreseeable extension:
- Support multiple IPs per pod
- More endpoint states than Ready/NotReady
@@ -51,43 +52,43 @@ The new EndpointSlice API aims to address existing problems as well as leaving r
## Proposal

### EndpointSlice API
The following new EndpointSlice API will be added to the networking API group.
The following new EndpointSlice API will be added to the `Discovery` API group.

```
type EndpointSlice struct {
metav1.TypeMeta
// OwnerReferences should be set when the object is derived from a service.
metav1.ObjectMeta
Spec EndpointSliceSpec
}
type EndpointSliceSpec struct {
Endpoints []Endpoint
// Each EndpointPort must have a unique port name.
Ports []EndpointPort
}
type EndpointPort struct {
// The name of this port (corresponds to ServicePort.Name).
// Must be a DNS_LABEL.
// Optional only if one port is defined.
// Required: The name of this port.
// Must be a DNS_LABEL or an empty string.
Name string
// Required: The IP protocol for this port.
// Must be UDP, TCP, or SCTP.
// Default is TCP.
Protocol v1.Protocol
// Optional: The port number of the endpoint.
// If unspecified, port remapping is not implemented.
// If this is not specified, ports are not restricted and must be interpreted in the context of the specific consumer.
Port *int32
}
type Endpoint struct {
// Required: must contain at least one IP.
IPs []string
// Required: must contain at least one backend.
// This can be an IP, URL or hostname.
// Different consumers (e.g. kube-proxy) handle different types of backends in the context of their own capabilities.
Backends []string
// Optional: The Hostname of this endpoint
Hostname string
// Optional: Node hosting this endpoint. This can be used to determine endpoints local to a node.
NodeName *string
// Optional: the conditions of the endpoint
Condition EndpointConditions
Conditions EndpointConditions
// Optional: Reference to object providing the endpoint.
TargetRef *v1.ObjectReference
}
@@ -116,16 +117,16 @@ The endpoint port number becomes optional in the EndpointSlice API while the por
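To make the shape of the new API concrete, here is a minimal sketch of a single EndpointSlice built with the draft types above (post-review field names). The service name, label key, port, and addresses are illustrative only, and the standard `k8s.io/api/core/v1` and `metav1` imports are assumed:

```
// Illustrative only: one EndpointSlice as the controller might derive it from a
// hypothetical Service "my-service" with two backend pods on node-1.
func exampleSlice() EndpointSlice {
	port := int32(8080)
	node := "node-1"
	return EndpointSlice{
		ObjectMeta: metav1.ObjectMeta{
			// Naming and label conventions are described in the sections below.
			GenerateName: "my-service.",
			Labels:       map[string]string{"kubernetes.io/service": "my-service"},
		},
		Spec: EndpointSliceSpec{
			Ports: []EndpointPort{
				{Name: "http", Protocol: v1.ProtocolTCP, Port: &port},
			},
			Endpoints: []Endpoint{
				{Backends: []string{"10.1.2.3"}, NodeName: &node},
				{Backends: []string{"10.1.2.4"}, NodeName: &node},
			},
		},
	}
}
```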
### EndpointSlice Naming
Use generateName with the service name as prefix:
```
${service name}-${random}
${service name}.${random}
```

### Label
For all EndpointSlice objects managed by the EndpointSlice controller, the following label is added to identify the corresponding service:

- Key: k8s.io/service
- Key: kubernetes.io/service
- Value: ${service name}

For self managed EndpointSlice objects, this label is not required.
For EndpointSlice instances that are not derived from kubernetes Services, this label must not be applied.
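
Because every managed slice carries this label, the controller and other consumers can recover all slices belonging to a Service with a simple label match. A minimal sketch using the draft types above (the function and variable names are illustrative):

```
// Illustrative only: given all EndpointSlice objects in a namespace, return the
// ones derived from the named Service by matching the label described above.
func slicesForService(all []EndpointSlice, serviceName string) []EndpointSlice {
	var matched []EndpointSlice
	for _, s := range all {
		if s.Labels["kubernetes.io/service"] == serviceName {
			matched = append(matched, s)
		}
	}
	return matched
}
```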

## Estimation
This section provides comparisons between Endpoints API and EndpointSlice API under 3 scenarios:
@@ -138,8 +139,9 @@ This section provides comparisons between Endpoints API and EndpointSlice API un
Number of Backend Pods: P
Number of Nodes: N
Number of Endpoints Per EndpointSlice: B
Sample Case: 20,000 endpoints, 5,000 nodes
```

## Sample Case 1: 20,000 endpoints, 5,000 nodes

### Service Creation/Deletion

@@ -185,23 +187,87 @@
| | 5000 | 5000 | 5000 |
| # of total watch event | O(NP) | O(NP) | O(NP) |
| | 5000 * 20k | 5000 * 20k | 5000 * 20k |
| Total Bytes Transmitted | O(P^2N) | O(NPB) | O(NP) |
| | 2.0MB * 5000 * 20k = 200 TB | 10KB * 5000 * 20k = 1 TB | ~1KB * 5000 * 20k = ~100 GB |
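
Each metric in these tables is listed as its asymptotic form, with the sample-case numbers plugged in on the following row. For instance, the rolling-update byte totals above expand as follows (object sizes taken from the table itself):

```
Endpoints:                       ~2.0MB per object * 5,000 watchers * 20,000 updates ≈ 200 TB
100 endpoints per EndpointSlice:  ~10KB per object * 5,000 watchers * 20,000 updates ≈ 1 TB
1 endpoint per EndpointSlice:      ~1KB per object * 5,000 watchers * 20,000 updates ≈ 100 GB
```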


## Sample Case 2: 20 endpoints, 10 nodes

### Service Creation/Deletion

| | Endpoints | 100 Endpoints per EndpointSlice | 1 Endpoint per EndpointSlice |
|--------------------------|-----------------------|---------------------------------|------------------------------|
| # of writes | O(1) | O(P/B) | O(P) |
| | 1 | 1 | 20 |
| Size of API object | O(P) | O(B) | O(1) |
| | ~1KB | ~1KB | ~1KB |
| # of watchers per object | O(N) | O(N) | O(N) |
| | 10 | 10 | 10 |
| # of total watch event | O(N) | O(NP/B) | O(NP) |
| | 1 * 10 = 10 | 1 * 10 = 10 | 10 * 20 = 200 |
| Total Bytes Transmitted | O(PN) | O(PN) | O(PN) |
| | ~1KB * 10 = 10KB | ~1KB * 10 = 10KB | ~1KB * 200 = 200KB |

### Single Endpoint Update

| | Endpoints | 100 Endpoints per EndpointSlice | 1 Endpoint per EndpointSlice |
|--------------------------|-----------------------|---------------------------------|------------------------------|
| # of writes | O(1) | O(1) | O(1) |
| | 1 | 1 | 1 |
| Size of API object | O(P) | O(B) | O(1) |
| | ~1KB | ~1KB | ~1KB |
| # of watchers per object | O(N) | O(N) | O(N) |
| | 10 | 10 | 10 |
| # of total watch event | O(N) | O(N) | O(N) |
| | 10 | 10 | 10 |
| Total Bytes Transmitted | O(PN) | O(BN) | O(N) |
| | ~1KB * 10 = 10KB | ~1KB * 10 = 10KB | ~1KB * 10 = 10KB |


### Rolling Update

| | Endpoints | 100 Endpoints per EndpointSlice | 1 Endpoint per EndpointSlice |
|--------------------------|-----------------------------|---------------------------------|------------------------------|
| # of writes | O(P) | O(P) | O(P) |
| | 20 | 20 | 20 |
| Size of API object | O(P) | O(B) | O(1) |
| | ~1KB | ~1KB | ~1KB |
| # of watchers per object | O(N) | O(N) | O(N) |
| | 10 | 10 | 10 |
| # of total watch event | O(NP) | O(NP) | O(NP) |
| | 10 * 20 | 10 * 20 | 10 * 20 |
| Total Bytes Transmitted | O(P^2N) | O(NPB) | O(NP) |
| | ~1KB * 10 * 20 = 200KB | ~1KB * 10 * 20 = 200KB | ~1KB * 10 * 20 = 200KB |


## Implementation

### Requirements

- Persistence (Minimal Churn of Endpoints)

Upon service endpoint changes, the # of object writes and disruption to ongoing connections should be minimal.

- Handling Restarts & Failures

Producers and consumers of EndpointSlice must be able to handle restarts and recreate state from scratch with minimal changes to existing state.


### EndpointSlice Controller

A new EndpointSlice Controller will be added to `kube-controller-manager`. It will manage the lifecycle of EndpointSlice instances derived from services.
```
Watch: Service, Pod ==> Manage: EndpointSlice
```

#### Workflows
On Service Create/Update/Delete:
- `syncService(svc)`

On Pod Create/Update/Delete:
- Reverse lookup relevant services
- For each relevant service,
- `syncService(svc)`


`syncService(svc)`:
- Look up selected backend pods
@@ -220,16 +286,19 @@
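
A minimal sketch of the batching step inside `syncService`, assuming the controller packs at most 100 endpoints into each slice (the threshold constant and helper name are illustrative; the remaining reconciliation steps are elided in this diff):

```
// Illustrative only: split the desired endpoints for a service into groups of
// at most maxEndpointsPerSlice, one group per EndpointSlice object.
const maxEndpointsPerSlice = 100

func chunkEndpoints(endpoints []Endpoint) [][]Endpoint {
	var chunks [][]Endpoint
	for len(endpoints) > maxEndpointsPerSlice {
		chunks = append(chunks, endpoints[:maxEndpointsPerSlice])
		endpoints = endpoints[maxEndpointsPerSlice:]
	}
	if len(endpoints) > 0 {
		chunks = append(chunks, endpoints)
	}
	return chunks
}
```

The controller would then compare these chunks against the EndpointSlices it already owns for the service (found via the `kubernetes.io/service` label) and issue the minimal set of create/update/delete calls, in line with the persistence requirement above.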

### Kube-Proxy

Watch: Service, EndpointSlice ==> Manage: iptables, ipvs, etc
Kube-proxy will be modified to consume EndpointSlice instances in addition to the Endpoints resource. A flag will be added to kube-proxy to toggle between the two modes.

```
Watch: Service, EndpointSlice ==> Manage: iptables, ipvs, etc
```
- Merge multiple EndpointSlices into an aggregated endpoint list (sketched below).
- Reuse the existing processing logic.
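
A hedged sketch of the merge step, using the draft types above; the keying by port name and the flat string representation are assumptions for illustration, not the final proxier data structures:

```
// Illustrative only: flatten every EndpointSlice of one service into a single
// list of backend addresses per port name, which the existing proxier logic
// can then consume as before.
func mergeSlices(slices []EndpointSlice) map[string][]string {
	merged := map[string][]string{} // port name -> backend addresses
	for _, s := range slices {
		for _, p := range s.Spec.Ports {
			for _, ep := range s.Spec.Endpoints {
				merged[p.Name] = append(merged[p.Name], ep.Backends...)
			}
		}
	}
	return merged
}
```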

### Endpoint Controller (classic)
In order to ensure backward compatibility for external consumers of the core/v1 Endpoints API, the existing K8s endpoint controller will keep running until the API is EOL. The following limitations will apply:

- Starting from EndpointSlice beta: If the # of endpoints in one Endpoints object exceeds 100, generate a warning event on the object.
- Starting from EndpointSlice GA: Only include up to 500 endpoints in one Endpoints object.
- Starting from EndpointSlice beta: If the # of endpoints in one Endpoints object exceeds 500, generate a warning event on the object.
- Starting from EndpointSlice GA: Only include up to 1000 endpoints in one Endpoints object (see the sketch below).
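
A minimal sketch of how the classic controller could enforce these limits on a core/v1 Endpoints object. The thresholds follow the updated numbers above; counting only ready addresses and the exact truncation strategy are assumptions:

```
// Illustrative only: count ready addresses across all subsets and truncate once
// the GA cap is reached. A caller would emit a warning event on the object when
// total exceeds warnThreshold (beta behavior).
const (
	warnThreshold = 500  // beta: warn above this many endpoints
	maxEndpoints  = 1000 // GA: never write more than this many endpoints
)

func capEndpoints(ep *v1.Endpoints) (total int, truncated bool) {
	remaining := maxEndpoints
	for i := range ep.Subsets {
		addrs := ep.Subsets[i].Addresses
		total += len(addrs)
		if len(addrs) > remaining {
			ep.Subsets[i].Addresses = addrs[:remaining]
			truncated = true
			remaining = 0
			continue
		}
		remaining -= len(addrs)
	}
	return total, truncated
}
```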

## Roll Out Plan

@@ -241,30 +310,43 @@ In order to ensure backward compatibility for external consumer of the core/v1 E




## Graduation Criteria

In order to graduate to beta, we need:

- Kube-proxy switches to consume the EndpointSlice API.
- Verify performance/scalability via testing.

## Alternatives

1. increase the etcd size limits
2. endpoints controller batches / rate limits changes
3. apiserver batches / rate-limits watch notifications
4. apimachinery to support object level pagination


## FAQ

- Why only include up to 100 endpoints in one EndpointSlice object? Why not 1 endpoint? Why not 1000 endpoints?
- #### Why not pursue the alternatives?

Based on the data collected from user clusters, the vast majority (> 99%) of k8s services have fewer than 100 endpoints. For small services, the EndpointSlice API will make no difference. If the MaxEndpointThreshold is too small (e.g. 1 endpoint per EndpointSlice), the controller loses the ability to batch updates, causing worse write amplification on service creation/deletion and scale up/down. Etcd write RPS is a significant limiting factor.
In order to fulfill the goal of this proposal without redesigning the Core/V1 Endpoints API, all items listed in the alternatives section are required. Item #1 increases the maximum number of endpoints by raising the object size limit, which may bring other performance/scalability implications. Items #2 and #3 can reduce transmission overhead but sacrifice endpoint update latency. Item #4 can further reduce transmission overhead; however, it is a big change to the existing API machinery.

- Why do we have a status struct for each endpoint? Why not boolean state for readiness?
In summary, each of these items achieves only an incremental gain. Compared to this proposal, the combined effort would be equal or greater while achieving a smaller performance improvement.

The current Endpoints API only includes a boolean state (Ready vs. NotReady) on each individual endpoint. However, according to the pod lifecycle, there are more states (e.g. Graceful Termination, ContainerReady). In order to represent additional states other than Ready/NotReady, a status structure is included for each endpoint. More condition types can be added in the future without compatibility disruptions. As more conditions are added, different consumers (e.g. different kube-proxy implementations) will have the option to evaluate the additional conditions.
In addition, the EndpointSlice API is capable of expressing endpoint subsetting, which is the natural next step for improving k8s service endpoint scalability.

- #### Why only include up to 100 endpoints in one EndpointSlice object? Why not 1 endpoint? Why not 1000 endpoints?

## Graduation Criteria
Based on the data collected from user clusters, the vast majority (> 99%) of k8s services have fewer than 100 endpoints. For small services, the EndpointSlice API will make no difference. If the MaxEndpointThreshold is too small (e.g. 1 endpoint per EndpointSlice), the controller loses the ability to batch updates, causing worse write amplification on service creation/deletion and scale up/down. Etcd write RPS is a significant limiting factor.

In order to graduate to beta, we need:
- #### Why do we have a condition struct for each endpoint?

- Kube-proxy switch to consume EndpointSlice API.
- Verify performance/scalability via testing.
The current Endpoints API only includes a boolean state (Ready vs. NotReady) on each individual endpoint. However, according to the pod lifecycle, there are more states (e.g. Graceful Termination, ContainerReady). In order to represent additional states other than Ready/NotReady, a status structure is included for each endpoint. More condition types can be added in the future without compatibility disruptions. As more conditions are added, different consumers (e.g. different kube-proxy implementations) will have the option to evaluate the additional conditions.

## Alternatives

- increase the etcd size limits
- endpoints controller batches / rate limits changes
- apiserver batches / rate-limits watch notifications
- apimachinery to support object level pagination




[original-doc]: https://docs.google.com/document/d/1sLJfolOeEVzK5oOviRmtHOHmke8qtteljQPaDUEukxY/edit#
