Favor EndpointSlice over Endpoints
Document EndpointSlice as the preferred and most appropriate mechanism
to record the backing endpoints of a Service.

Co-authored-by: Rob Scott <[email protected]>
Co-authored-by: Shannon Kularathna <[email protected]>
3 people committed Sep 14, 2022
1 parent 77b688c commit b12d675
Showing 12 changed files with 172 additions and 113 deletions.
4 changes: 2 additions & 2 deletions content/en/docs/concepts/architecture/cloud-controller.md
@@ -107,11 +107,11 @@ routes appropriately. It requires Get access to Node objects.

### Service controller {#authorization-service-controller}

The service controller listens to Service object Create, Update and Delete events and then configures Endpoints for those Services appropriately.
The service controller listens to Service object Create, Update and Delete events and then configures EndpointSlices (and Endpoints) for those Services appropriately.

To access Services, it requires List, and Watch access. To update Services, it requires Patch and Update access.

To set up Endpoints resources for the Services, it requires access to Create, List, Get, Watch, and Update.
To set up EndpointSlice / Endpoints resources for the Services, it requires access to Create, List, Get, Watch, and Update.

`v1/Service`:

4 changes: 2 additions & 2 deletions content/en/docs/concepts/overview/components.md
@@ -53,8 +53,8 @@ Some types of these controllers are:
* Node controller: Responsible for noticing and responding when nodes go down.
* Job controller: Watches for Job objects that represent one-off tasks, then creates
Pods to run those tasks to completion.
* Endpoints controller: Populates the Endpoints object (that is, joins Services & Pods).
* Service Account & Token controllers: Create default accounts and API access tokens for new namespaces.
* EndpointSlice controller: Populates EndpointSlice objects (to provide a link between Services and Pods).
* ServiceAccount controller: Creates default ServiceAccounts for new namespaces.

### cloud-controller-manager

@@ -92,10 +92,14 @@ my-nginx ClusterIP 10.0.162.149 <none> 80/TCP 21s
```
As mentioned previously, a Service is backed by a group of Pods. These Pods are
exposed through `endpoints`. The Service's selector will be evaluated continuously
and the results will be POSTed to an Endpoints object also named `my-nginx`.
When a Pod dies, it is automatically removed from the endpoints, and new Pods
matching the Service's selector will automatically get added to the endpoints.
exposed through
{{<glossary_tooltip term_id="endpoint-slice" text="EndpointSlices">}}.
The Service's selector will be evaluated continuously and the results will be POSTed
to an EndpointSlice that is connected to the Service using
{{< glossary_tooltip text="labels" term_id="label" >}}.
When a Pod dies, it is automatically removed from the EndpointSlices that contain it
as an endpoint. New Pods that match the Service's selector will automatically get added
to an EndpointSlice for that Service.
Check the endpoints, and note that the IPs are the same as the Pods created in
the first step:
@@ -116,11 +120,11 @@ Session Affinity: None
Events: <none>
```
```shell
kubectl get ep my-nginx
kubectl get endpointslices -l kubernetes.io/service-name=my-nginx
```
```
NAME ENDPOINTS AGE
my-nginx 10.244.2.5:80,10.244.3.4:80 1m
NAME ADDRESSTYPE PORTS ENDPOINTS AGE
my-nginx-7vzhx IPv4 80 10.244.2.5,10.244.3.4 21s
```
You should now be able to curl the nginx Service on `<CLUSTER-IP>:<PORT>` from
@@ -183,8 +183,8 @@ the same namespace, the Pod will see its own FQDN as
A or AAAA record at that name, pointing to the Pod's IP. Both Pods "`busybox1`" and
"`busybox2`" can have their distinct A or AAAA records.

The Endpoints object can specify the `hostname` for any endpoint addresses,
along with its IP.
An {{<glossary_tooltip term_id="endpoint-slice" text="EndpointSlice">}} can specify
the DNS hostname for any endpoint addresses, along with its IP.

{{< note >}}
Because A or AAAA records are not created for Pod names, `hostname` is required for the Pod's A or AAAA
71 changes: 45 additions & 26 deletions content/en/docs/concepts/services-networking/endpoint-slices.md
@@ -19,24 +19,7 @@ Endpoints.

<!-- body -->

## Motivation

The Endpoints API has provided a simple and straightforward way of
tracking network endpoints in Kubernetes. Unfortunately as Kubernetes clusters
and {{< glossary_tooltip text="Services" term_id="service" >}} have grown to handle and
send more traffic to more backend Pods, limitations of that original API became
more visible.
Most notably, those included challenges with scaling to larger numbers of
network endpoints.

Since all network endpoints for a Service were stored in a single Endpoints
resource, those resources could get quite large. That affected the performance
of Kubernetes components (notably the master control plane) and resulted in
significant amounts of network traffic and processing when Endpoints changed.
EndpointSlices help you mitigate those issues as well as provide an extensible
platform for additional features such as topological routing.

## EndpointSlice resources {#endpointslice-resource}
## EndpointSlice API {#endpointslice-resource}

In Kubernetes, an EndpointSlice contains references to a set of network
endpoints. The control plane automatically creates EndpointSlices
@@ -48,7 +31,7 @@ Service name.
The name of an EndpointSlice object must be a valid
[DNS subdomain name](/docs/concepts/overview/working-with-objects/names#dns-subdomain-names).

As an example, here's a sample EndpointSlice resource for the `example`
As an example, here's a sample EndpointSlice object that's owned by the `example`
Kubernetes Service.

```yaml
@@ -81,8 +64,7 @@ flag, up to a maximum of 1000.

EndpointSlices can act as the source of truth for
{{< glossary_tooltip term_id="kube-proxy" text="kube-proxy" >}} when it comes to
how to route internal traffic. When enabled, they should provide a performance
improvement for services with large numbers of endpoints.
how to route internal traffic.

### Address types

@@ -92,6 +74,10 @@ EndpointSlices support three address types:
* IPv6
* FQDN (Fully Qualified Domain Name)

Each `EndpointSlice` object represents a specific IP address type. If you have
a Service that is available via IPv4 and IPv6, there will be at least two
`EndpointSlice` objects (one for IPv4, and one for IPv6).
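
To illustrate why a dual-stack Service ends up with at least two EndpointSlices, here is a minimal Python sketch that groups endpoint addresses by the address type a slice would need. The function name `group_by_address_type` is hypothetical, and treating every non-IP string as an FQDN is an assumption made for illustration; this is not how the control plane is implemented.

```python
import ipaddress
from collections import defaultdict


def group_by_address_type(addresses):
    """Group addresses by the EndpointSlice addressType they would belong to."""
    groups = defaultdict(list)
    for addr in addresses:
        try:
            version = ipaddress.ip_address(addr).version
        except ValueError:
            # Assumption for this sketch: anything that is not an IP literal
            # is treated as a fully qualified domain name.
            groups["FQDN"].append(addr)
            continue
        groups["IPv4" if version == 4 else "IPv6"].append(addr)
    return dict(groups)


# A dual-stack backend set needs one group (one slice, at minimum) per family.
print(group_by_address_type(["10.244.2.5", "fd00::5", "10.244.3.4"]))
```

Each key in the result corresponds to a separate `addressType`, so no single EndpointSlice could hold all three addresses from the example.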

### Conditions

The EndpointSlice API stores conditions about endpoints that may be useful for consumers.
@@ -241,11 +227,44 @@ getting replaced.

Due to the nature of EndpointSlice changes, endpoints may be represented in more
than one EndpointSlice at the same time. This naturally occurs as changes to
different EndpointSlice objects can arrive at the Kubernetes client watch/cache
at different times. Implementations using EndpointSlice must be able to have the
endpoint appear in more than one slice. A reference implementation of how to
perform endpoint deduplication can be found in the `EndpointSliceCache`
implementation in `kube-proxy`.
different EndpointSlice objects can arrive at the Kubernetes client watch / cache
at different times.

{{< note >}}
Clients of the EndpointSlice API must be able to handle the situation where
a particular endpoint address appears in more than one slice.

You can find a reference implementation for how to perform this endpoint deduplication
as part of the `EndpointSliceCache` code within `kube-proxy`.
{{< /note >}}
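
As a rough illustration of the deduplication a client needs, the following Python sketch merges the addresses from several EndpointSlice-shaped dictionaries and keeps each address once. It is a simplified stand-in, not the `EndpointSliceCache` logic from `kube-proxy`; the function name `unique_endpoints` and the sample slice names are hypothetical.

```python
def unique_endpoints(slices):
    """Return each endpoint address once, even if several slices list it."""
    seen = set()
    result = []
    for endpoint_slice in slices:
        for endpoint in endpoint_slice.get("endpoints", []):
            for address in endpoint.get("addresses", []):
                if address not in seen:
                    seen.add(address)
                    result.append(address)
    return result


# Hypothetical snapshot of a client cache: 10.1.2.4 is temporarily present
# in two slices because their updates arrived at different times.
slices = [
    {"metadata": {"name": "example-abc"},
     "endpoints": [{"addresses": ["10.1.2.3"]}, {"addresses": ["10.1.2.4"]}]},
    {"metadata": {"name": "example-def"},
     "endpoints": [{"addresses": ["10.1.2.4"]}, {"addresses": ["10.1.2.5"]}]},
]
print(unique_endpoints(slices))  # ['10.1.2.3', '10.1.2.4', '10.1.2.5']
```

A real client would also need to reconcile per-endpoint conditions and ports when the duplicates disagree; this sketch only deduplicates by address.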

## Comparison with Endpoints {#motivation}

The original Endpoints API provided a simple and straightforward way of
tracking network endpoints in Kubernetes. As Kubernetes clusters
and {{< glossary_tooltip text="Services" term_id="service" >}} grew to handle
more traffic and to send more traffic to more backend Pods, the
limitations of that original API became more visible.
Most notably, those included challenges with scaling to larger numbers of
network endpoints.

Since all network endpoints for a Service were stored in a single Endpoints
object, those Endpoints objects could get quite large. For Services that stayed
stable (the same set of endpoints over a long period of time) the impact was
acceptable; however, some use cases of Kubernetes weren't well served.

When a Service had a lot of endpoints and the service was either scaling
frequently, or rolling out new changes frequently, each update to the single
Endpoints object for that Service meant a lot of traffic between Kubernetes
cluster components (within the control plane, and also between nodes and the
API server). This extra traffic also had a cost in terms of CPU use.

With EndpointSlices, adding or removing a single Pod triggers the same _number_
of updates to clients that are watching for changes, but the size of those
update messages is much smaller at large scale.

EndpointSlices also enabled innovation around new features such as topology-aware
routing.

## {{% heading "whatsnext" %}}

133 changes: 81 additions & 52 deletions content/en/docs/concepts/services-networking/service.md
@@ -61,7 +61,8 @@ The Service abstraction enables this decoupling.

If you're able to use Kubernetes APIs for service discovery in your application,
you can query the {{< glossary_tooltip text="API server" term_id="kube-apiserver" >}}
for Endpoints, that get updated whenever the set of Pods in a Service changes.
for matching EndpointSlices. Kubernetes updates the EndpointSlices for a Service
whenever the set of Pods in a Service changes.

For non-native applications, Kubernetes offers ways to place a network port or load
balancer in between your application and the backend Pods.
@@ -159,8 +160,12 @@ Each port definition can have the same `protocol`, or a different one.
### Services without selectors

Services most commonly abstract access to Kubernetes Pods thanks to the selector,
but when used with a corresponding Endpoints object and without a selector, the Service can abstract other kinds of backends,
including ones that run outside the cluster. For example:
but when used with a corresponding set of
{{<glossary_tooltip term_id="endpoint-slice" text="EndpointSlices">}}
and without a selector, the Service can abstract other kinds of backends,
including ones that run outside the cluster.

For example:

* You want to have an external database cluster in production, but in your
test environment you use your own databases.
@@ -184,73 +189,94 @@ spec:
targetPort: 9376
```

Because this Service has no selector, the corresponding Endpoints object is not
created automatically. You can manually map the Service to the network address and port
where it's running, by adding an Endpoints object manually:
Because this Service has no selector, the corresponding EndpointSlice (and
legacy Endpoints) objects are not created automatically. You can map the Service
to the network address and port where it's running by adding an EndpointSlice
object manually. For example:

```yaml
apiVersion: v1
kind: Endpoints
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
# the name here should match the name of the Service
name: my-service
subsets:
name: my-service-1 # by convention, use the name of the Service
# as a prefix for the name of the EndpointSlice
labels:
# You should set the "kubernetes.io/service-name" label.
# Set its value to match the name of the Service
kubernetes.io/service-name: my-service
addressType: IPv4
ports:
- name: '' # empty because port 9376 is not assigned as a well-known
# port (by IANA)
appProtocol: http
protocol: TCP
port: 9376
endpoints:
- addresses:
- ip: 192.0.2.42
ports:
- port: 9376
- "10.4.5.6" # the IP addresses in this list can appear in any order
- "10.1.2.3"
```

The name of the Endpoints object must be a valid
[DNS subdomain name](/docs/concepts/overview/working-with-objects/names#dns-subdomain-names).

When you create an [Endpoints](/docs/reference/kubernetes-api/service-resources/endpoints-v1/)
object for a Service, you set the name of the new object to be the same as that
of the Service.
When you create an [EndpointSlice](#endpointslices) object for a Service, you can
use any name for the EndpointSlice. Each EndpointSlice in a namespace must have a
unique name. You link an EndpointSlice to a Service by setting the
`kubernetes.io/service-name` {{< glossary_tooltip text="label" term_id="label" >}}
on that EndpointSlice.

{{< note >}}
The endpoint IPs _must not_ be: loopback (127.0.0.0/8 for IPv4, ::1/128 for IPv6), or
link-local (169.254.0.0/16 and 224.0.0.0/24 for IPv4, fe80::/64 for IPv6).

Endpoint IP addresses cannot be the cluster IPs of other Kubernetes Services,
The endpoint IP addresses cannot be the cluster IPs of other Kubernetes Services,
because {{< glossary_tooltip term_id="kube-proxy" >}} doesn't support virtual IPs
as a destination.
{{< /note >}}
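
Before writing a manual EndpointSlice, you could check candidate addresses against the ranges listed in the note above. The Python sketch below does that with the standard `ipaddress` module; the helper name `is_disallowed_endpoint_ip` is hypothetical, and the check covers only the loopback and link-local ranges named in the note. It cannot detect Service cluster IPs, because those depend on your cluster's configuration.

```python
import ipaddress

# Ranges copied from the note above: loopback and link-local, IPv4 and IPv6.
DISALLOWED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "127.0.0.0/8",
        "::1/128",
        "169.254.0.0/16",
        "224.0.0.0/24",
        "fe80::/64",
    )
]


def is_disallowed_endpoint_ip(address):
    """Return True if the address falls inside one of the ranges above."""
    ip = ipaddress.ip_address(address)
    return any(
        ip.version == network.version and ip in network
        for network in DISALLOWED_NETWORKS
    )


for candidate in ("10.1.2.3", "127.0.0.1", "169.254.10.1", "fe80::1"):
    print(candidate, is_disallowed_endpoint_ip(candidate))
```

The two addresses from the example manifest (`10.1.2.3` and `10.4.5.6`) pass this check; the loopback and link-local candidates do not.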

Accessing a Service without a selector works the same as if it had a selector.
In the example above, traffic is routed to the single endpoint defined in
the YAML: `192.0.2.42:9376` (TCP).

{{< note >}}
The Kubernetes API server does not allow proxying to endpoints that are not mapped to
pods. Actions such as `kubectl proxy <service-name>` where the service has no
selector will fail due to this constraint. This prevents the Kubernetes API server
from being used as a proxy to endpoints the caller may not be authorized to access.
{{< /note >}}
In the example above, traffic is routed to one of the two endpoints defined in
the EndpointSlice manifest: a TCP connection to 10.1.2.3 or 10.4.5.6, on port 9376.

An ExternalName Service is a special case of Service that does not have
selectors and uses DNS names instead. For more information, see the
[ExternalName](#externalname) section later in this document.

### Over Capacity Endpoints
If an Endpoints resource has more than 1000 endpoints then a Kubernetes v1.22 (or later)
cluster annotates that Endpoints with `endpoints.kubernetes.io/over-capacity: truncated`.
This annotation indicates that the affected Endpoints object is over capacity and that
the endpoints controller has truncated the number of endpoints to 1000.

### EndpointSlices

{{< feature-state for_k8s_version="v1.21" state="stable" >}}

EndpointSlices are an API resource that can provide a more scalable alternative
to Endpoints. Although conceptually quite similar to Endpoints, EndpointSlices
allow for distributing network endpoints across multiple resources. By default,
an EndpointSlice is considered "full" once it reaches 100 endpoints, at which
point additional EndpointSlices will be created to store any additional
endpoints.
[EndpointSlices](/docs/concepts/services-networking/endpoint-slices/) are objects that
represent a subset (a _slice_) of the backing network endpoints for a Service.

Your Kubernetes cluster tracks how many endpoints each EndpointSlice represents.
If there are so many endpoints for a Service that a threshold is reached, then
Kubernetes adds another empty EndpointSlice and stores new endpoint information
there.
By default, Kubernetes makes a new EndpointSlice once the existing EndpointSlices
all contain at least 100 endpoints. Kubernetes does not make the new EndpointSlice
until an extra endpoint needs to be added.
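
The threshold behavior can be modeled with a short Python sketch. This is a deliberately simplified model under stated assumptions: slices are plain lists, a new endpoint goes into the first slice that still has room, and a new slice is created only when every existing slice already holds 100 endpoints. The real controller's placement logic may differ; `add_endpoint` and `MAX_ENDPOINTS_PER_SLICE` are names invented for this example.

```python
MAX_ENDPOINTS_PER_SLICE = 100  # the default threshold described above


def add_endpoint(slices, address, limit=MAX_ENDPOINTS_PER_SLICE):
    """Place an address in the first slice with room, or start a new slice."""
    for endpoint_slice in slices:
        if len(endpoint_slice) < limit:
            endpoint_slice.append(address)
            return slices
    # Every existing slice is full, and there is an extra endpoint to store:
    # only now does a new slice come into existence.
    slices.append([address])
    return slices


slices = []
for i in range(250):
    add_endpoint(slices, f"10.0.0.{i}")

print([len(s) for s in slices])  # [100, 100, 50]
```

Note that with exactly 100 endpoints the model still has a single slice; the second slice appears only when the 101st endpoint is added, matching the last sentence above.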

See [EndpointSlices](/docs/concepts/services-networking/endpoint-slices/) for more
information about this API.

### Endpoints

In the Kubernetes API, an
[Endpoints](/docs/reference/kubernetes-api/service-resources/endpoints-v1/)
(the resource kind is plural) defines a list of network endpoints, typically
referenced by a Service to define which Pods the traffic can be sent to.

The EndpointSlice API is the recommended replacement for Endpoints.

#### Over-capacity endpoints

If an Endpoints object has more than 1000 endpoints then a Kubernetes v1.22 (or later)
cluster annotates that Endpoints with `endpoints.kubernetes.io/over-capacity: truncated`.
This {{< glossary_tooltip text="annotation" term_id="annotation" >}} indicates that the
affected Endpoints object is over capacity and that the endpoints controller has truncated
the number of endpoints to 1000.

EndpointSlices provide additional attributes and functionality which is
described in detail in [EndpointSlices](/docs/concepts/services-networking/endpoint-slices/).
Traffic is still sent to backends, but any load balancing mechanism that relies on the
legacy Endpoints API only sends traffic to at most 1000 of the available backing endpoints.
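
The effect of truncation on a client of the legacy API can be sketched in Python as follows. The annotation key and the limit of 1000 come from the text above; the function name `truncate_endpoints` is hypothetical, and keeping the first 1000 addresses is an assumption, since the text does not say which endpoints the controller retains.

```python
MAX_ENDPOINTS = 1000
OVER_CAPACITY_ANNOTATION = "endpoints.kubernetes.io/over-capacity"


def truncate_endpoints(addresses, annotations=None, limit=MAX_ENDPOINTS):
    """Return (addresses, annotations) as a legacy Endpoints client would see them."""
    annotations = dict(annotations or {})
    if len(addresses) > limit:
        annotations[OVER_CAPACITY_ANNOTATION] = "truncated"
        # Assumption: this sketch keeps the first `limit` addresses.
        addresses = addresses[:limit]
    return addresses, annotations


backends = [f"backend-{i}" for i in range(1200)]
visible, annotations = truncate_endpoints(backends)
print(len(visible), annotations)
```

In this model, 200 of the 1200 backends are invisible to anything that reads only the Endpoints object, which is why a load balancer built on the legacy API would never send them traffic.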

### Application protocol

@@ -571,19 +597,22 @@ selectors defined:

### With selectors

For headless Services that define selectors, the endpoints controller creates
`Endpoints` records in the API, and modifies the DNS configuration to return
A records (IP addresses) that point directly to the `Pods` backing the `Service`.
For headless Services that define selectors, the Kubernetes control plane creates
EndpointSlice objects in the Kubernetes API, and modifies the DNS configuration to return
A or AAAA records (IPv4 or IPv6 addresses) that point directly to the Pods backing
the Service.

### Without selectors

For headless Services that do not define selectors, the endpoints controller does
not create `Endpoints` records. However, the DNS system looks for and configures
For headless Services that do not define selectors, the control plane does
not create EndpointSlice objects. However, the DNS system looks for and configures
either:

* CNAME records for [`ExternalName`](#externalname)-type Services.
* A records for any `Endpoints` that share a name with the Service, for all
other types.
* DNS CNAME records for [`type: ExternalName`](#externalname) Services.
* DNS A / AAAA records for all IP addresses of the Service's ready endpoints,
for all Service types other than `ExternalName`.
* For IPv4 endpoints, the DNS system creates A records.
* For IPv6 endpoints, the DNS system creates AAAA records.

## Publishing Services (ServiceTypes) {#publishing-services-service-types}

@@ -13,7 +13,7 @@ weight: 45

_Topology Aware Hints_ enable topology aware routing by including suggestions
for how clients should consume endpoints. This approach adds metadata to enable
consumers of EndpointSlice and / or Endpoints objects, so that traffic to
consumers of EndpointSlice (or Endpoints) objects, so that traffic to
those network endpoints can be routed closer to where it originated.

For example, you can route traffic within a locality to reduce
2 changes: 1 addition & 1 deletion content/en/docs/concepts/workloads/pods/pod-lifecycle.md
@@ -461,7 +461,7 @@ An example flow:
order. If the order of shutdowns matters, consider using a `preStop` hook to synchronize.
{{< /note >}}
1. At the same time as the kubelet is starting graceful shutdown, the control plane removes that
shutting-down Pod from Endpoints (and, if enabled, EndpointSlice) objects where these represent
shutting-down Pod from EndpointSlice (and Endpoints) objects where these represent
a {{< glossary_tooltip term_id="service" text="Service" >}} with a configured
{{< glossary_tooltip text="selector" term_id="selector" >}}.
{{< glossary_tooltip text="ReplicaSets" term_id="replica-set" >}} and other workload resources