
Service-level configuration #196

Closed
hbagdi opened this issue May 19, 2020 · 22 comments

Comments

@hbagdi
Contributor

hbagdi commented May 19, 2020

Current Status

We are collecting issues and use cases that should be associated with core.Service.
Once we have a good understanding of the problems, we will start discussing potential solutions, ultimately ending in a KEP. Please feel free to add your comments to this thread, and we will try to incorporate them here.

Summary

There are service-level configuration properties that cannot be specified anywhere in k8s or service-apis resources currently.
The community seems to have a consensus that core.Service is an overloaded resource with features concerning different areas, and that more fields should not be added to further bloat the Service abstraction.

While these features are laid out with Ingress or Gateway in mind, some of these configurations are also valid in other areas, such as clients of a Service running inside or outside the cluster perimeter, Service Mesh deployments, etc.

Issues

Load balancing

Some examples of load-balancing features (not exhaustive):

  • Algorithm to use for load balancing (L4/L7) and its properties
  • Retry behavior: if a request fails, it should be retried (L4/L7) a number of times, with delays, and to the same or different endpoints
  • Cluster- or region-aware load balancing of traffic. This also ties in with the Multi-cluster Services effort

core.Service has some aspects of load balancing using fields like SessionAffinity and ExternalTrafficPolicy. The existing fields are (intentionally?) not exhaustive but also don’t provide any extension mechanisms.
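
To make the gap concrete, here is a minimal sketch, in Go, of what a hypothetical service-scoped load-balancing policy could look like. None of these type or field names exist in Kubernetes or service-apis; they are invented purely for illustration.

```go
// Hypothetical, for illustration only: a service-scoped load-balancing
// policy capturing behavior that core.Service cannot express today.
package example

// LoadBalancingPolicy applies to all clients of one Service.
type LoadBalancingPolicy struct {
	// TargetService is the Service this policy applies to (same namespace).
	TargetService string `json:"targetService"`

	// Algorithm selects the balancing algorithm, e.g. "RoundRobin",
	// "LeastConnections", or "ConsistentHash" (L4/L7 dependent).
	Algorithm string `json:"algorithm,omitempty"`

	// Retry describes retry behavior when a request to an endpoint fails.
	Retry *RetryPolicy `json:"retry,omitempty"`

	// TopologyPreference expresses zone/region-aware balancing, tying in
	// with the multi-cluster Services effort.
	TopologyPreference []string `json:"topologyPreference,omitempty"`
}

// RetryPolicy describes how failed requests are retried.
type RetryPolicy struct {
	Attempts      int      `json:"attempts"`                // how many times to retry
	PerTryTimeout string   `json:"perTryTimeout,omitempty"` // e.g. "250ms"
	BackoffDelay  string   `json:"backoffDelay,omitempty"`  // delay between attempts
	RetryOn       []string `json:"retryOn,omitempty"`       // e.g. ["connect-failure", "5xx"]
	// SameEndpoint controls whether retries may go to the same endpoint
	// or must pick a different one.
	SameEndpoint bool `json:"sameEndpoint,omitempty"`
}
```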

Related issues:

Traffic management

Service upgrades are performed in a gradual way to avoid surprises. Canarying and mirroring traffic are some of the most common ways this is done. Ingress and mesh vendors provide such features using CRDs. The service-apis project has also seen a lot of interest in this area. While this spans multiple core.Service resources, it might be worth exploring how an end-user should think of Services and endpoints when it comes to upgrades.
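
As an illustration of the kind of configuration vendors expose today via CRDs, below is a minimal Go sketch of a hypothetical traffic-split spec for canarying and mirroring across Services; the type and field names are invented for this example and do not reflect any agreed API.

```go
// Hypothetical, for illustration only: a traffic-split spec of the kind
// vendors expose via CRDs for canary and mirror rollouts.
package example

// TrafficSplit distributes traffic across several backing Services.
type TrafficSplit struct {
	// Backends are the Services receiving live traffic; weights are
	// percentages and should sum to 100.
	Backends []WeightedService `json:"backends"`

	// Mirror, if set, names a Service that receives a copy of the traffic
	// without affecting the response returned to the client.
	Mirror string `json:"mirror,omitempty"`
}

// WeightedService pairs a Service name with its share of traffic.
type WeightedService struct {
	Service string `json:"service"`
	Weight  int    `json:"weight"`
}
```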

Related issues:

Health checking

While Kubernetes performs health checking of the pods that are associated with a Service, it is common for proxies and load balancers to perform health checking on their side as well.
There have been some implementations which derive the health-checking configuration from the liveness or readiness probes, but those seem incorrect because the kubelet and the proxy have different concerns. A hypothetical sketch of such a policy follows the feature list below.

Feature requests:

  • Active health-checking (L4/L7)
  • Passive health-checking a.k.a. circuit-breaking
  • Timeout policies (overlaps with the L4 details section below)
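
The sketch referenced above: a minimal Go outline of a per-Service health-checking policy covering both active probing and passive (circuit-breaking) behavior. All names and fields are invented for illustration only.

```go
// Hypothetical, for illustration only: a per-Service health-checking
// policy as a proxy or load balancer might consume it, independent of
// the kubelet's liveness/readiness probes.
package example

// HealthCheckPolicy describes health checking for one Service's endpoints.
type HealthCheckPolicy struct {
	// Active probing performed by the proxy itself (L4 or L7).
	Active *ActiveCheck `json:"active,omitempty"`

	// Passive health checking, a.k.a. circuit breaking, based on
	// observed request failures.
	Passive *PassiveCheck `json:"passive,omitempty"`
}

type ActiveCheck struct {
	Interval           string `json:"interval"`           // e.g. "5s"
	Timeout            string `json:"timeout"`            // per-probe timeout
	HealthyThreshold   int    `json:"healthyThreshold"`   // probes to mark healthy
	UnhealthyThreshold int    `json:"unhealthyThreshold"` // probes to mark unhealthy
	HTTPPath           string `json:"httpPath,omitempty"` // L7 checks only
}

type PassiveCheck struct {
	ConsecutiveFailures int    `json:"consecutiveFailures"` // failures before ejection
	EjectionDuration    string `json:"ejectionDuration"`    // how long to eject an endpoint
}
```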

Related issues:

L4 details

Some examples of such properties (not exhaustive):

  • Timeouts: connect, read, and write timeouts
  • Capacity: number of requests per second, maximum number of simultaneous connections to an endpoint of the Service, maximum number of outstanding requests
  • Keepalive settings for connection reuse
  • Idle connection timeout policies and connection draining settings

Most of these properties overlap with other sections, as they are generic networking properties.
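
A minimal Go sketch of how such connection-level settings might be grouped for a single Service follows; the names are invented for this example, and duration values are shown as strings purely for readability.

```go
// Hypothetical, for illustration only: connection-level (L4) settings a
// client or proxy would apply when talking to a Service's endpoints.
package example

// ConnectionSettings groups the generic networking knobs listed above.
type ConnectionSettings struct {
	ConnectTimeout string `json:"connectTimeout,omitempty"` // e.g. "5s"
	ReadTimeout    string `json:"readTimeout,omitempty"`
	WriteTimeout   string `json:"writeTimeout,omitempty"`
	IdleTimeout    string `json:"idleTimeout,omitempty"` // close idle connections after this

	MaxRequestsPerSecond   int `json:"maxRequestsPerSecond,omitempty"`
	MaxConnectionsPerHost  int `json:"maxConnectionsPerHost,omitempty"` // per endpoint
	MaxOutstandingRequests int `json:"maxOutstandingRequests,omitempty"`

	// Keepalive and draining behavior for connection reuse and graceful
	// endpoint removal.
	KeepaliveTime string `json:"keepaliveTime,omitempty"`
	DrainTimeout  string `json:"drainTimeout,omitempty"`
}
```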
Related issues/PRs:

Existing workarounds

Currently, there is no good place to put these configurations.
This has led various projects to add such properties to networking.Ingress-like resources.

Such duplication causes:

  • Ad-hoc annotations; the community has a consensus that extensions via annotations for common use cases such as the ones mentioned here are not the way forward
  • Fragmentation in the ecosystem, where common properties have to be defined in multiple places
  • Collisions when service-level properties are defined in Ingress or in CRDs for configuring routing rules and multiple routes point to the same Service. Proxies resort to creating an upstream pool for each route that points to a Service, which defeats the purpose of having the Service abstraction.

Misc notes

Global default

While per-service configuration is required, cluster operators would like to configure a sane global default, which should be used in the absence of any per-service configuration.

Client vs Server

Definitions:

  • Server: A server is the one responding to a client's request. This is represented by the core.Service abstraction in k8s. Server and Service are used interchangeably in this section.
  • Client: A client is software or a human accessing the server. In the case of service-apis, the Gateway can be considered a client of the Service.

Configurations such as timeouts are properties of both the server and the client. A client, when attempting to connect to a server, will specify a timeout. On the other hand, the server wants to protect itself from too many idle connections and will have a timeout on connections from its side.
In the context of this issue, we are referring to client configurations, not server ones.

That raises the question: why associate such configuration with a Service rather than with a client-controlled resource such as HTTPRoute?
I (@hbagdi) think this is because we want to define the configuration in the context of a Service and want clients to use it whenever they are communicating with the Service. There is no way to enforce that clients will use these configurations, but if a client is communicating with a Service, it should follow the properties defined above if it expects certain SLAs to be met (thereby making such properties part of the contract between the client and the server/Service). This can be thought of as analogous to how a client should use a published port if it wishes to communicate with a Service.

Note: When considering a proxy or gateway, there are connections on two sides, one between the user and the gateway and the other between the gateway and the Service. This issue discusses the latter.

Extensibility

While standardization of the above features would provide a good experience for end-users, extensibility is important. Areas like load balancing and L4 properties can vary widely between implementations.
If a solution lacks hooks for extensions, we risk devising something that only works for a small portion of our users, and the existing workarounds described above will continue to exist. The goal should not be to eliminate such workarounds entirely but to minimize them as much as possible.

kube-proxy

We need to explore where kube-proxy fits into this. It seems to be another internal gateway or client of the Service. There are some aspects of the above that kube-proxy does implement. Should it continue to implement those? Should those be deprecated instead? Should the scope of kube-proxy be limited so as not to complicate it further?

@hbagdi added the kind/feature (Categorizes issue or PR as related to a new feature.) label May 19, 2020
@hbagdi
Contributor Author

hbagdi commented May 19, 2020

/remove kind/feature

@hbagdi
Contributor Author

hbagdi commented May 19, 2020

cc @bowei

@bowei
Contributor

bowei commented May 19, 2020

Thanks @hbagdi, this is a great write-up.

@hbagdi
Contributor Author

hbagdi commented May 19, 2020

I think I've missed a few things, so we will have to get feedback and add more to the list.

I've intentionally not included TLS-level settings and AppProtocol in the list because I'm not sure whether they are service-level or not. Some of them are, but not all.

@hbagdi
Contributor Author

hbagdi commented May 19, 2020

cc @danehans
Sorry, forgot to cc you earlier, Daneyon.

@bowei
Contributor

bowei commented May 19, 2020

/assign

@jpeach
Contributor

jpeach commented May 20, 2020

/cc @jpeach @youngnick

@bowei
Contributor

bowei commented May 20, 2020

Discussion from office hours:

A common use case involves different teams (e.g. cluster operator vs. app author): the cluster operator wants to "fix up" application settings due to regulatory or resource issues:

  1. Cap timeouts and retries
  2. Cap the number of simultaneous connections

It is likely that we can't cover 100% of these use cases (Open Policy Agent can be used for the rest), but we should keep this use case in mind.
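
As a rough sketch of the capping idea only (the types and limits below are invented, and a real mechanism might instead live in admission control or Open Policy Agent), the operator's policy would clamp whatever the application author asked for:

```go
// Hypothetical, for illustration only: clamping app-author settings to
// operator-defined caps.
package example

import "time"

// OperatorCaps are the cluster operator's upper bounds.
type OperatorCaps struct {
	MaxTimeout     time.Duration
	MaxRetries     int
	MaxConnections int
}

// AppSettings are what the application author requested.
type AppSettings struct {
	Timeout     time.Duration
	Retries     int
	Connections int
}

// Clamp returns the effective settings: the author's values, capped by
// the operator's limits.
func Clamp(app AppSettings, caps OperatorCaps) AppSettings {
	if app.Timeout > caps.MaxTimeout {
		app.Timeout = caps.MaxTimeout
	}
	if app.Retries > caps.MaxRetries {
		app.Retries = caps.MaxRetries
	}
	if app.Connections > caps.MaxConnections {
		app.Connections = caps.MaxConnections
	}
	return app
}
```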

@yiyangy
Contributor

yiyangy commented May 21, 2020

A couple of questions/comments:

  • Is Retry a route-level or service-level configuration?
    More specifically, is Retry policy specific to a route? Say if multiple routes point to the same service, shall they use the same Retry policy?

  • Capacity is also a service-level configuration.
    Capacity can be the maximum number of requests per second or the maximum number of simultaneous connections. We probably want to see whether there are common use cases that require this information.

@hbagdi
Contributor Author

hbagdi commented May 27, 2020

Is Retry a route-level or service-level configuration?
More specifically, is Retry policy specific to a route? Say if multiple routes point to the same service, shall they use the same Retry policy?

@yiyangy, great question. I've seen this done at both levels in different setups. I defined this at the service level because that's what I've seen more widely adopted. We can certainly revisit.

@youngnick mentioned in one of the meetings how users use a Gateway or proxies to fix things up. Meaning, if services haven't implemented retries or timeouts correctly, or if services all have different behavior around this, the gateway is used to fix things up.


@costinm

costinm commented May 28, 2020

Composing by name seems like a very clean and simple solution: in the 'server' namespace they would match the Service name; in the client namespace they would be based on the qualified service name (svc.namespace.svc).

@danehans
Contributor

My notes from the 5/28/20 meeting:

  • Adding these settings to a Service is not preferred due to the existing complexity of the resource.
  • One or more separate resources should be considered to express client and server (i.e., Service) settings.
  • A client and server may reside in separate namespaces, where RBAC prevents a single resource from being used to express client and server settings.
  • Client settings should be able to override server settings.
  • Gateway and HTTPRoute may override these settings.

@howardjohn
Contributor

One thing to note: I think there are two distinct but related topics here. All of these examples are based on the client and server having an Envoy sidecar, since I am familiar with this model, but they should generally apply.

First is where the config is actually applied. This can be

  • Client-side only. For example, the LB algorithm can only be set by the client.
  • Server-side only. For example, authn/authz (technically this could be on the client, but it's not actually secure).
  • Both sides. For example, timeouts. A server may want to set a timeout of 5s to protect itself, but a client may choose to limit its own requests to only 1s.

The second is where the config comes from. For "server-side only", the only reasonable location for the config is alongside the server (in Service or a related resource, certainly in the same namespace). However, for client settings it's reasonable to be able to configure defaults in the server namespace, with overrides in the client namespace.
You can also potentially have global defaults for all of these settings.
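
To illustrate the precedence this implies, here is a minimal Go sketch, assuming hypothetical global defaults, server-namespace defaults, and client-namespace overrides; nothing here reflects an agreed API.

```go
// Hypothetical, for illustration only: resolving an effective client-side
// timeout from a global default, a server-namespace default, and a
// client-namespace override, in increasing order of precedence.
package example

import "time"

// Setting is a timeout that may or may not be specified at a given scope.
type Setting struct {
	Timeout *time.Duration
}

// Resolve walks the scopes from lowest to highest precedence and returns
// the last value that was actually set.
func Resolve(global, serverDefault, clientOverride Setting) time.Duration {
	effective := 30 * time.Second // built-in fallback, invented for this sketch
	for _, s := range []Setting{global, serverDefault, clientOverride} {
		if s.Timeout != nil {
			effective = *s.Timeout
		}
	}
	return effective
}
```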

@hbagdi
Contributor Author

hbagdi commented May 29, 2020

Hello all,
I've incorporated the feedback I've heard in this thread and from the weekly meetings into the original issue. The issue has a few new sections so please give it another pass. Thanks!

@bowei
Contributor

bowei commented Jun 10, 2020

I threw together a couple of ideas here:

https://docs.google.com/document/d/1Kz2X7zKfaSGW9YTlzqFeFuTCj5uDxdB9D363nMkB6xk/edit#

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale (Denotes an issue or PR has remained open with no activity and has become stale.) label Sep 9, 2020
@jpeach
Contributor

jpeach commented Sep 10, 2020

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale (Denotes an issue or PR has remained open with no activity and has become stale.) label Sep 10, 2020
@maplain

maplain commented Nov 17, 2020

@hbagdi
we also have LoadBalancerSourceRanges in Service for firewall or policy related configurations.

@robscott
Member

robscott commented Dec 2, 2020

@hbagdi This issue has become pretty large in scope, would it be possible to split this up into smaller issues so it is easier to track progress on each part? For a bit more context, I'm trying to build out a list of what we might want to accomplish in the next API release. Ideally we'd be able to link each list item to a unique GitHub issue, likely still with this umbrella issue. Still trying to figure out the best way to structure/organize all this.

@hbagdi
Contributor Author

hbagdi commented Dec 7, 2020

Update:
As Rob said, this umbrella issue has become large and served its purpose.
Some specifics before closing this issue:

Please let me know if something was missed here.
/close

@k8s-ci-robot
Contributor

@hbagdi: Closing this issue.

In response to this:

Update:
As Rob said, this umbrella issue has become large and served its purpose.
Some specifics before closing this issue:

Please let me know if something was missed here.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
