
Service-level configuration #196

Closed
hbagdi opened this issue May 19, 2020 · 22 comments

Comments

@hbagdi
Contributor

hbagdi commented May 19, 2020

Current Status

We are collecting issues and use cases that should be associated with core.Service.
Once we have a good understanding of the problems, we will start discussing potential solutions, ultimately ending in a KEP. Please feel free to add your comments to this thread, and we will try to incorporate them here.

Summary

There are service-level configuration properties that cannot be specified anywhere in k8s or service-apis resources currently.
The community seems to have a consensus that core.Service is an overloaded resource with features concerning different areas, and that more fields should not be added to further bloat the Service abstraction.

While these features are laid out with Ingress or Gateway in mind, some of these configurations are also valid in other areas, such as clients of a Service running inside or outside the cluster perimeter, Service Mesh deployments, etc.

Issues

Load balancing

Some examples of load-balancing features (not exhaustive):

  • Algorithm to use for load balancing (L4/L7) and its properties
  • Retry behavior: if a request fails, it should be retried (L4/L7) a number of times, with delays, and to the same or different endpoints
  • Cluster- or region-aware load balancing of traffic. This also ties in with the Multi-cluster Services effort

core.Service has some aspects of load balancing using fields like SessionAffinity and ExternalTrafficPolicy. The existing fields are (intentionally?) not exhaustive but also don’t provide any extension mechanisms.
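
To make the gap concrete, here is a minimal sketch, in Go, of what a hypothetical service-scoped load-balancing policy could look like. None of these type or field names exist in Kubernetes or service-apis; they are invented purely for illustration.

```go
// Hypothetical, for illustration only: a service-scoped load-balancing
// policy capturing behavior that core.Service cannot express today.
package example

// LoadBalancingPolicy applies to all clients of one Service.
type LoadBalancingPolicy struct {
	// TargetService is the Service this policy applies to (same namespace).
	TargetService string `json:"targetService"`

	// Algorithm selects the balancing algorithm, e.g. "RoundRobin",
	// "LeastConnections", or "ConsistentHash" (L4/L7 dependent).
	Algorithm string `json:"algorithm,omitempty"`

	// Retry describes retry behavior when a request to an endpoint fails.
	Retry *RetryPolicy `json:"retry,omitempty"`

	// TopologyPreference expresses zone/region-aware balancing, tying in
	// with the multi-cluster Services effort.
	TopologyPreference []string `json:"topologyPreference,omitempty"`
}

// RetryPolicy describes how failed requests are retried.
type RetryPolicy struct {
	Attempts      int      `json:"attempts"`                // how many times to retry
	PerTryTimeout string   `json:"perTryTimeout,omitempty"` // e.g. "250ms"
	BackoffDelay  string   `json:"backoffDelay,omitempty"`  // delay between attempts
	RetryOn       []string `json:"retryOn,omitempty"`       // e.g. ["connect-failure", "5xx"]
	// SameEndpoint controls whether retries may go to the same endpoint
	// or must pick a different one.
	SameEndpoint bool `json:"sameEndpoint,omitempty"`
}
```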

Related issues:

Traffic management

Service upgrades are performed in a gradual way to avoid surprises. Canarying and mirroring traffic are some of the most common ways this is done. Ingress and mesh vendors provide such features using CRDs. The service-apis project has also seen a lot of interest in this area. While this spans multiple core.Service resources, it might be worth exploring how an end-user should think of Services and endpoints when it comes to upgrades.
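
As an illustration of the kind of configuration vendors expose today via CRDs, below is a minimal Go sketch of a hypothetical traffic-split spec for canarying and mirroring across Services; the type and field names are invented for this example and do not reflect any agreed API.

```go
// Hypothetical, for illustration only: a traffic-split spec of the kind
// vendors expose via CRDs for canary and mirror rollouts.
package example

// TrafficSplit distributes traffic across several backing Services.
type TrafficSplit struct {
	// Backends are the Services receiving live traffic; weights are
	// percentages and should sum to 100.
	Backends []WeightedService `json:"backends"`

	// Mirror, if set, names a Service that receives a copy of the traffic
	// without affecting the response returned to the client.
	Mirror string `json:"mirror,omitempty"`
}

// WeightedService pairs a Service name with its share of traffic.
type WeightedService struct {
	Service string `json:"service"`
	Weight  int    `json:"weight"`
}
```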

Related issues:

Health checking

While Kubernetes performs health checking of the pods that are associated with a Service, it is common for proxies and load balancers to perform health checking on their side as well.
There have been some implementations which derive the health-checking configuration from the liveness or readiness probes, but those seem incorrect because the kubelet and the proxy have different concerns. A hypothetical sketch of such a policy follows the feature list below.

Feature requests:

  • Active health-checking (L4/L7)
  • Passive health-checking a.k.a. circuit-breaking
  • Timeout policies (overlaps with the L4 details section below)
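
The sketch referenced above: a minimal Go outline of a per-Service health-checking policy covering both active probing and passive (circuit-breaking) behavior. All names and fields are invented for illustration only.

```go
// Hypothetical, for illustration only: a per-Service health-checking
// policy as a proxy or load balancer might consume it, independent of
// the kubelet's liveness/readiness probes.
package example

// HealthCheckPolicy describes health checking for one Service's endpoints.
type HealthCheckPolicy struct {
	// Active probing performed by the proxy itself (L4 or L7).
	Active *ActiveCheck `json:"active,omitempty"`

	// Passive health checking, a.k.a. circuit breaking, based on
	// observed request failures.
	Passive *PassiveCheck `json:"passive,omitempty"`
}

type ActiveCheck struct {
	Interval           string `json:"interval"`           // e.g. "5s"
	Timeout            string `json:"timeout"`            // per-probe timeout
	HealthyThreshold   int    `json:"healthyThreshold"`   // probes to mark healthy
	UnhealthyThreshold int    `json:"unhealthyThreshold"` // probes to mark unhealthy
	HTTPPath           string `json:"httpPath,omitempty"` // L7 checks only
}

type PassiveCheck struct {
	ConsecutiveFailures int    `json:"consecutiveFailures"` // failures before ejection
	EjectionDuration    string `json:"ejectionDuration"`    // how long to eject an endpoint
}
```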

Related issues:

L4 details

Some examples of such properties (not exhaustive):

  • Timeouts: connect, read, and write timeouts
  • Capacity: number of requests per second, maximum number of simultaneous connections to an endpoint of the Service, maximum number of outstanding requests
  • Keepalive settings for connection reuse
  • Idle connection timeout policies and connection draining settings

Most of these properties overlap with other sections, as they are generic networking properties.
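
A minimal Go sketch of how such connection-level settings might be grouped for a single Service follows; the names are invented for this example, and duration values are shown as strings purely for readability.

```go
// Hypothetical, for illustration only: connection-level (L4) settings a
// client or proxy would apply when talking to a Service's endpoints.
package example

// ConnectionSettings groups the generic networking knobs listed above.
type ConnectionSettings struct {
	ConnectTimeout string `json:"connectTimeout,omitempty"` // e.g. "5s"
	ReadTimeout    string `json:"readTimeout,omitempty"`
	WriteTimeout   string `json:"writeTimeout,omitempty"`
	IdleTimeout    string `json:"idleTimeout,omitempty"` // close idle connections after this

	MaxRequestsPerSecond   int `json:"maxRequestsPerSecond,omitempty"`
	MaxConnectionsPerHost  int `json:"maxConnectionsPerHost,omitempty"` // per endpoint
	MaxOutstandingRequests int `json:"maxOutstandingRequests,omitempty"`

	// Keepalive and draining behavior for connection reuse and graceful
	// endpoint removal.
	KeepaliveTime string `json:"keepaliveTime,omitempty"`
	DrainTimeout  string `json:"drainTimeout,omitempty"`
}
```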
Related issues/PRs:

Existing workarounds

Currently, there is no good place to put these configurations.
This has led various projects to add such properties to networking.Ingress-like resources.

Such duplication causes:

  • Ad-hoc annotations; the community has a consensus that extensions via annotations for common use cases such as the ones mentioned here are not the way forward
  • Fragmentation in the ecosystem, where common properties have to be defined in multiple places
  • Collisions when service-level properties are defined in Ingress or in CRDs for configuring routing rules and multiple routes point to the same Service. Proxies resort to creating an upstream pool for each route that points to a Service, which defeats the purpose of having the Service abstraction.

Misc notes

Global default

While per-service configuration is required, cluster operators would like to configure a sane global default, which should be used in the absence of any per-service configuration.

Client vs Server

Definitions:

  • Server: A server is the one responding to a client's request. This is represented by the core.Service abstraction in k8s. Server and Service are used interchangeably in this section.
  • Client: A client is software or a human accessing the server. In the case of service-apis, the Gateway can be considered a client of the Service.

Configurations such as timeouts are properties of both the server and the client. A client, when attempting to connect to a server, will specify a timeout. On the other hand, the server wants to protect itself from too many idle connections and will have a timeout on connections from its side.
In the context of this issue, we are referring to client configurations, not server ones.

That raises the question: why associate such configuration with a Service rather than with a client-controlled resource such as HTTPRoute?
I (@hbagdi) think this is because we want to define the configuration in the context of a Service and want clients to use it whenever they are communicating with the Service. There is no way to enforce that clients will use these configurations, but if a client is communicating with a Service, it should follow the properties defined above if it expects certain SLAs to be met (thereby making such properties part of the contract between the client and the server/Service). This can be thought of as analogous to how a client should use a published port if it wishes to communicate with a Service.

Note: When considering a proxy or gateway, there are connections on two sides, one between the user and the gateway and the other between the gateway and the Service. This issue discusses the latter.

Extensibility

While standardization of the above features would provide a good experience for end-users, extensibility is important. Areas like load balancing and L4 properties can vary widely between implementations.
If a solution lacks hooks for extensions, we risk devising something that only works for a small portion of our users, and the existing workarounds described above will continue to exist. The goal should not be to eliminate such workarounds entirely but to minimize them as much as possible.

kube-proxy

We need to explore where kube-proxy fits into this. It seems to be another internal gateway or client of the Service. There are some aspects of the above that kube-proxy does implement. Should it continue to implement those? Should those be deprecated instead? Should the scope of kube-proxy be limited so as not to complicate it further?

@hbagdi added the kind/feature (Categorizes issue or PR as related to a new feature.) label May 19, 2020
@hbagdi
Contributor Author

hbagdi commented May 19, 2020

/remove kind/feature

@hbagdi
Contributor Author

hbagdi commented May 19, 2020

cc @bowei

@bowei
Contributor

bowei commented May 19, 2020

Thanks @hbagdi, this is a great write-up.

@hbagdi
Contributor Author

hbagdi commented May 19, 2020

I think I've missed a few things, so we will have to get feedback and add more to the list.

I've intentionally not included TLS-level settings and AppProtocol in the list because I'm not sure whether they are service-level or not. Some of them are, but not all.

@hbagdi
Contributor Author

hbagdi commented May 19, 2020

cc @danehans
Sorry, forgot to cc you earlier, Daneyon.

@bowei
Contributor

bowei commented May 19, 2020

/assign

@jpeach
Contributor

jpeach commented May 20, 2020

/cc @jpeach @youngnick

@bowei
Contributor

bowei commented May 20, 2020

Discussion from office hours:

A common use case involves different teams (e.g. cluster operator vs. app author): the cluster operator wants to "fix up" application settings due to regulatory or resource issues:

  1. Cap timeouts and retries
  2. Cap the number of simultaneous connections

It is likely that we can't cover 100% of these use cases (Open Policy Agent can be used for the rest), but we should keep this use case in mind.
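
As a rough sketch of the capping idea only (the types and limits below are invented, and a real mechanism might instead live in admission control or Open Policy Agent), the operator's policy would clamp whatever the application author asked for:

```go
// Hypothetical, for illustration only: clamping app-author settings to
// operator-defined caps.
package example

import "time"

// OperatorCaps are the cluster operator's upper bounds.
type OperatorCaps struct {
	MaxTimeout     time.Duration
	MaxRetries     int
	MaxConnections int
}

// AppSettings are what the application author requested.
type AppSettings struct {
	Timeout     time.Duration
	Retries     int
	Connections int
}

// Clamp returns the effective settings: the author's values, capped by
// the operator's limits.
func Clamp(app AppSettings, caps OperatorCaps) AppSettings {
	if app.Timeout > caps.MaxTimeout {
		app.Timeout = caps.MaxTimeout
	}
	if app.Retries > caps.MaxRetries {
		app.Retries = caps.MaxRetries
	}
	if app.Connections > caps.MaxConnections {
		app.Connections = caps.MaxConnections
	}
	return app
}
```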

@yiyangy
Contributor

yiyangy commented May 21, 2020

A couple of questions/comments:

  • Is Retry a route-level or service-level configuration?
    More specifically, is Retry policy specific to a route? Say if multiple routes point to the same service, shall they use the same Retry policy?

  • Capacity is also a service-level configuration.
    Capacity can be the maximum number of requests per second or the maximum number of simultaneous connections. We probably want to see whether there are common use cases that require this information.

@hbagdi
Contributor Author

hbagdi commented May 27, 2020

Is Retry a route-level or service-level configuration?
More specifically, is Retry policy specific to a route? Say if multiple routes point to the same service, shall they use the same Retry policy?

@yiyangy, great question. I've seen this done at both levels in different setups. I defined this at the service level because that's what I've seen more widely adopted. We can certainly revisit.

@youngnick mentioned in one of the meetings how users use a Gateway or proxies to fix things up. Meaning, if services haven't implemented retries or timeouts correctly, or if services all have different behavior around this, the gateway is used to fix things up.


@costinm

costinm commented May 28, 2020

Composing by name seems like a very clean and simple solution: in the 'server' namespace they would match the Service name; in the client namespace they would be based on the qualified service name (svc.namespace.svc).

@danehans
Contributor

My notes from the 5/28/20 meeting:

  • Adding these settings to a Service is not preferred due to the existing complexity of the resource.
  • One or more separate resources should be considered to express client and server (i.e., Service) settings.
  • A client and server may reside in separate namespaces, where RBAC prevents a single resource from being used to express client and server settings.
  • Client settings should be able to override server settings.
  • Gateway and HTTPRoute may override these settings.

@howardjohn
Contributor

One thing to note: I think there are two distinct but related topics here. All of these examples are based on the client and server having an Envoy sidecar, since I am familiar with this model, but they should generally apply.

First is where the config is actually applied. This can be

  • Client-side only. For example, the LB algorithm can only be set by the client.
  • Server-side only. For example, authn/authz (technically this could be on the client, but it's not actually secure).
  • Both sides. For example, timeouts. A server may want to set a timeout of 5s to protect itself, but a client may choose to limit its own requests to only 1s.

The second is where the config comes from. For "server-side only", the only reasonable location for the config is alongside the server (in Service or a related resource, certainly in the same namespace). However, for client settings it's reasonable to be able to configure defaults in the server namespace, with overrides in the client namespace.
You can also potentially have global defaults for all of these settings.
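
To illustrate the precedence this implies, here is a minimal Go sketch, assuming hypothetical global defaults, server-namespace defaults, and client-namespace overrides; nothing here reflects an agreed API.

```go
// Hypothetical, for illustration only: resolving an effective client-side
// timeout from a global default, a server-namespace default, and a
// client-namespace override, in increasing order of precedence.
package example

import "time"

// Setting is a timeout that may or may not be specified at a given scope.
type Setting struct {
	Timeout *time.Duration
}

// Resolve walks the scopes from lowest to highest precedence and returns
// the last value that was actually set.
func Resolve(global, serverDefault, clientOverride Setting) time.Duration {
	effective := 30 * time.Second // built-in fallback, invented for this sketch
	for _, s := range []Setting{global, serverDefault, clientOverride} {
		if s.Timeout != nil {
			effective = *s.Timeout
		}
	}
	return effective
}
```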

@hbagdi
Contributor Author

hbagdi commented May 29, 2020

Hello all,
I've incorporated the feedback I've heard in this thread and from the weekly meetings into the original issue. The issue has a few new sections so please give it another pass. Thanks!

@bowei
Contributor

bowei commented Jun 10, 2020

I threw together a couple of ideas here:

https://docs.google.com/document/d/1Kz2X7zKfaSGW9YTlzqFeFuTCj5uDxdB9D363nMkB6xk/edit#

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale (Denotes an issue or PR has remained open with no activity and has become stale.) label Sep 9, 2020
@jpeach
Contributor

jpeach commented Sep 10, 2020

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale (Denotes an issue or PR has remained open with no activity and has become stale.) label Sep 10, 2020
@maplain

maplain commented Nov 17, 2020

@hbagdi
we also have LoadBalancerSourceRanges in Service for firewall or policy related configurations.

@robscott
Member

robscott commented Dec 2, 2020

@hbagdi This issue has become pretty large in scope, would it be possible to split this up into smaller issues so it is easier to track progress on each part? For a bit more context, I'm trying to build out a list of what we might want to accomplish in the next API release. Ideally we'd be able to link each list item to a unique GitHub issue, likely still with this umbrella issue. Still trying to figure out the best way to structure/organize all this.

@hbagdi
Contributor Author

hbagdi commented Dec 7, 2020

Update:
As Rob said, this umbrella issue has become large and served its purpose.
Some specifics before closing this issue:

Please let me know if something was missed here.
/close

@k8s-ci-robot
Contributor

@hbagdi: Closing this issue.

In response to this:

Update:
As Rob said, this umbrella issue has become large and served its purpose.
Some specifics before closing this issue:

Please let me know if something was missed here.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
