Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extensible Service policy and configuration #611

Closed
mark-church opened this issue Apr 13, 2021 · 6 comments · Fixed by #736
Closed

Extensible Service policy and configuration #611

mark-church opened this issue Apr 13, 2021 · 6 comments · Fixed by #736
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature.
Milestone

Comments

@mark-church
Copy link
Contributor

mark-church commented Apr 13, 2021

This proposal draws inspiration from our current TLS design and offers a uniform method of attaching arbitrary and implementation-specific configuration to the Gateway API core resources.

It aims to achieve three different goals that I think are very important for a standard API spec. This is based on the assumptions that 1) every implementation will have proprietary functionality that's also unique and 2) we can't predict what this functionality is or where in the load balancing hierarchy it should be configured. As a result, to make Gateway a standard that can actually replace proprietary APIs, it must have a robust way of accomplishing the following three things:

  1. Provide a standard way of extending arbitrary and proprietary configuration and policy on top of the core resources of the Gateway API
  2. Provide the ability to configure (arbitrary and proprietary) policy across different scopes of the load balancing infrastructure, from GatewayClass, to Gateway, to HTTPRoute, to Services, to individual workloads.
  3. Provide a way for policy to compose: policy should be able to be configurable simultaneously at multiple scopes (both broad and granular) by non-coordinating users and that policy should compose together to a single effective policy for a given service.

This is just a proposal so I'm happy to take any suggestions on what should change or how we can make it better. cc @bowei @thockin @robscott @hbagdi @youngnick @danehans @howardjohn

@mark-church mark-church added the kind/feature Categorizes issue or PR as related to a new feature. label Apr 13, 2021
@youngnick
Copy link
Contributor

youngnick commented Apr 14, 2021

(ported from a Slack comment)

We had a lot of similar problems with HTTPProxy, Contour's own CRD, which we tried solving in a similar way, but we found it very quickly grew unwieldy because of the interactions between different policy options.

The best example for HTTPProxy was client request timeout. In HTTPProxy's API, we allow setting this to the special value "infinity", which means that Envoy will disable this timeout. However, when running a central ingress proxy, some organisations don't want to allow people to do this. We experimented with a few ways to do this in the API, and eventually just recommended that, if you need that level of granularity, you use Gatekeeper as an admission controller, and we provide example PolicyTemplates and Policies. That lets an infra admin put strong bounds around settings, on a per-field level, that give early UX feedback (objects that violate the policy are never accepted). It also lets administrators who need even greater flexibility build that for themselves (if you want to allow a single namespace to set client request timeout to infinity, that's very possible with Gatekeeper).

That's not to say that we shouldn't do what you're saying in the doc (I've taken a quick read, will go over again momentarily), just that we should consider the UX and portability of the API as well.

I agree that the API will never succeed without a relatively standard way for implementations to do their own custom things in many places. But we need to make sure that we have a process for promoting features from ImplementationSpecific -> Extended -> Core, and what that would mean for various conformance profiles (once we have them). In particular, I think that settings like Minimum TLS version are great candidates for a relatively rapid advancement through the levels.

Without a strong focus on promoting common things out of ImplementationSpecific CRDs, then it will be very difficult to get the portabiility gains we're shooting for with this API.

@stevesloka
Copy link
Contributor

In the early days of IngressRoute (which was the predecessor to HTTPProxy, we had similar designs that you have @mark-church, where you could define "defaults" for say a vhost a a high level which defined how all the routes would operator (e.g. timeouts, load balance algo, etc), but then users could have the ability to overwrite if they needed for some specific reason (say one backend was slow and needed larger timeouts for instance).

I'm wondering if the policies should be first applied to GatewayClasses giving you that set of "defaults" that can be defined?

@mark-church
Copy link
Contributor Author

Thanks @stevesloka and @youngnick - on one hand I'm glad that we independently came to similar solutions, on the other hand, I'm bummed to hear that you also found no "slam dunk" proposal. I agree that some imperfect way of extended the API with custom fields is likely better than none, or else implementers will always continue to support their proprietary APIs for the advanced functionality

we need to make sure that we have a process for promoting features from ImplementationSpecific -> Extended -> Core

I agree. If there isn't forward motion of these fields into extended and core then it shows that Gateway API is not evolving. I think there a couple things that we can do to promote this conveyer belt:

  • Document principles/standards around what kind of things should be in custom/extended/core. Here are my personal thoughts on it:
    • If 100% of the Gateway API implementors can support a field consistently then it can be in core
    • If 51% of the Gateway API implementors can support a field in a 100% consistent way, then it can be in extended
    • Custom is for everything else
  • We should also hold regular reviews to go through the requests to see if anything can be moved up on the conveyer belt. Some things will naturally never become core or extended because they are implemented too diversely (timeouts are a good example).

We experimented with a few ways to do this in the API

Are there any proposals that you can share that we can learn from?

That lets an infra admin put strong bounds around settings, on a per-field level, that give early UX feedback (objects that violate the policy are never accepted).

Technically ServicePolicies don't prevent Gatekeeper from being used in this manner. For example, ServicePolicies may be freely configureable at the GatewayClass and Gateway levels, but at the HTTPRoute and Service levels Gatekeeper could be used to totally disallow certain fields or certain values for immediate feedback whenever a ServicePolicy is referenced.

I'm wondering if the policies should be first applied to GatewayClasses giving you that set of "defaults" that can be defined?

Good question, I originally didn't think of ServicePolicies for GatewayClasses, but I do think it makes perfect sense. A ServicePolicy referenced by a GatewayClass is effectively a default that appears intrinsic to the GatewayClass, but ServicePolicies would give us a nice API for making that more formalized and configurable. I think it would be a good way of making default clear without even requiring documentation to know what the defaults are (just look at your GatewayClass). Check out #583 (comment) for an example.

One of the reasons that I proposed policies as a list is so that default policies and enforced policies can both be set on the same resource. I'm not sure if this would be complimentary or a replacement of GatewayClass parameters, but it does provide a very clear way to communicate what is enforced and what is default, as in the following example:

kind: GatewayClass
metadata:
  name: acme-gatewayclass
spec:
  controller: acme.io/gateway-controller
  policies:
  - kind: ServicePolicy
    group: acme.io/v1
    name: enforced-gwc-policy
    policyOverride: false
  - kind: ServicePolicy
    group: acme.io/v1
    name: default-gwc-policy
    policyOverride: true
---
apiVersion: acme.io/v1
kind: ServicePolicy
name: default-gwc-policy
spec: 
  retries:
    attempts: 3
    timeout: 2
  backendTimeout: 5
---
apiVersion: acme.io/v1
kind: ServicePolicy
name: enforced-gwc-policy
spec: 
  minTLSVersion: "1.2"

@youngnick
Copy link
Contributor

Sorry to take a while to come back to this one @mark-church.

  • Document principles/standards around what kind of things should be in custom/extended/core. Here are my personal thoughts on it:

    • If 100% of the Gateway API implementors can support a field consistently then it can be in core
    • If 51% of the Gateway API implementors can support a field in a 100% consistent way, then it can be in extended
    • Custom is for everything else
  • We should also hold regular reviews to go through the requests to see if anything can be moved up on the conveyer belt. Some things will naturally never become core or extended because they are implemented too diversely (timeouts are a good example).

I like those rules, but I think there are some interactions with what parts of the API an implementation will support (which may come under conformance profiles), that will make them more complicated than we have here, I suspect.

Are there any proposals that you can share that we can learn from?

The best one is actually the request timeout. We allowed the setting of the default to whatever the cluster admin wanted, using the global config file. However, we also allowed a per-Route override, which put this back in the hands of the application developers. In the case of the request timeout, there can be conflicting usecases, where the cluster admin doesn't want to allow infinity as a request timeout value, since it leads to connection resources not being garbage collected. But the app dev will find it much easier to set the request timeout to infinity, and not have to worry about it any more.

So, we needed to give the cluster admin a set of global configuration options to set the maximum and minimum allowed request timeouts. Meaning that we now had to keep track of the following request timeout properties:

  • Default value
  • configured value (per-route)
  • Maximum value
  • Minimum value

Whenever we needed to do anything with request timeout, we would need to think about all four values. And that's for one timeout! Envoy has > 10 timeouts, many of which would need similar treatement. On top of all that, the code that was required to validate the timeouts ends up looking very similar to the Rego code that Gatekeeper uses anyway, and the UX is bad because you don't get immediate feedback about if the timeout value you supplied is okay (as an app developer, you have to apply your object, and then check if it's been accepted with a followup kubectl get or similar).

Once we moved to Gatekeeper instead, we got the following:

  • Cluster admins are encouraged to use a tool that solves the exact problem they want (per-field bounds)
  • In a way that gives them flexibility around what the bounds are, and even what fields they want to set
  • We provide them with the initial templates to get started, so they can get up and running quickly
  • App developers get early feedback about their updates.
  • Contour does not have to do a large amount of engineering to build the capability to do all of the above.

The cost is, of course, that advanced users now have to run another thing to get bounds checking for Contour values. We believed that the overhead of Gatekeeper should be less because, if you have the need for bounds checking on individual Ingress objects, you probably have the need for per-field checking on other objects, so you should be running something like it anyway.

Overall though, I can see how the idea hangs together, and it's quite neat. But I worry about how we would be able to keep things portable here, and how we would define the API itself.

For example, you have the policies YAML stanza under gateway.spec, and service.spec. (I'm eliding any difficulties we may have with adding anything to Service for now). What type are they?

It looks like the entries under there are some sort of ObjectReference or similar, so anyone can plug any object in there. In that case, how do we incentivise people to keep things portable? As a Contour maintainer, I'm probably not going to support the cloud.google.com ServicePolicy, right? So I'll end up with my own Kind to go in there, that will most likely be subtly different.

I guess what I'm saying is that, it seems like a huge risk to the portability goals that are at the heart of the Gateway APIs. I understand that having a way to specify common properties without needing to create fields in the upstream API is great, but at the same time we need to ensure there is a little pressure to make your fields standard too.

@robscott robscott added this to the v1alpha2 milestone May 5, 2021
@robscott
Copy link
Member

Thanks for all the great discussion in this thread @youngnick, @stevesloka, and @mark-church! I've got a few follow ups + a new doc for further discussion.

But I worry about how we would be able to keep things portable here, and how we would define the API itself.

Agree that this is a very real concern. As we've shown with things like timeout though, there are at least some concepts that likely can't be portable. As much as I'd like to pull as much as possible to into the API spec, I don't think we should if <50% of implementations will be able to support a concept. That means we need some kind of standard way to extend the API. With Ingress, that was unspecified and resulted in confusing and inconsistent extension mechanisms.

In my mind we need to:

  • Encourage pulling concepts into the API spec that can be portable
  • Provide a standard way to attach policy and extensions when concepts are not portable

As a Contour maintainer, I'm probably not going to support the cloud.google.com ServicePolicy, right? So I'll end up with my own Kind to go in there, that will most likely be subtly different.

I hope if there are any concepts that are only subtly different we'll be able to pull them into a shared policy resource that is included in this API spec (or include the concepts directly on Route or Gateway resources). With that said, I think it's reasonable to expect that each implementation will have some unique config that others will not understand.

I guess what I'm saying is that, it seems like a huge risk to the portability goals that are at the heart of the Gateway APIs. I understand that having a way to specify common properties without needing to create fields in the upstream API is great, but at the same time we need to ensure there is a little pressure to make your fields standard too.

I agree that we need to be very careful here, but I'm also concerned about the lack of guidance/standards here. I think implementation specific policy extensions are going to happen regardlessly. If we don't provide a clear pattern for how that could work, we'll end up with the kind of fragmented/inconsistent extensions that ingress has.


With all that said, I've been working on an updated proposal for how we can attach policy to resources. I've been trying to find the right balance between all of the concepts/hesitation above with the need to have some kind of consistent pattern here. So here's another take on how we could approach policy attachment. Although I tried out a few different approaches (also covered in the doc) I ended up with something that was largely similar to merging how BackendPolicy is attached with the Service Policy proposal @mark-church linked to start this issue.

@robscott robscott linked a pull request Jul 28, 2021 that will close this issue
@hbagdi
Copy link
Contributor

hbagdi commented Aug 3, 2021

xref #98 #99 #91 #89 #114 #97 #611

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants