
Add timeouts to destination connection details #89

Closed
hbagdi opened this issue Feb 13, 2020 · 18 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@hbagdi
Contributor

hbagdi commented Feb 13, 2020

What would you like to be added:

The destination service configuration should include connect, read, and write timeouts for the proxy to use, in addition to the port (#48) and service name. This should be per-service configuration, since services of different types can require different timeouts based on their workloads.
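
For illustration, such per-service configuration might look roughly like this (a sketch only; the field names are hypothetical, not a concrete API proposal):

```yaml
# Hypothetical sketch; field names are illustrative, not a proposed API.
forwardTo:
- serviceName: payments
  port: 8080          # per-service port, as discussed in #48
  timeouts:
    connect: 5s       # max time to establish the TCP connection
    read: 30s         # max time to wait while reading a response
    write: 30s        # max time allowed for writing the request upstream
```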

Why is this needed:

All proxy and load balancer implementations have timeouts that can be defined for establishing a TCP connection, as well as read and write timeouts, sometimes referred to as idle timeouts.

/kind feature

@hbagdi hbagdi added the kind/feature Categorizes issue or PR as related to a new feature. label Feb 13, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 14, 2020
@hbagdi
Contributor Author

hbagdi commented May 14, 2020

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 14, 2020
@hbagdi
Contributor Author

hbagdi commented May 14, 2020

xref #184

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 12, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 11, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@robscott
Member

/reopen
/remove-lifecycle rotten

@k8s-ci-robot
Contributor

@robscott: Reopened this issue.

In response to this:

/reopen
/remove-lifecycle rotten

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Oct 11, 2020
@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Oct 11, 2020
@hbagdi
Contributor Author

hbagdi commented Dec 7, 2020

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Dec 7, 2020
@robscott robscott added this to the v0.3.0 milestone Mar 24, 2021
@robscott robscott added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Apr 14, 2021
@robscott
Member

robscott commented Apr 21, 2021

I've spent some time digging into just how portable this could be, and unfortunately there is quite a bit of variation in how different projects handle it. One of the most broadly supported concepts is that of a request or read timeout; here are some examples of how it is defined:

  • GCP: This is the amount of time that the load balancer waits for a backend to return a full response to a request.
  • Envoy: The amount of time that Envoy will wait for the entire request to be received. The timer is activated when the request is initiated, and is disarmed when the last byte of the request is sent upstream (i.e. all decoding filters have processed the request), OR when the response is initiated.
  • Traefik: The maximum duration for reading the entire request, including the body.
  • HAProxy: The maximum accepted time to receive a complete HTTP request without affecting the client timeout… By default, this timeout only applies to the header part of the request, and not to any data. As soon as the empty line is received, this timeout is not used anymore. When combined with “option http-buffer-request”, this timeout also applies to the body of the request.
  • Skipper: Skipper responds with 504 Gateway Timeout status if obtaining a connection, sending the request, and reading the backend response headers and body takes longer than the configured timeout

At first glance, there appears to be some level of potential portability here, but even these relatively similar definitions diverge in what they actually measure.

When we look a bit more broadly, implementations like nginx and caddyserver have very different timeout configurations that simply would not be portable with any of the definitions above.

I think @hbagdi summarized this well in a related Slack thread today:

  • Most implementations define timeouts in very different ways.
  • Timeouts also change much less often, sometimes never.
  • Users want a way to configure timeouts. If they are not portable, that's fine, but they would like to keep timeouts separate from the portable configuration. In other words: I can reuse most of my xRoute resources and only have to tweak a couple of resources (BackendPolicy, ServicePolicy, or similar) when migrating from one implementation to another (or using two at the same time).

What we are trying to do here is create a higher-order abstraction in the API to achieve portability. But as we have just seen, full portability is simply not achievable here, so that approach won't work.

And I don't think we even need it.

What Gateway API aims to provide is a consistent "experience". That doesn't require being able to take the same set of configuration and move it from one implementation to another; that is a pipe dream (IMHO).

What would a consistent experience look like? A way to configure implementation-specific timeouts in a consistent manner, using the same location/resource in the API.

I think that should be enough because that gives everyone a good trade-off between maintainability, portability and consistency.
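
As a purely illustrative sketch, that consistent location could be an implementation-specific policy resource attached to a backend. Every name below is hypothetical:

```yaml
# Hypothetical sketch; the kind, group, and fields are all illustrative.
apiVersion: timeouts.example.io/v1alpha1
kind: BackendTimeoutPolicy
metadata:
  name: payments-timeouts
spec:
  targetRef:            # the consistent "location": a reference to the backend
    kind: Service
    name: payments
  # Implementation-specific fields; not expected to be portable.
  requestTimeout: 30s
  idleTimeout: 60s
```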

I'm interested in some additional perspective here - did we miss any paths to portability here? If not, maybe we should close out this issue?

/cc @bowei @danehans @hbagdi @jpeach @youngnick

@youngnick
Contributor

I think the way @hbagdi laid it out is both clear and correct. There's just too much variability here to really be able to make these settings portable.

But the other part of what he said is probably addressable under this issue, namely having a single spot in the API where you put timeouts, and a standard shape of thing to put there, probably a CRD per underlying implementation or something? That would let an implementation easily check whether it supports a particular timeout set (or can translate the timeouts), and either way it keeps the config confined to one spot.

@bowei
Contributor

bowei commented Apr 21, 2021

One question: how many different types of timeouts are supported by each of the variants? For example, does it suffice to say:

ReadTimeout
WriteTimeout

but leave the specifics (exactly which byte qualifies) up to the implementation?

Do we need to be 100% prescriptive, or is a high-level definition still useful, with the user pointed to implementation-specific documentation when they have to get into the weeds?
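
A minimal sketch of that shape (field names are only illustrative; the exact semantics would be left to each implementation):

```yaml
# Illustrative only: generic fields whose precise byte-level semantics
# (when the timer starts and stops) are left to the implementation.
timeouts:
  readTimeout: 30s    # roughly: max time to receive data from the peer
  writeTimeout: 30s   # roughly: max time to send data to the peer
```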

@robscott robscott removed this from the v0.3.0 milestone Apr 23, 2021
@youngnick
Contributor

I think @robscott's review showed that even the definitions of what a particular timeout is vary widely between implementations. So I don't know if even that level of generic timeout is possible (because different implementations may have several timeouts that effectively add up to that one timeout, or something similar).

@jpeach
Contributor

jpeach commented Apr 29, 2021

Timeouts in general are incompatible across implementations, but there are some general use cases that might help us:

  • is this a streaming endpoint?
  • is this a request/response endpoint?
  • is this a large content upload/download endpoint?

Orthogonal to these issues is how to handle idle sessions.

So maybe an approach for the portable API is to represent timeouts as profiles that an implementation can use to do the right thing (or expand into something specific)?
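
A sketch of what that profiles idea might look like (the profile names are hypothetical; each implementation would expand a profile into its own concrete timeout settings):

```yaml
# Hypothetical sketch; profile names are illustrative.
forwardTo:
- serviceName: video-feed
  port: 8080
  timeoutProfile: streaming   # long-lived connection, generous idle timeout
# Other profiles might include: request-response, bulk-transfer.
```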

@hbagdi
Contributor Author

hbagdi commented Aug 23, 2021

#713 covers this in some sense.
Safe to close?
/cc @robscott @jpeach @youngnick

@robscott
Member

Yeah I think that's probably the most fitting solution here.

/close

@k8s-ci-robot
Contributor

@robscott: Closing this issue.

In response to this:

Yeah I think that's probably the most fitting solution here.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

jaison-tiu pushed a commit to jaison-tiu/gateway-api that referenced this issue Apr 14, 2022