Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

api: support retry on in BackendTrafficPolicy #2168

Closed
wants to merge 5 commits into from

Conversation

tmsnan
Copy link
Contributor

@tmsnan tmsnan commented Nov 9, 2023

api: support retry on in BackendTrafficPolicy

Relates to #1821 & #1907

@tmsnan tmsnan requested a review from a team as a code owner November 9, 2023 07:04
@tmsnan tmsnan marked this pull request as draft November 9, 2023 07:12
Copy link

codecov bot commented Nov 9, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (20b8497) 64.71% compared to head (d3057c4) 64.71%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2168   +/-   ##
=======================================
  Coverage   64.71%   64.71%           
=======================================
  Files         115      115           
  Lines       17431    17431           
=======================================
  Hits        11281    11281           
  Misses       5433     5433           
  Partials      717      717           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@tmsnan tmsnan marked this pull request as ready for review November 9, 2023 09:19
@arkodg
Copy link
Contributor

arkodg commented Nov 10, 2023

@kflynn and @AliceProxy, can you help review this one

@kflynn
Copy link
Contributor

kflynn commented Nov 14, 2023

@arkodg @tmsnan Sorry for the delay here! and many thanks for digging into this, I appreciate that. 🙂

That said, I'd like to formalize something I've said on many other occasions:

Flynn's First Law of Envoy: Having end users interact directly with Envoy configuration concepts is almost always a huge mistake. 🙂

In Linkerd, the basic retry configuration is one line: isRetryable: true. You can go beyond that, but you don't have to. In Emissary, it's a little more complex than that, but it's still only... three lines, I think? The fact that Envoy has a ton of options is nearly always irrelevant to the user; let's not force them to use configuration structures that look anything like Envoy.

My strong preference here is that the base case should be a single line of configuration: to get sane retries with good defaults, the user should need only to add isRetryable: true, or something very similar, at the top level of the BackendTrafficPolicy. Having a separate stanza for configuring policy is entirely reasonable but it should be optional.

It's certainly possible that I missed something here, and if so I apologize! 🙂 But in Envoy Gateway's name, the "Gateway" word is the really important bit, right? 🙂

@tmsnan
Copy link
Contributor Author

tmsnan commented Nov 16, 2023

@arkodg @tmsnan Sorry for the delay here! and many thanks for digging into this, I appreciate that. 🙂

That said, I'd like to formalize something I've said on many other occasions:

Flynn's First Law of Envoy: Having end users interact directly with Envoy configuration concepts is almost always a huge mistake. 🙂

In Linkerd, the basic retry configuration is one line: isRetryable: true. You can go beyond that, but you don't have to. In Emissary, it's a little more complex than that, but it's still only... three lines, I think? The fact that Envoy has a ton of options is nearly always irrelevant to the user; let's not force them to use configuration structures that look anything like Envoy.

My strong preference here is that the base case should be a single line of configuration: to get sane retries with good defaults, the user should need only to add isRetryable: true, or something very similar, at the top level of the BackendTrafficPolicy. Having a separate stanza for configuring policy is entirely reasonable but it should be optional.

It's certainly possible that I missed something here, and if so I apologize! 🙂 But in Envoy Gateway's name, the "Gateway" word is the really important bit, right? 🙂

  1. IMO, the RetryStrategy in BackendTrafficPolicy offers a more comprehensive range of options. Users can customize the retry backoff strategy, retry budget strategy, and retry triggering conditions. However, these advanced features may be too complex for typical Gateway(Not envoy) users. We need to consider whether to fully retain these higher-order uses.

  2. It is indeed true that most users do not customize the configuration and prefer to use the recommended settings provided by the gateway to enable the retry feature. I agree with this. Perhaps we could also keep the above-mentioned API and add a toggle switch to enable the commonly used retry function. @arkodg @kflynn

@Xunzhuo Xunzhuo added this to the v1.0.0-rc1 milestone Dec 6, 2023
@kflynn
Copy link
Contributor

kflynn commented Dec 7, 2023

@tmsnan, so sorry for the delay (again)!

In practice, we haven't really found that everything needs support, and many things just don't get used all that often. Here's an alternative that deliberately tries to minimize the amount of indentation and configuration needed for the things I think are very common:

Suggestion

---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: example-policy
spec:
  ...
  isRetryable: true
  retryStrategy:
    maxBudget: 20    # percentage, default 20%
    maxRetries: 0    # count, if nonzero, maxBudget is ignored
    maxParallel: 3   # Optional
    minConcurrent: 3 # Optional

    retryOn:
      trigger: trigger-enum (see below)
      httpStatus: []int # valid when trigger is "retriable-status-code"

    perRetry:
      timeout: duration # optional
      idleTimeout: duration # optional
      backoff: # optional
        baseInterval: duration # required
        maxInterval: duration # optional
        # we can add rate limited based backoff config here if we want to

trigger-enum: one of
  5xx                       # HTTP events
  gateway-error             
  disconnect-reset
  connect-fail
  retriable-4xx
  refused-stream
  retriable-status-codes

  cancelled                 # GRPC events
  deadline-exceeded
  internal
  resource-exhausted
  unavailable

The idea here is that the proxy should know whether it's speaking HTTP or gRPC to a given backend. There's no point in configuring HTTP retries for a gRPC backend, and minimal config shouldn't require rewriting the retry config if you change types. (Obviously if you change protocols when you have configured a trigger, you'd need to change that.)

Examples

These are not meant to be exhaustive; they're meant to show specific cases that might come up.

Minimal config

---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: example-policy
spec:
  isRetryable: true

Just give me retries and don't make me think about it. This will enable budgeted retries with a 20% retry budget. (Why budgeted and not counted? because the user experience is better for really simple configurations.)

Minimal counted retries

---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: example-policy
spec:
  isRetryable: true
  retryStrategy:
    maxRetries: 1

Retry, but only once. (It might be worth considering whether the presence of retryStrategy implies isRetryable: true...)

Retry only on 499 or 505 (I don't recommend this)

---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: example-policy
spec:
  isRetryable: true
  retryStrategy:
    retryOn:
      trigger: retriable-status-codes
      httpStatus: [ 499, 505 ]

Budgeted retries, but only retry on 499 or 505 status codes.

Retry, controlling timing, on gRPC cancelled.

---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: example-policy
spec:
  isRetryable: true
  retryStrategy:
    retryOn:
      trigger: cancelled
    perRetry:
      timeout: 250ms
      idleTimeout: 5s
      backoff:
        baseInterval: 100ms
        maxInterval: 10s

Summary

This is the kind of thing I mean when I say that we don't need to expose all the Envoy config. Envoy's API almost always considers every possible configuration as equal weight, as well as allowing a certain amount of configuration that's kind of nonsensical -- in practice people do not mix HTTP and gRPC to the same endpoint, for example, so configuring both at the same time just doesn't make much sense. We can - and should - be more opinionated, so that the common things are easy.

In, of course, my opinion. 🙂

@tmsnan
Copy link
Contributor Author

tmsnan commented Dec 11, 2023

@kflynn, thanks, the API looks much simpler, and there's really no need to distinguish between HTTP and gRPC retry trigger conditions. It's a minor point that we might need to configure multiple conditions in the trigger simultaneously.

apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: example-policy
spec:
  isRetryable: true
  retryStrategy:
    retryOn:
      trigger:  gateway-error, retriable-4xx

I will update the API later.

@tmsnan tmsnan force-pushed the dev-support-retry-on branch from 01c7a9e to 24e2f66 Compare December 15, 2023 03:12
Copy link

github-actions bot commented Dec 15, 2023

@github-actions github-actions bot temporarily deployed to pull request December 15, 2023 03:15 Inactive
@github-actions github-actions bot temporarily deployed to pull request December 15, 2023 06:44 Inactive

// IsRetryAble indicates whether to enable budget retries with a 20% retry budget.
// +optional
IsRetryAble bool `json:"isRetryAble,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer we didn't include that and instead relied only on Retry

// RetryStrategy provides more advanced retry usage, which allows users to customize the retry method (number of retries、retry budget and concurrency max retries),
// retry fallback strategy, and retry triggering conditions
// +optional
RetryStrategy *RetryStrategy `json:"retryStrategy,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we rename to retry ? all fields in here are singular

// NumRetries is the number of retries to be attempted. Defaults to 0. If nonzero, maxBudget is ignored.
//
// +optional
MaxRetries *int `json:"maxRetries,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this numRetries or maxRetries ?


// RetryOn specifies the retry trigger condition.
//
// +optional
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets add some defaults ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

like istio?

    retryOn:
      trigger: retriable-status-codes,connection-failure,  refused-stream, unavailable,cancelled
      httpStatus: [ 503 ]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah as well as numRetries

tmsnan and others added 3 commits January 11, 2024 10:43
Signed-off-by: nan zhao <[email protected]>
Signed-off-by: zhaonan <[email protected]>
Signed-off-by: nan zhao <[email protected]>
@tmsnan tmsnan force-pushed the dev-support-retry-on branch from 3c9b667 to 1a24ca4 Compare January 11, 2024 03:02
@arkodg
Copy link
Contributor

arkodg commented Jan 18, 2024

hey @tmsnan, sorry for the delay in review

  1. can we rm MaxBudget , MaxParallel & MinConcurrent in this first iteration
  2. can we decide on sane defaults, so when a user sets
retry: {}

it results in

  • NumRetries - X
  • RetryOn.Triggers - [......]

Signed-off-by: nan zhao <[email protected]>
Signed-off-by: nan zhao <[email protected]>
@tmsnan
Copy link
Contributor Author

tmsnan commented Jan 19, 2024

hey @tmsnan, sorry for the delay in review

  1. can we rm MaxBudget , MaxParallel & MinConcurrent in this first iteration
  2. can we decide on sane defaults, so when a user sets
retry: {}

it results in

  • NumRetries - X
  • RetryOn.Triggers - [......]

done

}

// TriggerEnum specifies the conditions that trigger retries.
type TriggerEnum string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs a validation annotation

//
// +optional
// +kubebuilder:validation:Format=duration
IdleTimeout *metav1.Duration `json:"idleTimeout,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we also rm IdleTimeout in the first iteration ?

@arkodg
Copy link
Contributor

arkodg commented Feb 13, 2024

hey @tmsnan, this API is pretty close to the finish line, can you address the review comments ? tia

//
// +optional
// +kubebuilder:default=2
NumRetries *int `json:"numRetries,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe int32 is more explicit. Also, we can validate that the value is >= 0.

// HttpStatusCodes specifies the http status codes to be retried.
//
// +optional
HTTPStatusCodes []int `json:"httpStatusCodes,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe validate status in accepted range 1xx - 5xx

@tmsnan
Copy link
Contributor Author

tmsnan commented Feb 18, 2024

hey @tmsnan, this API is pretty close to the finish line, can you address the review comments ? tia

Sorry for the late reply, there have been some family matters recently, and I will handle them soon.

@arkodg
Copy link
Contributor

arkodg commented Feb 19, 2024

hey @tmsnan, this API is pretty close to the finish line, can you address the review comments ? tia

Sorry for the late reply, there have been some family matters recently, and I will handle them soon.

Sorry to hear @tmsnan.
No rush to pick up this PR, @guydc can help move this PR forward

@guydc
Copy link
Contributor

guydc commented Feb 23, 2024

Closing as this work was merged in #2657

@guydc guydc closed this Feb 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

5 participants