upstream: Implement retry concurrency budgets #9069

tonya11en · 2019-11-19T03:19:15Z

This patch implements RetryBudget, a new configuration option in the retry policy that limits the number of outstanding retries. The budget is specified as a percentage of the current outstanding requests/connections. Each retry budget must also specify a minimum retry concurrency, so that retries remain possible when overall concurrency is low. This configuration is optional and will not change existing behavior unless configured. If configured, the retry budget overrides any max_retries circuit breaker configuration.

For example, a budget of 20% with a minimum retry concurrency of 3 will allow 5 active retries while there are 25 active requests. If there are 2 active requests, there are still 3 active retries allowed because of the minimum retry concurrency.
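The arithmetic in the example above can be sketched as follows. This is a minimal illustration, not the code from this patch: the function name `allowedRetries` and its parameter names are hypothetical, and it assumes the budget is the larger of the percentage-based value (truncated to a whole number) and the minimum retry concurrency.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper for illustration only; not Envoy's implementation.
// active_requests:       current number of outstanding requests/connections.
// budget_percent:        share of active requests that may be retries (0-100).
// min_retry_concurrency: floor that applies when concurrency is low.
uint64_t allowedRetries(uint64_t active_requests, double budget_percent,
                        uint64_t min_retry_concurrency) {
  // Percentage-based budget, truncated toward zero.
  const uint64_t from_budget =
      static_cast<uint64_t>(active_requests * budget_percent / 100.0);
  // The minimum retry concurrency wins whenever the percentage yields less.
  return std::max(from_budget, min_retry_concurrency);
}
```

With a 20% budget and a minimum of 3, `allowedRetries(25, 20.0, 3)` yields 5 and `allowedRetries(2, 20.0, 3)` yields 3, matching the two cases described above.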

This approach to limiting retries and mitigating retry storms is useful because it lets one reason about the worst-case increase in traffic due to retries. Retry circuit breakers are fixed values and can limit retries even when the mesh could handle the increased volume. This is especially beneficial for adaptive concurrency in high-RPS services.

Fixes: #8727
Risk Level: Low
Testing: Unit tests
Docs Changes: Stats and proto.

Signed-off-by: Tony Allen <[email protected]>

repokitteh-read-only · 2019-11-19T03:19:20Z

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/.

🐱

Caused by: #9069 was opened by tonya11en.

see: more, trace.

api/envoy/api/v2/route/route.proto

Signed-off-by: Tony Allen <[email protected]>

tonya11en · 2019-11-19T20:36:25Z

/retest

repokitteh-read-only · 2019-11-19T20:36:30Z

🔨 rebuilding ci/circleci: coverage (failed build)

🐱

Caused by: a #9069 (comment) was created by @tonya11en.

see: more, trace.

api/envoy/api/v2/route/route.proto

tonya11en · 2019-11-20T23:26:45Z

/retest

repokitteh-read-only · 2019-11-20T23:26:50Z

🔨 rebuilding ci/circleci: coverage (failed build)

🐱

Caused by: a #9069 (comment) was created by @tonya11en.

see: more, trace.

Signed-off-by: Tony Allen <[email protected]>

snowp

Looks pretty good, just a couple comments

snowp · 2019-11-23T20:10:30Z

source/common/router/retry_state_impl.cc

+
+  // If a retry budget was configured, we cannot exceed the configured percentage of total
+  // outstanding requests/connections.
+  const uint64_t current_active = cluster_.resourceManager(priority_).connections().count() +


This seems a bit iffy to me. Today this kind of works because H/1.1 and H/2 make use of either max_connections OR max_requests, but there are requested features that might introduce max_connections to H/2 as well (#7403). Envoy also doesn't reclaim H/1.1 connections until it needs to, so I think using max_connections to get a sense for the current concurrency isn't perfect.

I think ideally we'd be counting # of requests for HTTP/1.1 and just do max_requests + max_pending, but I don't think we can just add that enforcement safely at this point (since users might have bumped max_con but not max_req). Maybe just note this limitation in the docs for now?

Noted in the docs. Let me know if the latest patch communicates this well enough.

source/common/router/retry_state_impl.cc

Signed-off-by: Tony Allen <[email protected]>

include/envoy/router/router.h

Signed-off-by: Tony Allen <[email protected]>

repokitteh-read-only · 2019-11-28T08:28:32Z

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/.

🐱

Caused by: #9069 was synchronize by tonya11en.

see: more, trace.

Signed-off-by: Tony Allen <[email protected]>

tonya11en · 2019-12-29T22:07:45Z

/wait-any

Signed-off-by: Tony Allen <[email protected]>

mattklein123 · 2019-12-31T16:01:04Z

Apologies but can you merge master and format one more time? It should fix all the CI/random format issues.

/wait

Signed-off-by: Tony Allen <[email protected]>

mattklein123 · 2020-01-03T16:20:26Z

Sorry needs another master merge.

/wait

tonya11en · 2020-01-03T21:53:50Z

I’ll just wait until we’re all back in the office to avoid further conflicts.

Signed-off-by: Tony Allen <[email protected]>

mattklein123

LGTM modulo merge issue. Great work!

/wait

test/integration/stats_integration_test.cc

Signed-off-by: Tony Allen <[email protected]>

mattklein123

Thanks!

mattklein123 · 2020-01-07T16:34:29Z

test/integration/stats_integration_test.cc

@@ -314,7 +315,8 @@ TEST_P(ClusterMemoryTestRunner, MemoryLargeClusterSizeWithRealSymbolTable) {
  // 2019/11/01  8859     35221       36000   build: switch to libc++ by default
  // 2019/11/15  9040     35029       35500   build: update protobuf to 3.10.1
  // 2019/11/15  9040     35061       35500   upstream: track whether cluster is local
-  // 2019/12/10  8779     35053       35000   use var-length coding for name lengths
+  // 2019/12/20  8779     35053       35000   use var-length coding for name lengths


This is still a merge issue. To avoid going back and forth on this again can you fix this in your next change?

Since envoyproxy#9069, macOS builds have been failing because they use slightly more memory for stats than the new limit. Bump limit to next even multiple of 1000.

Risk Level: low, test change only
Docs Changes: n/a
Release Notes: n/a

Signed-off-by: Stephan Zuercher <[email protected]>

tonya11en and others added 7 commits November 12, 2019 22:25

proto

ee13f16

Signed-off-by: Tony Allen <[email protected]>

first pass

413a55a

Signed-off-by: Tony Allen <[email protected]>

basic tests

43804fd

Signed-off-by: Tony Allen <[email protected]>

cleanup tests

c43f0cd

Signed-off-by: Tony Allen <[email protected]>

docs and stat

24e247d

Signed-off-by: Tony Allen <[email protected]>

more docs

86cd4d5

Signed-off-by: Tony Allen <[email protected]>

remove notes to self

eb75475

Signed-off-by: Tony Allen <[email protected]>

repokitteh-read-only bot added the api label Nov 19, 2019

ramaraochavali reviewed Nov 19, 2019

api/envoy/api/v2/route/route.proto (outdated)

format fixes

4a4e502

Signed-off-by: Tony Allen <[email protected]>

htuch reviewed Nov 20, 2019

api/envoy/api/v2/route/route.proto (outdated)

junr03 assigned snowp Nov 20, 2019

mattklein123 self-assigned this Nov 20, 2019

Tony Allen added 3 commits November 20, 2019 16:03

Kick CI

38c53c8

Signed-off-by: Tony Allen <[email protected]>

Merge remote-tracking branch 'upstream/master' into percentage_retries

0970941

memory test

d5ec251

Signed-off-by: Tony Allen <[email protected]>

snowp suggested changes Nov 23, 2019

snow comments

8d8cde0

Signed-off-by: Tony Allen <[email protected]>

snowp suggested changes Nov 27, 2019

include/envoy/router/router.h (outdated)

include/envoy/router/router.h (outdated)

snow struct

e74e782

Signed-off-by: Tony Allen <[email protected]>

tonya11en added 4 commits November 28, 2019 13:06

ref

b11c614

Signed-off-by: Tony Allen <[email protected]>

Merge remote-tracking branch 'upstream/master' into percentage_retries

87ca3a8

Override retry CB if budgets configured

9a0c03f

Signed-off-by: Tony Allen <[email protected]>

format

f9137c9

Signed-off-by: Tony Allen <[email protected]>

Merge remote-tracking branch 'upstream/master' into percentage_retries

94f63e8

Signed-off-by: Tony Allen <[email protected]>

repokitteh-read-only bot removed the waiting label Dec 29, 2019

repokitteh-read-only bot added the waiting:any label Dec 29, 2019

format

4fb492a

Signed-off-by: Tony Allen <[email protected]>

repokitteh-read-only bot removed the waiting:any label Dec 29, 2019

repokitteh-read-only bot added the waiting label Dec 31, 2019

tonya11en added 2 commits December 31, 2019 13:24

Merge remote-tracking branch 'upstream/master' into percentage_retries

005317c

Signed-off-by: Tony Allen <[email protected]>

proto format

dc5f95e

Signed-off-by: Tony Allen <[email protected]>

repokitteh-read-only bot removed the waiting label Dec 31, 2019

repokitteh-read-only bot added the waiting label Jan 3, 2020

tonya11en added 2 commits January 6, 2020 10:57

Merge remote-tracking branch 'upstream/master' into percentage_retries

068e0e7

Signed-off-by: Tony Allen <[email protected]>

format

2446f7b

Signed-off-by: Tony Allen <[email protected]>

repokitteh-read-only bot removed the waiting label Jan 6, 2020

mattklein123 requested changes Jan 6, 2020

test/integration/stats_integration_test.cc (outdated)

repokitteh-read-only bot added the waiting label Jan 6, 2020

nits

da5fe31

Signed-off-by: Tony Allen <[email protected]>

repokitteh-read-only bot removed the waiting label Jan 7, 2020

mattklein123 approved these changes Jan 7, 2020

repokitteh-read-only bot removed the api label Jan 7, 2020

mattklein123 merged commit 3ed917f into envoyproxy:master Jan 7, 2020

jmarantz mentioned this pull request Jan 8, 2020

stats_integration_test is failing on Mac #9605

Closed

zuercher mentioned this pull request Jan 8, 2020

tests: fix failing stats integration test on macOS #9607

Closed

tonya11en deleted the percentage_retries branch February 19, 2020 00:28

KBaichoo mentioned this pull request Nov 20, 2023

What will be overridden by retry_budget in config.cluster.v3.CircuitBreakers.Thresholds #30974

Closed
