Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

upstream: Implement retry concurrency budgets #9069

Merged
merged 77 commits into from
Jan 7, 2020

Conversation

tonya11en
Copy link
Member

@tonya11en tonya11en commented Nov 19, 2019

This patch implements RetryBudget, a new configuration option in the retry policy that limits allowed outstanding retries. The budget is specified as a percentage of the current outstanding requests/connections. Each retry budget must specify a minimum concurrency value to the budgets, so that one can retry when there are low concurrency values. This configuration is optional and will not change existing behavior unless configured. If configured, the retry budgets will override any max_retries circuit breaker configuration.

For example, a budget of 20% with a minimum retry concurrency of 3 will allow 5 active retries while there are 25 active requests. If there are 2 active requests, there are still 3 active retries allowed because of the minimum retry concurrency.

This approach to limiting retries and mitigating retry storms is useful, because it allows one to think in terms of worst-case increases in traffic due to retries. Retry circuit breakers are fixed values and can potentially limit retries when the mesh can handle the increased volume. This is especially beneficial for adaptive concurrency for high RPS services.

Fixes: #8727
Risk Level: Low
Testing: Unit tests
Docs Changes: Stats and proto.

tonya11en and others added 7 commits November 12, 2019 22:25
Signed-off-by: Tony Allen <[email protected]>
Signed-off-by: Tony Allen <[email protected]>
Signed-off-by: Tony Allen <[email protected]>
Signed-off-by: Tony Allen <[email protected]>
Signed-off-by: Tony Allen <[email protected]>
Signed-off-by: Tony Allen <[email protected]>
Signed-off-by: Tony Allen <[email protected]>
@repokitteh-read-only
Copy link

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/.

🐱

Caused by: #9069 was opened by tonya11en.

see: more, trace.

Signed-off-by: Tony Allen <[email protected]>
@tonya11en
Copy link
Member Author

/retest

@repokitteh-read-only
Copy link

🔨 rebuilding ci/circleci: coverage (failed build)

🐱

Caused by: a #9069 (comment) was created by @tonya11en.

see: more, trace.

@mattklein123 mattklein123 self-assigned this Nov 20, 2019
@tonya11en
Copy link
Member Author

/retest

@repokitteh-read-only
Copy link

🔨 rebuilding ci/circleci: coverage (failed build)

🐱

Caused by: a #9069 (comment) was created by @tonya11en.

see: more, trace.

Copy link
Contributor

@snowp snowp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good, just a couple comments


// If a retry budget was configured, we cannot exceed the configured percentage of total
// outstanding requests/connections.
const uint64_t current_active = cluster_.resourceManager(priority_).connections().count() +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a bit iffy to me: today kinda this works before for H/1.1 and H/2 make use of either max_connections OR max_requests so this works, but there are requested features that might introduce max_connections to H/2 as well (#7403). Envoy also doesn't reclaim H/1.1 connections until it needs to, so I think using max_connections to get a sense for the current concurrency isn't perfect.

I think ideally we'd be counting # of requests for HTTP/1.1 and just do max_requests + max_pending, but I don't think we can just add that enforcement safely at this point (since users might have bumped max_con but not max_req). Maybe just note this limitation in the docs for now?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted in the docs. Let me know if the latest patch communicates this well enough.

source/common/router/retry_state_impl.cc Outdated Show resolved Hide resolved
Signed-off-by: Tony Allen <[email protected]>
include/envoy/router/router.h Outdated Show resolved Hide resolved
include/envoy/router/router.h Outdated Show resolved Hide resolved
Signed-off-by: Tony Allen <[email protected]>
@repokitteh-read-only
Copy link

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/.

🐱

Caused by: #9069 was synchronize by tonya11en.

see: more, trace.

@tonya11en
Copy link
Member Author

/wait-any

Signed-off-by: Tony Allen <[email protected]>
@mattklein123
Copy link
Member

Apologies but can you merge master and format one more time? It should fix all the CI/random format issues.

/wait

@mattklein123
Copy link
Member

Sorry needs another master merge.

/wait

@tonya11en
Copy link
Member Author

tonya11en commented Jan 3, 2020 via email

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo merge issue. Great work!

/wait

test/integration/stats_integration_test.cc Outdated Show resolved Hide resolved
Signed-off-by: Tony Allen <[email protected]>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@@ -314,7 +315,8 @@ TEST_P(ClusterMemoryTestRunner, MemoryLargeClusterSizeWithRealSymbolTable) {
// 2019/11/01 8859 35221 36000 build: switch to libc++ by default
// 2019/11/15 9040 35029 35500 build: update protobuf to 3.10.1
// 2019/11/15 9040 35061 35500 upstream: track whether cluster is local
// 2019/12/10 8779 35053 35000 use var-length coding for name lengths
// 2019/12/20 8779 35053 35000 use var-length coding for name lengths
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still a merge issue. To avoid going back and forth on this again can you fix this in your next change?

@repokitteh-read-only repokitteh-read-only bot removed the api label Jan 7, 2020
@mattklein123 mattklein123 merged commit 3ed917f into envoyproxy:master Jan 7, 2020
zuercher added a commit to zuercher/envoy that referenced this pull request Jan 8, 2020
Since envoyproxy#9069, macOS builds have been failing because they use slightly
more memory for stats than the new limit. Bump limit to next even
multiple of 1000.

Risk Level: low, test change only
Docs Changes: n/a
Release Notes: n/a

Signed-off-by: Stephan Zuercher <[email protected]>
@tonya11en tonya11en deleted the percentage_retries branch February 19, 2020 00:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Percentage based retry circuit breaker
6 participants