
add retry limiter to backoff function #1478

Open · wants to merge 3 commits into base: master

Conversation

@Tema (Contributor) commented Oct 16, 2024

The current backoff policy does not help much to limit TiDB retries to retrieve Regions from PD when there are issues with Region metadata in PD:

[image]

This PR adds the ability to configure a global retry limiter for the Backoff function on a per-config basis. It also creates a new Backoff config dedicated to PD Region metadata calls, which will be used in TiDB in a separate PR:

BoPDRegionMetadata = NewConfigWithRetryLimit("pdRegionMetadata", &metrics.BackoffHistogramPD, NewBackoffFnCfg(500, 3000, EqualJitter), NewRetryRateLimiter(10, 0.1), tikverr.NewErrPDServerTimeout(""))

The config above allows a single retry for every 10 previous successful calls (0.1), but limits the overall retry budget to 10. It always starts with a full budget of retries.
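
To make these semantics concrete, here is a minimal, self-contained sketch of such a limiter with illustrative names (this is not the code in the diff): a token bucket that starts full at cap, earns roughly ratio tokens per successful call, and refuses further retries once the budget is spent.

```go
package backoff

import (
	"math/rand"
	"sync/atomic"
)

// retryRateLimiterSketch mirrors the semantics described above with
// illustrative names: it starts with a full budget of cap retry tokens and
// earns new tokens at roughly ratio tokens per successful call
// (0.1 == one retry per ten successes), never banking more than cap.
type retryRateLimiterSketch struct {
	tokens int32   // current retry budget
	cap    int32   // maximum number of banked retry tokens
	ratio  float32 // allowed retries per successful call
}

func newRetryRateLimiterSketch(cap int32, ratio float32) *retryRateLimiterSketch {
	// Start with a full budget so an initial burst of failures can still retry.
	return &retryRateLimiterSketch{tokens: cap, cap: cap, ratio: ratio}
}

// onSuccess banks, on average, ratio retry tokens per successful call.
func (r *retryRateLimiterSketch) onSuccess() {
	if rand.Float32() >= r.ratio {
		return
	}
	for {
		cur := atomic.LoadInt32(&r.tokens)
		if cur >= r.cap {
			return
		}
		if atomic.CompareAndSwapInt32(&r.tokens, cur, cur+1) {
			return
		}
	}
}

// allowRetry consumes a token if one is available; once the budget is spent
// the caller should stop retrying instead of backing off again.
func (r *retryRateLimiterSketch) allowRetry() bool {
	for {
		cur := atomic.LoadInt32(&r.tokens)
		if cur <= 0 {
			return false
		}
		if atomic.CompareAndSwapInt32(&r.tokens, cur, cur-1) {
			return true
		}
	}
}
```

With the (10, 0.1) values above, a failure spree can consume at most the 10 banked tokens, after which retries trickle in at roughly one per ten successful calls.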

@ti-chi-bot ti-chi-bot bot added the dco-signoff: yes Indicates the PR's author has signed the dco. label Oct 16, 2024

ti-chi-bot bot commented Oct 16, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jackysp for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 16, 2024
@cfzjywxk (Contributor) commented Oct 18, 2024

@Tema
Thanks for helping with the improvements.

It's OK to introduce a rate-limiting mechanism for the kv-client. Finding the optimal balance between error handling, retry success, and avoiding overloading PD could be challenging, as could selecting a suitable default value that works for most scenarios.

Another approach is similar to TiKV health control feedback (as discussed in tikv/tikv#16297), where some processing capacity information is carried in PD responses and fed back to the KV client. Based on this feedback, the kv-client can then decide its concurrency control and rate-limiting strategy accordingly.
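
As a purely hypothetical illustration of that feedback loop (not code from this PR or from tikv/tikv#16297), the client side could be as simple as an adaptive in-flight limit driven by a server-reported load factor:

```go
package backoff

import "sync"

// adaptiveLimit is a generic sketch of feedback-driven concurrency control:
// the server is assumed to report a load factor in [0, 1] with each response,
// and the client raises its in-flight limit additively while the server looks
// healthy and halves it when the server reports it is close to overload.
type adaptiveLimit struct {
	mu    sync.Mutex
	limit int
	min   int
	max   int
}

// onFeedback adjusts the limit based on the reported load factor.
func (a *adaptiveLimit) onFeedback(loadFactor float64) {
	a.mu.Lock()
	defer a.mu.Unlock()
	switch {
	case loadFactor > 0.9 && a.limit/2 >= a.min:
		a.limit /= 2 // back off hard under overload
	case loadFactor < 0.5 && a.limit < a.max:
		a.limit++ // probe upward slowly while healthy
	}
}
```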

@Tema (Contributor, Author) commented Oct 18, 2024

Thanks @cfzjywxk for the comment. I think it is not always possible for PD to reply with this information to TiDB when it is completely overloaded. tikv/pd#8678 proposes a more sophisticated solution that covers that case as well. Maybe it is worth seeing whether it could be incorporated into tikv/tikv#16297, which you mentioned.
Anyway, these referenced solutions look too heavy and will take some time to productionize, while this PR is more of a simple stop-the-bleeding fix to prevent this problem as soon as possible.

@@ -96,6 +98,50 @@ func NewConfig(name string, metric *prometheus.Observer, backoffFnCfg *BackoffFn
}
}

type RetryRateLimiter struct {
Contributor

Exported functions and type definitions need comments, like

// RetryRateLimiter is used to limit retry times for PD requests.

or the lint check would fail.

Contributor (Author)

Somehow golangci-lint run didn't complain. Fixed each comment.

cap int32
}

func NewRetryRateLimiter(cap int32, ratio float32) *RetryRateLimiter {
Contributor

Ditto for the comments, and it would be better to explain the meaning of the input parameters.

Besides, would it be less expensive to use an int or uint type for ratio instead of float values?

Contributor (Author)

The reason it is a float is that the ratio is usually less than 1, e.g. 1 retry for every 10 successes means 0.1. In the last commit I've inverted it to successPerRetryCount, so it can now be an int. I chose int over uint as there is no rand.uint version.

}
}

// add a token to the rate limiter bucket according to configured retry to success ratio and cap
Contributor

The comment format should be like

// addRetryToken is a ...

i.e. it needs to start with the function name.

Contributor (Author)

addressed


// add a token to the rate limiter bucket according to configured retry to success ratio and cap
func (r *RetryRateLimiter) addRetryToken() {
if rand.Float32() < r.allowedRetryToSuccessRatio {
Contributor

As mentioned above, would it be less expensive to use integer random values?

Contributor (Author)

addressed

return false
}
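
For reference, a hedged sketch of the integer-based variant discussed in this thread, with assumed field names (the actual commit may differ): the float comparison is replaced by an integer draw against successPerRetryCount.

```go
package backoff

import (
	"math/rand"
	"sync/atomic"
)

// Assumed layout for illustration only; the real struct in this PR may differ.
type retryRateLimiterIntSketch struct {
	availableTokens      int32
	cap                  int32
	successPerRetryCount int // e.g. 10 means one retry token per 10 successes
}

// addRetryToken grants a token on average once per successPerRetryCount
// successful calls: rand.Intn(n) is uniform over [0, n), so the branch fires
// with probability 1/n, avoiding the float comparison entirely.
func (r *retryRateLimiterIntSketch) addRetryToken() {
	if r.successPerRetryCount > 0 && rand.Intn(r.successPerRetryCount) == 0 {
		if atomic.LoadInt32(&r.availableTokens) < r.cap {
			atomic.AddInt32(&r.availableTokens, 1)
		}
	}
}
```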

func NewConfigWithRetryLimit(name string, metric *prometheus.Observer, backoffFnCfg *BackoffFnCfg, retryRateLimiter *RetryRateLimiter, err error) *Config {
Contributor

Ditto for the comment on the exported function.

Contributor (Author)

addressed

BoTiFlashRPC = NewConfig("tiflashRPC", &metrics.BackoffHistogramRPC, NewBackoffFnCfg(100, 2000, EqualJitter), tikverr.ErrTiFlashServerTimeout)
BoTxnLock = NewConfig("txnLock", &metrics.BackoffHistogramLock, NewBackoffFnCfg(100, 3000, EqualJitter), tikverr.ErrResolveLockTimeout)
BoPDRPC = NewConfig("pdRPC", &metrics.BackoffHistogramPD, NewBackoffFnCfg(500, 3000, EqualJitter), tikverr.NewErrPDServerTimeout(""))
BoPDRegionMetadata = NewConfigWithRetryLimit("pdRegionMetadata", &metrics.BackoffHistogramPD, NewBackoffFnCfg(500, 3000, EqualJitter), NewRetryRateLimiter(10, 0.1), tikverr.NewErrPDServerTimeout(""))
Contributor

As discussed in the previous comments, it would be a challenge to choose a default value for all kinds of scenarios. Do we have some tests supporting the choice of 10, 0.1?

Contributor (Author)

Yeah, choosing the config value is tricky. A 10% retry-to-success ratio seems like a reasonable limit. The retry cap of 10 does not matter much: in steady state retries will still be proportional to the success rate, and the cap just limits bursts.

Maybe we need to look at a way to allow configuring it per type in tidb-server; then we could disable it by default and provide a way to enable it through configuration with specific values. But I don't know what a good config interface for that would be. Sysvars and toml would be messy for it. Maybe some system table could be a good interface, but I haven't seen such an approach in TiDB yet.

Contributor (Author)

@cfzjywxk @niubell how about we add a very conservative config, say (cap: 1000, success/retry: 1)? Basically, for each success we allow one more retry on average, with very relaxed bursts of up to 1000 retries, and we use it just for loadRegion(s). This way we are unlikely to affect any steady state, but we still limit the QPS spike due to retries to at most 2x of the steady state.
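
For illustration, assuming the constructor keeps the (cap, successPerRetryCount) shape from the later commits, that conservative setting would look roughly like this (a sketch of the proposal, not part of the diff):

```go
// cap: 1000 banked retries to absorb bursts; successPerRetryCount: 1, i.e. on
// average one extra retry is earned per successful call, so sustained retry
// QPS stays at or below roughly the steady-state success QPS (~2x total load).
BoPDRegionMetadata = NewConfigWithRetryLimit(
	"pdRegionMetadata",
	&metrics.BackoffHistogramPD,
	NewBackoffFnCfg(500, 3000, EqualJitter),
	NewRetryRateLimiter(1000, 1),
	tikverr.NewErrPDServerTimeout(""),
)
```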

Signed-off-by: artem_danilov <[email protected]>
Tema added a commit to Tema/tidb that referenced this pull request Oct 25, 2024
Tema added a commit to Tema/tidb that referenced this pull request Oct 25, 2024
Labels
dco-signoff: yes Indicates the PR's author has signed the dco. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.