*: Implement etcd QoS POC #12290
Conversation
@xiang90 @gyuho @jpbetz @jingyih @spzala please take a look. Could we add the QoS feature as an experimental API in v3.5? I'm not sure whether we need the ability to customize rules via condition expressions, but someone suggested it. The design and implementation support adding more rule executors (such as condition expressions). For 3.5, though, I suggest that the built-in common rules (gRPCMethod, gRPCMethod + key prefix, slowlog, traffic) can cover most scenarios. We can continue to improve based on community feedback in the future.
This feature is so great. 👍🏻
Thanks. It will play an important role in improving cluster stability in scenarios such as high concurrency and slow queries. At the same time, multi-tenant scenarios urgently need such a rate-limiting mechanism, so I hope the community can add the QoS feature in 3.5. There is a lot of work to be done later; we can work on it together.
I'm looking forward to doing this feature with you. 😁 Do we need to upload the package
CI fails if I do not run ./scripts/updatebom.sh and updatedep.sh; they generate bill-of-materials.json and update the vendor directory.
Good. I think some rules can be calculated asynchronously (for example, ...). I will take a look at your suggestions later, thanks.
Thanks all for the POC and design docs.
We believe such a full-featured QoS/QPS layer will be very hard to keep compatible with other projects (or code), and may even confuse users. We'd like to keep etcd as minimal as possible rather than adding more complexity to maintain. grpc-proxy is another example: not everybody uses it, but its lack of maturity has been slowing down or blocking core development. Unless this can be implemented simply as a gRPC server middleware, I believe this is not a problem etcd is designed to solve. It's already pretty easy to embed a rate limiter on the client side however you want, whereas a built-in rate limiter will never satisfy all the requirements for all use cases.
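For illustration, here is a minimal sketch of the client-side approach described above, wrapping clientv3 calls with golang.org/x/time/rate. The endpoint, key prefix, and limits are placeholders, and the import path assumes the v3.5+ module layout:

```go
// Minimal sketch of client-side rate limiting with clientv3 and
// golang.org/x/time/rate. Values below are illustrative only.
package main

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"golang.org/x/time/rate"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Allow at most 100 requests per second with a burst of 20.
	limiter := rate.NewLimiter(rate.Limit(100), 20)

	ctx := context.Background()
	// Block until the limiter grants a token, then issue the expensive read.
	if err := limiter.Wait(ctx); err != nil {
		panic(err)
	}
	if _, err := cli.Get(ctx, "/registry/pods/", clientv3.WithPrefix()); err != nil {
		panic(err)
	}
}
```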
Thanks, we see. The current etcd QoS Feature Design seems to meet most use cases (for example, the currently known related issues could be resolved). Would you support doing some early prototypes in the proxy? @gyuho
@tangcong Yes, proxy as a separate component sounds better. Thanks.
I'm a little disappointed this feature didn't make it to etcd. To me this is less effective at the proxy layer because we lose the ability to have server-side rules (DB size etc.). As an extremely heavy user of etcd and K8s, one of our biggest concerns is running out of DB space, and rate limiting once a certain threshold is hit was, to me, the killer feature of this patch. I understand the desire to keep etcd lean, and throw some of the complexity into the proxy, but in doing so some of the features here get watered down. Honestly, for our use case, I don't see much of a need to rate limit or provide QoS for etcd except if it's to save the cluster from an outage. Anyhow, thanks @tangcong and @vivekpatani for your work here, I think this would have been a great feature.
Thanks @gyuho @brandon-miles. To implement this feature at the proxy layer, it seems the design proposal does not need major changes. We can implement the QoS API in the proxy, store the rules under a specified prefix key in etcd (the user can specify the QoS prefix key path), and implement the QoS-related commands in etcdctl. We can also implement rate limiting based on PercentDBQuotaUsed (we need to get the db quota and db used size from etcd periodically). What do you think?
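As a rough illustration of the PercentDBQuotaUsed idea, a proxy could periodically poll the Status RPC for the current db size and compare it against a quota supplied via configuration (the quota itself is not returned by Status). The type and field names below are hypothetical:

```go
// Hypothetical sketch: poll Status for DbSize, compare against a configured
// quota, and expose a simple threshold check for the proxy's limiters.
package qosproxy

import (
	"context"
	"log"
	"sync/atomic"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

type dbQuotaWatcher struct {
	cli        *clientv3.Client
	endpoint   string
	quotaBytes int64        // configured to match etcd's --quota-backend-bytes
	percent    atomic.Value // float64, last observed usage in percent
}

func (w *dbQuotaWatcher) run(ctx context.Context) {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			resp, err := w.cli.Status(ctx, w.endpoint)
			if err != nil {
				log.Printf("status poll failed: %v", err)
				continue
			}
			w.percent.Store(float64(resp.DbSize) / float64(w.quotaBytes) * 100)
		}
	}
}

// shouldThrottle reports whether requests under a throttled prefix (e.g. events)
// should be rate limited because db usage crossed the configured threshold.
func (w *dbQuotaWatcher) shouldThrottle(thresholdPercent float64) bool {
	p, ok := w.percent.Load().(float64)
	return ok && p >= thresholdPercent
}
```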
If it is implemented at the proxy layer, will you use it? What is your current etcd deployment based on? @brandon-miles
It is hard for the server side to get the db size under one prefix too, I think.
It's a common practice to have a separate monitoring system that, if needed, alerts or activates a rule using a dynamic push of configuration to a service (that needs to be high priority). Hard-coding this within the server's logic sounds too custom.
Have you tried k8s ResourceQuotas? Feedback on where you feel that fell short, if you've already tried it, could be a productive conversation. I'd also be curious to understand what you think could be solved by QoS on the etcd side. It's not obvious to me that a killer feature exists. I glanced through the design once more and I don't see anything that changed my mind.
@jpbetz Yes, we use resource quotas, but to my knowledge they don't track resource size, just count. It's also tricky to estimate average resource size as well, since it can vary based on the cluster's intended usage. Operationally, having a switch that rate limits requests once a certain DB size is crossed would be a great safety net for us. I know there are a number of areas where you can mitigate this with K8s (priority and fairness etc.), but I think it would also be good for etcd to not rely on K8s to always behave and do the right thing.
Rate limiting is also a major liability. Adding code that delays or drops requests is an inherently risky thing to do. The list of issues I've been personally involved in where a rate limiter caused a production issue is quite long. Rather than adding unproven code to etcd to try to solve this problem, I think it would be much more helpful to start by first demonstrating an actual success case, and ideally put some real production mileage on it. We don't need to merge code into etcd to do that, and honestly, I think it would be much faster to iterate and improve a solution outside of the etcd code base. I personally would be very interested in the learnings and outcomes of such an experiment, and I think the rest of the etcd community would be too.
I agree with what jpbetz said.
@jpbetz Sounds good. We'll report back once we have some mileage on our internal implementation.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
I strongly believe the community would hugely benefit from a built-in etcd load shedding mechanism. There are just too many unknowns and variables. We at AWS experienced the same need for gracefully handling load spikes. We may also require client-side behavior changes. For instance, what's the kube-apiserver behavior when it gets "too many requests" errors? Does kube-apiserver have back-off logic? Have we thought about enforcing kube-apiserver to do paginated calls, and for other controllers as well? As far as I know, large range queries are the most demanding workloads in Kubernetes land. So, I think we will need to revisit this in the near future. Please don't misunderstand our push-backs. I think this is the best POC we have got so far.
/remove-stale
Just to follow up, we are currently rolling out a rate limiter to production based on a modified version of @vivekpatani's PR (#12289). Initial tests have been encouraging, and we now have some defense against bad Kube behaviors (large range queries etc). Will have a full writeup on our findings once we have some mileage on this in production.
@brandon-miles, do you have any update on this?
@JohnRusk we've been using @vivekpatani's PR in production for over 2 years now without issues, and I believe it has greatly improved etcd stability in our large clusters. We currently rate limit only 2 keys for our Kube clusters, which are essentially full pod or event listings. Since these are very large clusters, just a handful of full pod listings per second was enough to OOM our etcd nodes, hence our desire to guard against this. While Kube has made some great progress with API Priority and Fairness, I still believe etcd can and should guard itself, and that rate limiting is an important feature that should be integrated on the server side.
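For readers unfamiliar with the approach, the idea is roughly a gRPC unary interceptor that rate limits Range requests whose keys fall under a few hot prefixes. The sketch below is illustrative only and is not the code from #12289; the prefixes, limits, and import paths (v3.5+ layout) are assumptions:

```go
// Minimal sketch (not the code from #12289) of rate limiting Range requests on
// specific hot key prefixes via a gRPC unary server interceptor.
package qos

import (
	"context"
	"strings"

	pb "go.etcd.io/etcd/api/v3/etcdserverpb"
	"go.etcd.io/etcd/api/v3/v3rpc/rpctypes"
	"golang.org/x/time/rate"
	"google.golang.org/grpc"
)

// Illustrative: limit full pod and event listings to a few per second.
var hotPrefixLimiters = map[string]*rate.Limiter{
	"/registry/pods/":   rate.NewLimiter(rate.Limit(5), 5),
	"/registry/events/": rate.NewLimiter(rate.Limit(5), 5),
}

func rangeRateLimitInterceptor(
	ctx context.Context,
	req interface{},
	info *grpc.UnaryServerInfo,
	handler grpc.UnaryHandler,
) (interface{}, error) {
	if r, ok := req.(*pb.RangeRequest); ok {
		key := string(r.Key)
		for prefix, lim := range hotPrefixLimiters {
			if strings.HasPrefix(key, prefix) && !lim.Allow() {
				// Reject instead of queueing so callers back off.
				return nil, rpctypes.ErrGRPCTooManyRequests
			}
		}
	}
	return handler(ctx, req)
}
```

Such an interceptor would be chained into the server's gRPC options alongside the existing interceptors; how it was actually wired in #12289 may differ.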
That sounds good. @vivekpatani and @brandon-miles, do you know anything about what the etcd team is thinking about this, e.g. are there plans for anything like this to be officially merged to master?
The PR link?
For Kubernetes
Draft Proposal
Please see the latest draft proposal: etcd QoS Feature Design.
Background
As the metadata store for Kubernetes, etcd's stability is extremely important. However, etcd currently consumes a lot of CPU, memory, and bandwidth when facing highly concurrent, expensive read/write requests, resulting in high service load and even OOM. Common expensive requests include full keyspace fetches, large numbers of event queries (list event), large numbers of pod queries (list pods), and large numbers of configmap writes.
At the same time, etcd currently has only an extremely simple rate-limit protection: when the gap between etcd's committed index and applied index exceeds 5000, it rejects all requests and returns the error ErrTooManyRequests. Its shortcomings are obvious: it cannot precisely limit the rate of expensive read/write requests, and it cannot effectively prevent the cluster from being overloaded.
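For reference, the existing protection amounts to a check like the following (a simplified paraphrase of the logic, not the actual etcd source):

```go
// Simplified paraphrase of the existing protection: if the commit/apply gap
// grows past a fixed threshold, requests are rejected outright.
package qos

import "errors"

const maxGapBetweenApplyAndCommitIndex = 5000

var ErrTooManyRequests = errors.New("etcdserver: too many requests")

func checkApplyGap(committedIndex, appliedIndex uint64) error {
	if committedIndex > appliedIndex &&
		committedIndex-appliedIndex > maxGapBetweenApplyAndCommitIndex {
		return ErrTooManyRequests
	}
	return nil
}
```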
Therefore, I hope to implement a QoS feature that allows users to set a maximum QPS for different paths and gRPC methods, and that can automatically identify expensive read/write requests and limit their rate, to prevent etcd overload and further improve etcd cluster stability.
Goals
● Support configuring QoS rules based on multiple object types (such as grpcMethod, grpcMethod + request key prefix path, traffic, cpu-intensive/RangeKeyNum, latency).
● Support multiple rate limiters for different scenarios (such as token bucket filter, maxInflightRequest, leaky bucket filter), with good scalability.
● Different QoS rules support configuring different rate limiters and priorities. If a request matches multiple rules, the rule with the highest priority is selected (see the rule sketch after this list).
● The memory consumption of the QoS feature is controlled and does not significantly reduce server performance.
● Users can customize rules simply through conditional expressions (for example, when dbUsedBytes is about to exceed the db quota bytes, you can limit the rate of Kubernetes event writes).
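To make the goals above concrete, a QoS rule might carry an object type, a rate limiter, a priority, and an optional condition expression. The following is a hypothetical sketch; the field names and matching logic are illustrative, not the proposed API:

```go
// Hypothetical sketch of a QoS rule shape implied by the goals above.
package qos

import "sort"

type RuleType string

const (
	RuleGRPCMethod       RuleType = "gRPCMethod"
	RuleGRPCMethodPrefix RuleType = "gRPCMethod+prefix"
	RuleTraffic          RuleType = "traffic"
	RuleSlowLog          RuleType = "slowlog"
)

type Rule struct {
	Type      RuleType
	Subject   string  // e.g. "Range" or "Range /registry/events/"
	Limiter   string  // e.g. "TokenBucket", "MaxInflightRequest", "LeakyBucket"
	QPS       float64
	Priority  int     // higher wins when several rules match a request
	Condition string  // optional expression, e.g. "dbUsedBytes > dbQuotaBytes*0.95"
}

// pick returns the highest-priority rule among those that matched a request.
func pick(matched []Rule) (Rule, bool) {
	if len(matched) == 0 {
		return Rule{}, false
	}
	sort.Slice(matched, func(i, j int) bool {
		return matched[i].Priority > matched[j].Priority
	})
	return matched[0], true
}
```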
POC Implementation/Feature
● Implement the QoS API; rules can be adjusted dynamically.
● Support multiple rate limiters and rule types.
● Users can customize rules simply through conditional expressions.
Architecture/Flow
Demo
● add qos rule
● enable qos feature
● etcd benchmark: ./bin/tools/benchmark range /prefix --rate 200 --total 3000
POC Goal
The purpose of the POC is to confirm whether the community agrees to add this feature and how to implement it. If the community agrees with this feature and roughly agrees with the implementation in this POC, I will continue to improve it, open a detailed task-list issue, split it into small PRs, and add documentation, comments, tests, etc. Anyone interested in this feature can join the development.