
*: Implement etcd QoS POC #12290

Closed
wants to merge 17 commits into from

Conversation

tangcong
Contributor

@tangcong tangcong commented Sep 12, 2020

Draft Proposal

Please see the latest draft proposal: etcd QoS Feature Design

Background

As the metadata store for Kubernetes, etcd's stability is extremely important. However, at present etcd can consume a lot of CPU, memory, and bandwidth when facing highly concurrent, expensive read/write requests, resulting in high service load and even OOM. Common expensive requests include full keyspace fetches, large event queries (list events), large pod queries (list pods), and heavy configmap writes.

At the same time, etcd currently has only an extremely simple speed-limit protection: when the gap between etcd's committed index and applied index exceeds 5000, it rejects all requests and returns the error ErrTooManyRequests. Its shortcomings are obvious: it cannot precisely limit expensive read/write requests, and it cannot effectively prevent the cluster from being overloaded.
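For reference, the existing protection amounts to a single fixed-gap check. The following is a simplified Go sketch of the behavior described above; the names here are illustrative, not the actual code in the v3 RPC interceptor.

package qospoc

import "errors"

// ErrTooManyRequests stands in for rpctypes.ErrTooManyRequests.
var ErrTooManyRequests = errors.New("etcdserver: too many requests")

// maxGap mirrors the fixed committed/applied index gap (5000) mentioned above.
const maxGap = 5000

// checkOverload sketches the current behavior: once the apply backlog grows
// past the gap, every request is rejected, cheap and expensive alike.
func checkOverload(committedIndex, appliedIndex uint64) error {
	if committedIndex > appliedIndex && committedIndex-appliedIndex > maxGap {
		return ErrTooManyRequests
	}
	return nil
}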

Therefore, I hope to implement a QoS feature that allows users to set a maximum QPS for different paths and gRPC methods, and that can automatically identify expensive read/write requests and limit their rate, to prevent etcd overload and further improve etcd cluster stability.

Goals

● Support configuring QoS rules based on multiple object types (such as gRPCMethod, gRPCMethod + request key prefix path, traffic, CPU-intensive/RangeKeyNum, latency).

● Support multiple rate limiters for different scenarios (such as token bucket filter, maxInflightRequest, leaky bucket filter), with good scalability.

● Each QoS rule can be configured with its own rate limiter and priority; if a request satisfies multiple rules, the rule with the highest priority is selected (see the sketch after this list).

● The memory consumption of the QoS feature can be controlled without significantly reducing the server performance.

● Users can customize rules simply through conditional expressions. (For example, when dbUsedBytes is about to exceed the db quota in bytes, you can rate-limit Kubernetes event writes.)
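A minimal sketch of how these goals could fit together (illustrative only; the type and field names are assumptions, not this POC's actual API): each rule pairs a matcher with a priority and a token-bucket limiter from golang.org/x/time/rate, and the highest-priority matching rule decides whether a request passes.

package qospoc

import (
	"strings"

	"golang.org/x/time/rate"
)

// Rule is an illustrative shape only, not the POC's actual API.
type Rule struct {
	Name       string
	GRPCMethod string        // e.g. "Range", "Put"
	KeyPrefix  string        // empty means "any key"
	Priority   int           // higher wins when several rules match
	Limiter    *rate.Limiter // token bucket, e.g. rate.NewLimiter(qps, burst)
}

func (r *Rule) matches(method, key string) bool {
	return r.GRPCMethod == method &&
		(r.KeyPrefix == "" || strings.HasPrefix(key, r.KeyPrefix))
}

// Pick returns the highest-priority rule matching the request, or nil.
func Pick(rules []*Rule, method, key string) *Rule {
	var best *Rule
	for _, r := range rules {
		if r.matches(method, key) && (best == nil || r.Priority > best.Priority) {
			best = r
		}
	}
	return best
}

// Allow applies the matched rule's token bucket, if any rule matched.
func Allow(rules []*Rule, method, key string) bool {
	if r := Pick(rules, method, key); r != nil {
		return r.Limiter.Allow()
	}
	return true
}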

POC Implementation/Feature

● Implement a QoS API; rules can be adjusted dynamically.
● Support multiple rate limiters and rule types.
● Users can customize rules simply through conditional expressions (see the sketch after this list).
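As discussed later in this thread, the POC vendors github.com/antonmedv/expr for condition expressions. The following is a minimal sketch of evaluating the custom rule from the demo below with that library; the environment values are made up for the example, and the actual POC wiring may differ.

package main

import (
	"fmt"
	"strings"

	"github.com/antonmedv/expr"
)

func main() {
	// Condition taken from the demo below; DBUsedByte, key, gRPCMethod and
	// prefix() are exposed to the expression through the environment map.
	condition := `DBUsedByte > 4096 && prefix(key, "/config") && gRPCMethod == "Put"`

	env := map[string]interface{}{
		"DBUsedByte": int64(8192),
		"key":        "/config/app1",
		"gRPCMethod": "Put",
		"prefix":     func(s, p string) bool { return strings.HasPrefix(s, p) },
	}

	// Compile once when the rule is added, then evaluate per request.
	program, err := expr.Compile(condition, expr.Env(env))
	if err != nil {
		panic(err)
	}
	matched, err := expr.Run(program, env)
	if err != nil {
		panic(err)
	}
	fmt.Println("rule matched:", matched) // true for this sample environment
}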

Architecture/Flow

[architecture/flow diagram]
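In rough terms, the flow is: gRPC request → QoS enforcer (match a rule by method/key, evaluate its threshold or condition, consult its rate limiter) → KV server. A hedged sketch of that hook as a gRPC unary interceptor follows; the Enforcer interface and the key-extraction callback are assumptions for illustration, not this PR's code.

package qospoc

import (
	"context"
	"errors"
	"path"

	"google.golang.org/grpc"
)

// Enforcer is a stand-in for the POC's QoS enforcer (see qos/enforcer.go in
// the branch); only the allow/deny decision matters for this sketch.
type Enforcer interface {
	Allow(grpcMethod, key string) bool
}

var errTooManyRequests = errors.New("etcdserver: too many requests")

// qosUnaryInterceptor shows where the enforcer sits in the request flow:
// gRPC request -> rule match -> rate limiter -> KV server handler.
func qosUnaryInterceptor(e Enforcer, keyOf func(req interface{}) string) grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req interface{},
		info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
		method := path.Base(info.FullMethod) // e.g. "Range", "Put", "Txn"
		if !e.Allow(method, keyOf(req)) {
			return nil, errTooManyRequests
		}
		return handler(ctx, req)
	}
}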

Demo

add qos rule

etcdctl qos add rule-gRPCMethod-1 gRPCMethod Range /prefix 100 TokenBucket 1 0 ""
NAME:
	qos add - adds a qos rule into the cluster

USAGE:
	etcdctl qos add <rule name> <rule type> <subject> <key prefix> <qps> <ratelimiter> <priority(0-9)> <threshold> <condition> [flags]

DESCRIPTION:
	add/update the QoS rule in the store.
	rule name is the unique name of a qos rule; you can get/update/delete a qos rule by name.
	rule type supports gRPCMethod, slowquery, traffic, custom.
	subject supports gRPCMethod (such as Range, Put, Txn, Authenticate) and RangeKeyNum, DBUsedByte.
	key prefix: if it is not empty, the rule only takes effect when the user key matches it.
	qps indicates the maximum number of requests allowed to pass.
	ratelimiter supports TokenBucket, MaxInflight, etc.
	priority indicates the priority of this rule; when a request satisfies multiple rules, the rule with the highest priority is selected.
	threshold indicates the rule takes effect when subject (RangeKeyNum, DBUsedByte) > threshold; it runs faster than a condition expr.
	condition supports complex expressions. the following keywords are supported: RangeKeyNum, DBUsedByte, gRPCMethod, key, and the prefix function.
	RangeKeyNum: the total number of keys traversed by the query request.
	DBUsedByte: the bytes of backend db used.

	for example:
	$ etcdctl qos add/update rule-gRPCMethod-1 gRPCMethod Range /prefix 100 TokenBucket 1 0 ""
	$ etcdctl qos add/update rule-slowquery-1 slowquery RangeKeyNum "" 100 MaxInflight 9 1000 ""
	$ etcdctl qos add/update rule-custom-1 custom "" "" 100 TokenBucket 2 0  'DBUsedByte > 4096 && prefix(key,"/config") && gRPCMethod == "Put"'

enable qos feature

etcdctl qos enable

etcd benchmark

./bin/tools/benchmark range /prefix --rate 200 --total 3000

13:18:20 etcd1 | {"level":"info","ts":"2020-09-12T13:18:20.173+0800","caller":"qos/enforcer.go:103","msg":"matched rule","object":{"GRPCMethod":"Range","Key":"/prefix","RangeKeyNum":0,"DBUsedByte":0,"MatchedRuleName":""},"rule":"rule_name:\"rule-gRPCMethod-1\" rule_type:\"gRPCMethod\" subject:<name:\"Range\" prefix:\"/prefix\" > qps:100 priority:1 ratelimiter:\"TokenBucket\" "}


POC Goal

The purpose of the POC is to confirm whether the community agrees to add this feature and how to implement it. If the community agrees with this feature and roughly agrees with this POC's implementation, I will continue to improve it, open a detailed task-list issue, split it into small PRs, and add documentation, comments, tests, etc. Anyone interested in this feature can join in the development.

@tangcong tangcong changed the title Implement Qos POC Implement QoS POC Sep 12, 2020
@tangcong tangcong changed the title Implement QoS POC Implement QoS/RateLimiter POC Sep 12, 2020
@tangcong
Contributor Author

related issues:
#8483 #12164 #10084 #7381

@tangcong
Contributor Author

tangcong commented Sep 12, 2020

@xiang90 @gyuho @jpbetz @jingyih @spzala please take a look. Could we add the QoS feature as an experimental API in v3.5? I'm not sure if we need the ability to customize rules by condition expression, but someone suggested it. The design and implementation support adding more rule executors (such as condition expressions). But for 3.5, I suggest that the built-in common rules (gRPCMethod, slowquery, traffic) can solve most scenarios. We can continue to improve based on community feedback in the future.
Looking forward to your suggestions. Thanks.

@tangcong
Contributor Author

cc @ptabor @wenjiaswe

@cfc4n
Contributor

cfc4n commented Sep 12, 2020

This feature is so great. 👍🏻

@tangcong
Contributor Author

This feature is so great. 👍🏻

Thanks. It will play an important role in improving cluster stability in scenarios such as high concurrency and slow queries. At the same time, multi-tenant scenarios also urgently need such a rate-limiting mechanism.

So I hope the community can add the QoS feature in 3.5; there is a lot of work to be done later, and we can work on it together.

@cfc4n
Contributor

cfc4n commented Sep 12, 2020

I'm looking forward to working on this feature with you. 😁

Do we need to upload the package github.com/antonmedv/expr into vendor? Can't we use go mod to download it automatically?

e.g.:
https://github.com/etcd-io/etcd/blob/ef8d33f1efbf796ddeea806c84c6807b85559335/vendor/github.com/antonmedv/expr/ast/node.go

@tangcong
Contributor Author

tangcong commented Sep 12, 2020

Do we need to upload the package github.com/antonmedv/expr into vendor? Can't we use go mod to download it automatically?

CI fails if I do not run ./scripts/updatebom.sh and updatedep.sh; they generate bill-of-materials.json and update vendor.
Can you compile locally after checking out my branch?

@tangcong
Contributor Author

Yep, just wanted to put forth the problems it's solving. I'm assuming the storage rule keeps running continuously in a goroutine of some sort to check whether sizeInUse has crossed the threshold? @tangcong

I've added the implementation suggestion to the doc too.

Thanks for the lucid explanation.

Good. I think some rules can be evaluated asynchronously (for example, checking whether sizeInUse has crossed the threshold in a goroutine), which can improve performance. Rules that need to match user request parameters may have to be evaluated synchronously, but that will also be a fast operation.

I will take a look at your suggestions later, thanks.
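A minimal sketch of the asynchronous evaluation described above, assuming a callback that reports the backend's size in use: a goroutine keeps an atomic flag up to date, so the synchronous request path only loads the flag.

package qospoc

import (
	"sync/atomic"
	"time"
)

// dbSizeWatcher polls the backend size in a goroutine and records whether it
// has crossed the rule's threshold; sizeInUse is an assumed callback into the
// backend, not an existing etcd API.
type dbSizeWatcher struct {
	exceeded int32
}

func (w *dbSizeWatcher) run(sizeInUse func() int64, threshold int64, interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			if sizeInUse() > threshold {
				atomic.StoreInt32(&w.exceeded, 1)
			} else {
				atomic.StoreInt32(&w.exceeded, 0)
			}
		case <-stop:
			return
		}
	}
}

// Exceeded is what the synchronous per-request path consults.
func (w *dbSizeWatcher) Exceeded() bool { return atomic.LoadInt32(&w.exceeded) == 1 }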

@gyuho
Contributor

gyuho commented Sep 23, 2020

@tangcong @vivekpatani

Thanks all for the POC and design docs.

OK, got it. If we implement it based on the proxy, it will not change the etcd server binary, but most users do not use the proxy. This feature will be introduced behind an experimental flag; if you do not enable the experimental flag, the etcd QoS API is not available and no QoS-related buckets will be created in boltdb.

We believe such a full-featured QoS/QPS layer will be very hard to keep compatible with other projects (or code), and may even confuse users. We'd like to keep etcd as minimal as possible, rather than adding another piece of complexity to maintain. grpc-proxy is another example: not everybody is using it, but its lack of maturity has been slowing down or blocking core development.

Unless this can be implemented simply as a gRPC server middleware, I believe this is not the problem etcd is designed to solve. It's already pretty easy to embed a rate limiter on the client side however you want, whereas a built-in rate limiter will never satisfy all the requirements of all use cases.
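For illustration of the client-side option mentioned above (not part of this PR): a caller can wrap its own etcd client calls with golang.org/x/time/rate. The endpoint and limits below are placeholders, and the clientv3 import path depends on the etcd client version in use.

package main

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/clientv3" // path varies with client version
	"golang.org/x/time/rate"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// 100 requests/second with a small burst; Wait blocks until a token is free.
	limiter := rate.NewLimiter(rate.Limit(100), 10)

	ctx := context.Background()
	if err := limiter.Wait(ctx); err != nil {
		panic(err)
	}
	if _, err := cli.Get(ctx, "/prefix", clientv3.WithPrefix()); err != nil {
		panic(err)
	}
}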

@tangcong
Contributor Author

tangcong commented Sep 25, 2020

Thanks, we see. The current etcd QoS Feature Design seems to meet most use cases (for example, the currently known related issues could be resolved). Would you support doing some early prototypes in the proxy? @gyuho

@gyuho
Contributor

gyuho commented Sep 25, 2020

@tangcong Yes, proxy as a separate component sounds better. Thanks.

@brandon-miles

I'm a little disappointed this feature didn't make it to etcd. To me this is less effective at the proxy layer because we lose the ability to have server-side rules (DB size etc.). As an extremely heavy user of etcd and K8s, one of our biggest concerns is running out of DB space, and rate limiting once a certain threshold is hit — to me — was the killer feature of this patch. I understand the desire to keep etcd lean, and throw some of the complexity into the proxy, but in doing so some of the features here get watered down. Honestly, for our use case, I don't see much of a need to rate limit or provide QoS for etcd except if it's to save the cluster from an outage.

Anyhow, thanks @tangcong and @vivekpatani for your work here, I think this would have been a great feature.

@tangcong
Contributor Author

tangcong commented Sep 26, 2020

Thanks @gyuho @brandon-miles. To implement this feature at the proxy layer, it seems the design proposal does not need major changes. We can implement the QoS API in the proxy, store the rules under a specified key prefix in etcd (the user can specify the QoS key prefix path), and implement the QoS-related commands in etcdctl. We can also implement rate limiting based on PercentDBQuotaUsed (we would need to fetch the db quota and db used size from etcd periodically). What do you think?
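A rough sketch of the periodic poll described above, assuming the proxy is configured with the quota value itself (the Status API reports db size but not the quota); the function name and output are made up for illustration.

package qosproxy

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/clientv3" // path varies with client version
)

// pollDBUsage periodically fetches the backend size from the etcd Status API
// and derives a percent-of-quota figure that the proxy's rules can consume.
func pollDBUsage(cli *clientv3.Client, endpoint string, quotaBytes int64, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
		st, err := cli.Status(ctx, endpoint)
		cancel()
		if err != nil {
			continue // keep the last known value on transient errors
		}
		percent := float64(st.DbSize) / float64(quotaBytes) * 100
		fmt.Printf("PercentDBQuotaUsed=%.1f%%\n", percent) // feed this into the rules
	}
}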

@tangcong
Contributor Author

If it is implemented at the proxy layer, will you use it? What is your current etcd deployment based on? @brandon-miles

@WIZARD-CXY
Contributor

To me this is less effective at the proxy layer because we lose the ability to have server-side rules (DB size etc.).

I think it is hard even for the server side to get the db size under one prefix.

@ptabor
Contributor

ptabor commented Sep 27, 2020

It's a common practice to have a separate monitoring system that, if needed, alerts or activates a rule by dynamically pushing configuration to a service (which needs to be high priority). Hard-coding this within the server's logic sounds too custom.

@jpbetz
Contributor

jpbetz commented Sep 28, 2020

As an extremely heavy user of etcd and K8s, one of our biggest concerns is running out of DB space, and rate limiting once a certain threshold is hit — to me — was the killer feature of this patch

Have you tried k8s ResourceQuotas? Feedback on where you feel that fell short, if you've already tried it, could be a productive conversation.

I'd also be curious to understand what you think could be solved by QoS on the etcd side. It's not obvious to me that a killer feature exists. I glanced through the design once more and I don't see anything that changed my mind.

@brandon-miles

@jpbetz Yes, we use resource quotas, but to my knowledge they don't track resource size, just count. It's also tricky to estimate average resource size, since it can vary based on the cluster's intended usage.

Operationally, having a switch that rate limits requests once a certain DB size is crossed would be a great safety net for us. I know there are a number of areas where you can mitigate this with K8s (priority and fairness etc), but I think it would also be good for etcd to not rely on K8s to always behave and do the right thing.

@jpbetz
Contributor

jpbetz commented Sep 28, 2020

Rate limiting is also a major liability. Adding code that delays or drops requests is an inherently risky thing to do. The list of issues I've been personally involved in where a rate limiter caused a production issue is quite long.

Rather than adding unproven code to etcd to try to solve this problem, I think it would be much more helpful to start by first demonstrating an actual success case, and ideally put some real production mileage on it. We don't need to merge code into etcd to do that, and honestly, I think it would be much faster to iterate on and improve a solution outside of the etcd code base. I personally would be very interested in the learnings and outcomes of such an experiment, and I think the rest of the etcd community would be too.

@WIZARD-CXY
Contributor

I agree with what jpbetz said.

@brandon-miles

@jpbetz Sounds good. We'll report back once we have some mileage on our internal implementation.

@stale

stale bot commented Dec 29, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Dec 29, 2020
@stale stale bot closed this Jan 19, 2021
@tangcong tangcong deleted the qos-poc branch February 26, 2021 18:58
@gyuho
Contributor

gyuho commented Apr 15, 2021

@tangcong @WIZARD-CXY

I strongly believe the community would hugely benefit from a built-in etcd load-shedding mechanism. There are just too many unknowns and variables. We at AWS experienced the same need for gracefully handling load spikes.

We may also require client-side behavior changes. For instance, what's the kube-apiserver behavior when it gets "too many requests" errors? Does kube-apiserver have back-off logic? Have we thought about enforcing kube-apiserver to do paginated calls, and for other controllers as well? As far as I know, large range queries are the most demanding workloads in Kubernetes land.

So, I think we will need to revisit this in the near future. Please don't misunderstand our push-backs. I think this is the best POC we have got so far.
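For illustration of the paginated calls mentioned above (not something this PR implements): a client can already keep individual Range calls small by paging over a prefix with the v3 client, along the lines of the sketch below. The clientv3 import path depends on the client version in use.

package qosclient

import (
	"context"
	"fmt"

	clientv3 "go.etcd.io/etcd/clientv3" // path varies with client version
)

// listPrefixPaginated fetches all keys under prefix in pages of pageSize,
// so no single Range call has to traverse the whole keyspace at once.
func listPrefixPaginated(ctx context.Context, cli *clientv3.Client, prefix string, pageSize int64) error {
	key := prefix
	end := clientv3.GetPrefixRangeEnd(prefix)
	for {
		resp, err := cli.Get(ctx, key,
			clientv3.WithRange(end),
			clientv3.WithLimit(pageSize),
			clientv3.WithSort(clientv3.SortByKey, clientv3.SortAscend))
		if err != nil {
			return err
		}
		for _, kv := range resp.Kvs {
			fmt.Printf("%s\n", kv.Key)
		}
		if len(resp.Kvs) == 0 || !resp.More {
			return nil
		}
		// Continue just after the last key of this page.
		key = string(resp.Kvs[len(resp.Kvs)-1].Key) + "\x00"
	}
}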

@tangcong tangcong restored the qos-poc branch April 18, 2021 00:15
@tangcong
Contributor Author

/remove-stale

@brandon-miles

Just to follow up: we are currently rolling out a rate limiter to production based on a modified version of @vivekpatani's PR (#12289). Initial tests have been encouraging, and we now have some defense against bad Kube behaviors (large range queries, etc.). We will have a full writeup of our findings once we have some mileage on this in production.

@JohnRusk

we are currently rolling out a rate limiter to production

@brandon-miles , do you have any update on this?

@brandon-miles

@JohnRusk we've been using @vivekpatani's PR in production for over 2 years now without issues, and I believe it has greatly improved the etcd stability in our large clusters. We currently only rate limit 2 keys for our Kube clusters, which are essentially full pod or event listings. Since these are very large clusters, just a handful of full pod listings per second were enough to OOM our etcd nodes, hence our desire to guard against this.

While Kube has made some great progress with API Priority and Fairness, I still believe etcd can and should guard itself, and that rate limiting is an important feature that should be integrated on the server side.

@JohnRusk

That sounds good.

@vivekpatani and @brandon-miles , do you know anything about what the etcd team is thinking about this - e.g. are there plans for anything like this to be officially merged to master?

@ahrtr
Member

ahrtr commented Jul 26, 2023

we've been using @vivekpatani's PR

The PR link?

@fuweid
Member

fuweid commented Jul 26, 2023

@ahrtr I guess it is #12289

@serathius
Member

For Kubernetes, APF over etcd QoS should give you much better results. Still, we should look into this when we start tackling multi-tenant etcd. For now etcd has bigger issues than this, and we don't have the capacity to execute.
