*: Implement etcd QoS POC #12290
Conversation
@xiang90 @gyuho @jpbetz @jingyih @spzala please take a look. Could we add the QoS feature as an experimental API in v3.5? I'm not sure whether we need the ability to customize rules via condition expressions, but someone suggested it. The design and implementation support adding more rule executors (such as condition expressions). For 3.5, though, I suggest that the built-in common rules (gRPCMethod, gRPCMethod + key prefix, slowlog, traffic) can cover most scenarios. We can continue to improve based on community feedback in the future.
This feature is so great. 👍🏻
Thanks. It will play an important role in improving cluster stability in scenarios such as high concurrency and slow queries. At the same time, multi-tenant scenarios urgently need such a rate-limiting mechanism, so I hope the community can add the QoS feature in 3.5. There is a lot of work to be done later; we can work on it together.
I'm looking forward to doing this feature with you. 😁 Do we need to upload the package
CI fails if I do not run ./scripts/updatebom.sh and updatedep.sh; they generate bill-of-materials.json and update the vendor directory.
Good. I think some rules can be calculated asynchronously (for example, ...). I will take a look at your suggestions later, thanks.
Thanks all for the POC and design docs.
We believe such a full-featured QoS/QPS layer will be very hard to keep compatible with other projects (or code), and may even confuse users. We'd like to keep etcd as minimal as possible rather than adding more complexity to maintain. grpc-proxy is another example: not everybody uses it, but its lack of maturity has been slowing down or blocking core development. Unless this can be implemented simply as a gRPC server middleware, I believe this is not a problem etcd is designed to solve. It's already pretty easy to embed a rate limiter on the client side however you want, whereas a built-in rate limiter will never satisfy all the requirements for all use cases.
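For illustration, here is a minimal sketch of the client-side approach described above, wrapping clientv3 calls with golang.org/x/time/rate. The endpoint, key prefix, and limits are placeholders, and the import path assumes the v3.5+ module layout:

```go
// Minimal sketch of client-side rate limiting with clientv3 and
// golang.org/x/time/rate. Values below are illustrative only.
package main

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"golang.org/x/time/rate"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Allow at most 100 requests per second with a burst of 20.
	limiter := rate.NewLimiter(rate.Limit(100), 20)

	ctx := context.Background()
	// Block until the limiter grants a token, then issue the expensive read.
	if err := limiter.Wait(ctx); err != nil {
		panic(err)
	}
	if _, err := cli.Get(ctx, "/registry/pods/", clientv3.WithPrefix()); err != nil {
		panic(err)
	}
}
```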
Thanks, we see. The current etcd QoS Feature Design seems to meet most use cases (for example, the currently known related issues could be resolved). Would you support doing some early prototypes in the proxy? @gyuho
@tangcong Yes, proxy as a separate component sounds better. Thanks.
I'm a little disappointed this feature didn't make it to etcd. To me this is less effective at the proxy layer because we lose the ability to have server-side rules (DB size etc.). As an extremely heavy user of etcd and K8s, one of our biggest concerns is running out of DB space, and rate limiting once a certain threshold is hit was, to me, the killer feature of this patch. I understand the desire to keep etcd lean, and throw some of the complexity into the proxy, but in doing so some of the features here get watered down. Honestly, for our use case, I don't see much of a need to rate limit or provide QoS for etcd except if it's to save the cluster from an outage. Anyhow, thanks @tangcong and @vivekpatani for your work here, I think this would have been a great feature.
Thanks @gyuho @brandon-miles. To implement this feature at the proxy layer, it seems the design proposal does not need major changes. We can implement the QoS API in the proxy, store the rules under a specified prefix key in etcd (the user can specify the QoS prefix key path), and implement the QoS-related commands in etcdctl. We can also implement rate limiting based on PercentDBQuotaUsed (we need to get the db quota and db used size from etcd periodically). What do you think?
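As a rough illustration of the PercentDBQuotaUsed idea, a proxy could periodically poll the Status RPC for the current db size and compare it against a quota supplied via configuration (the quota itself is not returned by Status). The type and field names below are hypothetical:

```go
// Hypothetical sketch: poll Status for DbSize, compare against a configured
// quota, and expose a simple threshold check for the proxy's limiters.
package qosproxy

import (
	"context"
	"log"
	"sync/atomic"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

type dbQuotaWatcher struct {
	cli        *clientv3.Client
	endpoint   string
	quotaBytes int64        // configured to match etcd's --quota-backend-bytes
	percent    atomic.Value // float64, last observed usage in percent
}

func (w *dbQuotaWatcher) run(ctx context.Context) {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			resp, err := w.cli.Status(ctx, w.endpoint)
			if err != nil {
				log.Printf("status poll failed: %v", err)
				continue
			}
			w.percent.Store(float64(resp.DbSize) / float64(w.quotaBytes) * 100)
		}
	}
}

// shouldThrottle reports whether requests under a throttled prefix (e.g. events)
// should be rate limited because db usage crossed the configured threshold.
func (w *dbQuotaWatcher) shouldThrottle(thresholdPercent float64) bool {
	p, ok := w.percent.Load().(float64)
	return ok && p >= thresholdPercent
}
```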
If it is implemented at the proxy layer, will you use it? What is your current etcd deployment based on? @brandon-miles
It is hard for the server side to get the db size under one prefix too, I think.
It's a common practice to have a separate monitoring system that, if needed, alerts or activates a rule using a dynamic push of configuration to a service (that needs to be high priority). Hard-coding this within the server's logic sounds too custom.
Have you tried k8s ResourceQuotas? Feedback on where you feel that fell short, if you've already tried it, could be a productive conversation. I'd also be curious to understand what you think could be solved by QoS on the etcd side. It's not obvious to me that a killer feature exists. I glanced through the design once more and I don't see anything that changed my mind.
@jpbetz Yes, we use resource quotas, but to my knowledge they don't track resource size, just count. It's also tricky to estimate average resource size as well, since it can vary based on the cluster's intended usage. Operationally, having a switch that rate limits requests once a certain DB size is crossed would be a great safety net for us. I know there are a number of areas where you can mitigate this with K8s (priority and fairness etc.), but I think it would also be good for etcd to not rely on K8s to always behave and do the right thing.
Rate limiting is also a major liability. Adding code that delays or drops requests is an inherently risky thing to do. The list of issues I've been personally involved in where a rate limiter caused a production issue is quite long. Rather than adding unproven code to etcd to try to solve this problem, I think it would be much more helpful to start by first demonstrating an actual success case, and ideally put some real production mileage on it. We don't need to merge code into etcd to do that, and honestly, I think it would be much faster to iterate and improve a solution outside of the etcd code base. I personally would be very interested in the learnings and outcomes of such an experiment, and I think the rest of the etcd community would be too.
I agree with what jpbetz said.
@jpbetz Sounds good. We'll report back once we have some mileage on our internal implementation.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
I strongly believe the community would hugely benefit from a built-in etcd load shedding mechanism. There are just too many unknowns and variables. We at AWS experienced the same need for gracefully handling load spikes. We may also require client-side behavior changes. For instance, what's the kube-apiserver behavior when it gets "too many requests" errors? Does kube-apiserver have back-off logic? Have we thought about enforcing kube-apiserver to do paginated calls, and for other controllers as well? As far as I know, large range queries are the most demanding workloads in Kubernetes land. So, I think we will need to revisit this in the near future. Please don't misunderstand our push-backs. I think this is the best POC we have got so far.
/remove-stale
Just to follow up, we are currently rolling out a rate limiter to production based on a modified version of @vivekpatani's PR (#12289). Initial tests have been encouraging, and we now have some defense against bad Kube behaviors (large range queries etc). Will have a full writeup on our findings once we have some mileage on this in production.
@brandon-miles, do you have any update on this?
@JohnRusk we've been using @vivekpatani's PR in production for over 2 years now without issues, and I believe it has greatly improved etcd stability in our large clusters. We currently rate limit only 2 keys for our Kube clusters, which are essentially full pod or event listings. Since these are very large clusters, just a handful of full pod listings per second was enough to OOM our etcd nodes, hence our desire to guard against this. While Kube has made some great progress with API Priority and Fairness, I still believe etcd can and should guard itself, and that rate limiting is an important feature that should be integrated on the server side.
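For readers unfamiliar with the approach, the idea is roughly a gRPC unary interceptor that rate limits Range requests whose keys fall under a few hot prefixes. The sketch below is illustrative only and is not the code from #12289; the prefixes, limits, and import paths (v3.5+ layout) are assumptions:

```go
// Minimal sketch (not the code from #12289) of rate limiting Range requests on
// specific hot key prefixes via a gRPC unary server interceptor.
package qos

import (
	"context"
	"strings"

	pb "go.etcd.io/etcd/api/v3/etcdserverpb"
	"go.etcd.io/etcd/api/v3/v3rpc/rpctypes"
	"golang.org/x/time/rate"
	"google.golang.org/grpc"
)

// Illustrative: limit full pod and event listings to a few per second.
var hotPrefixLimiters = map[string]*rate.Limiter{
	"/registry/pods/":   rate.NewLimiter(rate.Limit(5), 5),
	"/registry/events/": rate.NewLimiter(rate.Limit(5), 5),
}

func rangeRateLimitInterceptor(
	ctx context.Context,
	req interface{},
	info *grpc.UnaryServerInfo,
	handler grpc.UnaryHandler,
) (interface{}, error) {
	if r, ok := req.(*pb.RangeRequest); ok {
		key := string(r.Key)
		for prefix, lim := range hotPrefixLimiters {
			if strings.HasPrefix(key, prefix) && !lim.Allow() {
				// Reject instead of queueing so callers back off.
				return nil, rpctypes.ErrGRPCTooManyRequests
			}
		}
	}
	return handler(ctx, req)
}
```

Such an interceptor would be chained into the server's gRPC options alongside the existing interceptors; how it was actually wired in #12289 may differ.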
That sounds good. @vivekpatani and @brandon-miles, do you know anything about what the etcd team is thinking about this, e.g. are there plans for anything like this to be officially merged to master?
The PR link?
For Kubernetes
Draft Proposal
Please see the latest draft proposal: etcd QoS Feature Design.
Background
As the metadata store for Kubernetes, etcd's stability is extremely important. However, etcd currently consumes a lot of CPU, memory, and bandwidth when facing highly concurrent, expensive read/write requests, resulting in high service load and even OOM. Common expensive requests include full keyspace fetches, large numbers of event queries (list event), large numbers of pod queries (list pods), and large numbers of configmap writes.
At the same time, etcd currently has only an extremely simple rate-limit protection: when the gap between etcd's committed index and applied index exceeds 5000, it rejects all requests and returns the error ErrTooManyRequests. Its shortcomings are obvious: it cannot precisely limit the rate of expensive read/write requests, and it cannot effectively prevent the cluster from being overloaded.
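For reference, the existing protection amounts to a check like the following (a simplified paraphrase of the logic, not the actual etcd source):

```go
// Simplified paraphrase of the existing protection: if the commit/apply gap
// grows past a fixed threshold, requests are rejected outright.
package qos

import "errors"

const maxGapBetweenApplyAndCommitIndex = 5000

var ErrTooManyRequests = errors.New("etcdserver: too many requests")

func checkApplyGap(committedIndex, appliedIndex uint64) error {
	if committedIndex > appliedIndex &&
		committedIndex-appliedIndex > maxGapBetweenApplyAndCommitIndex {
		return ErrTooManyRequests
	}
	return nil
}
```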
Therefore, I hope to implement a QoS feature that allows users to set a maximum QPS for different paths and gRPC methods, and that can automatically identify expensive read/write requests and limit their rate, to prevent etcd overload and further improve etcd cluster stability.
Goals
● Support configuring QoS rules based on multiple object types (such as grpcMethod, grpcMethod + request key prefix path, traffic, cpu-intensive/RangeKeyNum, latency).
● Support multiple rate limiters for different scenarios (such as token bucket filter, maxInflightRequest, leaky bucket filter), with good scalability.
● Different QoS rules support configuring different rate limiters and priorities. If a request matches multiple rules, the rule with the highest priority is selected (see the rule sketch after this list).
● The memory consumption of the QoS feature is controlled and does not significantly reduce server performance.
● Users can customize rules simply through conditional expressions (for example, when dbUsedBytes is about to exceed the db quota bytes, you can limit the rate of Kubernetes event writes).
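To make the goals above concrete, a QoS rule might carry an object type, a rate limiter, a priority, and an optional condition expression. The following is a hypothetical sketch; the field names and matching logic are illustrative, not the proposed API:

```go
// Hypothetical sketch of a QoS rule shape implied by the goals above.
package qos

import "sort"

type RuleType string

const (
	RuleGRPCMethod       RuleType = "gRPCMethod"
	RuleGRPCMethodPrefix RuleType = "gRPCMethod+prefix"
	RuleTraffic          RuleType = "traffic"
	RuleSlowLog          RuleType = "slowlog"
)

type Rule struct {
	Type      RuleType
	Subject   string  // e.g. "Range" or "Range /registry/events/"
	Limiter   string  // e.g. "TokenBucket", "MaxInflightRequest", "LeakyBucket"
	QPS       float64
	Priority  int     // higher wins when several rules match a request
	Condition string  // optional expression, e.g. "dbUsedBytes > dbQuotaBytes*0.95"
}

// pick returns the highest-priority rule among those that matched a request.
func pick(matched []Rule) (Rule, bool) {
	if len(matched) == 0 {
		return Rule{}, false
	}
	sort.Slice(matched, func(i, j int) bool {
		return matched[i].Priority > matched[j].Priority
	})
	return matched[0], true
}
```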
POC Implementation/Feature
● Implement the QoS API; rules can be adjusted dynamically.
● Support multiple rate limiters and rule types.
● Users can customize rules simply through conditional expressions.
Architecture/Flow
Demo
● add qos rule
● enable qos feature
● etcd benchmark: ./bin/tools/benchmark range /prefix --rate 200 --total 3000
POC Goal
The purpose of the POC is to confirm whether the community agrees to add this feature and how to implement it. If the community agrees with this feature and roughly agrees with the implementation in this POC, I will continue to improve it, open a detailed task-list issue, split it into small PRs, and add documentation, comments, tests, etc. Anyone interested in this feature can join the development.