Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize RangeRequest Revision count #16510

Open
mehvaibh opened this issue Aug 30, 2023 · 13 comments
Open

Optimize RangeRequest Revision count #16510

mehvaibh opened this issue Aug 30, 2023 · 13 comments

Comments

@mehvaibh
Copy link
Contributor

mehvaibh commented Aug 30, 2023

What would you like to be added?

Present Behavior

Currently ETCD iterates on all matching keys in MVCC for Range query to calculate count in RangeResponse
Although it returns limited number of revisions from key(included) to end(excluded) at the given rev but iterates on all revisions to calculate the total number of revisions.

Change Requested

New no-count mode in RangeRequest structure.
By default, it will be false to not break current ETCD RangeResponse.Count contract and keep backward compatibility.

When explicitly set in RangeRequest , then ETCD will respect limit and break without iterating over all remaining revisions and returns fake Count in RangeResponse which client/consumers should ignore.

This is motivated by some previous PR and discussions, in which limit enforcement has to be reverted because of breaking ETCD contract and thus breaking K8s remainingItemCount feature.
Previous PR references:

Why is this needed?

Performance Improvement for Paginated Range query.

As with no-count feature, ETCD does not need to count all revisions to calculate Count for RangeResponse.
This means retrieving X objects time is not dependent on total number of matching objects in ETCD.

Pasting some results from my experiment.
Here :

  • Non-Optimized server refer to ETCD build from main branch
  • Optimized server just have one change ,i.e. break revisions loop when limit is reached.

Etcd InstanceType: m5.4xlarge

Test cmd: Get first 2 game-config objects from large number of game-configs present in ETCD (Column 2 in below results)

time etcdctl get --prefix /registry/configmaps/default/game-config --keys-only --limit 2

SNO # ConfigMap Non optimized server(ms) Optimized server(ms) % gain (total time) Time only for Revisions() Non-Optimized Time only for Revisions() Optimized
1 10,000 20 18 10 600µs 6.348µs
2 50,000 23 18 21.73913 7.703931ms 6.348µs
3 100,000 32 18 43.75 16.22648ms 6.348µs
4 500,000 200 18 91 171.659658ms 6.348µs

Time only for Revisions() columns in above measures time only for this code block

Please let me know if this change request look good then I can prepare PR for code changes.

@mehvaibh
Copy link
Contributor Author

would like your opinion on this
@ahrtr @serathius @chaochn47 @geetasg

@serathius
Copy link
Member

Kubernetes depends on count, so doesn't look like this feature will get much use.

@ahrtr
Copy link
Member

ahrtr commented Aug 31, 2023

Making the condition being false (e.g. no sorting, etc.) can get the same better performance? @mehvaibh

@mehvaibh
Copy link
Contributor Author

Kubernetes depends on count, so doesn't look like this feature will get much use.

Yes, API Server uses count for calculating remainingItemCount.

If we implement this feature, API Server can also take advantage of this.
I have a separate proposal for that which I will start discussion with API Server.

But here, I wanted to check feasibility from ETCD POV.

@mehvaibh
Copy link
Contributor Author

Making the condition being false (e.g. no sorting, etc.) can get the same better performance?

Not sure I understand this correctly but seems like revisions will loop on to calculate total even if limit is less than total. Pls correct me if I am wrong.

@ahrtr
Copy link
Member

ahrtr commented Sep 1, 2023

In your example, you set the limit to 2. If you change it to a larger limit (e.g. 1K), I think the performance difference may be negligible (can you double check that?). In large range query, the bottleneck should be reading data from the storage engine instead of counting the total number in memory.

Adding the no-count feature only makes sense for the extreme use case (e.g. get only 2 records from a large number of matched records e.g. 500K). Also note that it's only valid when no sort is needed.

@serathius
Copy link
Member

I think Kubernetes would benefit much more from serving pagination from watch cache instead of trying to optimise patination on etcd side, but I'm biased by my own work https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/2340-Consistent-reads-from-cache.

cc @mborsz

@mehvaibh
Copy link
Contributor Author

mehvaibh commented Sep 6, 2023

time ./etcdctl get --prefix /registry/configmaps/default  --limit 2000 > /dev/null
SNO # ConfigMap Non optimized server(ms) Optimized server(ms) % gain (total time)
1 50,000 30 21 30
2 100,000 40 21 47.5

@ahrtr I tried with 2K page limit. The optimized server(respects limit) latency remains the same while non-optimized server(current ETCD) latencies increases with number of objects.

Also note that it's only valid when no sort is needed.

Agree. Sort range queries overrides the limit and thus all objects needs to be retrieved regardless.

This optimization can benefit no sort paginated range request.

@mehvaibh
Copy link
Contributor Author

mehvaibh commented Sep 6, 2023

I think Kubernetes would benefit much more from serving pagination from watch cache instead of trying to optimise patination on etcd side, but I'm biased by my own work

@serathius Thanks for the update!

I think this optimization can also help reduce paginated no sort range queries latencies specially when number of objects are high. As ETCD needs to spend relatively less time completing paginated no sort range transactions it will be helpful in large DB with high number of objects count and under significant load (i.e. varying mix of read/write transactions).
Please correct me if my understanding is wrong.

@ahrtr
Copy link
Member

ahrtr commented Sep 7, 2023

@ahrtr I tried with 2K page limit. The optimized server(respects limit) latency remains the same while non-optimized server(current ETCD) latencies increases with number of objects.

thx for the feedback. I am not against it as long as the corresponding K8s's proposal can be accepted [in other words, as long as there are valid use cases].

@fuweid
Copy link
Member

fuweid commented Sep 7, 2023

This proposal looks like continue field in the kubernetes API or the more field in the Range response.
And the apiserver only uses the more field for pagination but the GA feature RemainingItemCount relies on this Count.

https://github.com/kubernetes/apiserver/blob/ea59eb34554704ce1e2336717c668f16f763bc65/pkg/storage/etcd3/store.go#L732

Most of times, the client should get result from apiserver's cache. But the client can still force query to etcd cluster since the kubernetes allows user to do that. I think it shoud be good if we can prevent burst cpu usage from comparing all keyspaces.

@serathius
Copy link
Member

And the apiserver only uses the more field for pagination but the GA feature RemainingItemCount relies on this Count.

This ^, Kubernetes will not break GA feature. Implementing this on etcd is worthless until you make proposal that is accepted on Kubernetes side.

But the client can still force query to etcd cluster since the kubernetes allows user to do that

Not true, watch cache is just not used for all types of requests. KEP https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/2340-Consistent-reads-from-cache is about broadening types of requests that are served from apiserver. End goal is to serve all requests from cache.

@siyuanfoundation
Copy link
Contributor

This feature can be useful to check the health of serializable read: use the first key to check reading from bolt db is ok. Existing check uses Range(non-exist key), which does not hit the db file. More context in #16007

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

6 participants