
kv: dist sender sends too many RPCs #34999

Closed
jordanlewis opened this issue Feb 15, 2019 · 6 comments
Labels: A-admission-control, C-performance (Perf of queries or internals. Solution not expected to change functional behavior.), no-issue-activity, X-stale

Comments

@jordanlewis
Member

The dist sender sends a batch to many ranges at once by chunking the batch into one sub-batch per range and making a potentially asynchronous RPC (or a local call, if the range lives on the local node) for each range.

Consider a case where there are 10 ranges per node and a batch contains requests for all 10 ranges on a remote node. The dist sender will then make 10 asynchronous network RPCs, all to the same node.

This seems inefficient. If the dist sender instead chunked up the batch by node rather than by range and sent a single RPC per node, we'd have far fewer in-flight RPCs. The receiving node could then be in charge of further concurrency if it chooses. At the moment, the dist sender is the arbiter of concurrency, even for remote nodes, whose load it knows nothing about!
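Roughly what I have in mind, as a sketch (the types and the leaseholder lookup here are made up for illustration and aren't the real DistSender APIs): bucket the sub-batches by the node that holds each range's leaseholder and issue one RPC per node.

```go
package main

import "fmt"

// Hypothetical stand-ins for illustration only; not the real DistSender types.
type NodeID int
type RangeID int

type Request struct {
	Range RangeID
	Key   string
}

// leaseholder maps a range to the node we would send its requests to.
// In reality this would come from the range descriptor / lease cache.
func leaseholder(r RangeID) NodeID {
	return NodeID(int(r) % 3) // placeholder topology: 3 nodes
}

// groupByNode buckets requests by destination node instead of by range,
// so the sender issues one RPC per node rather than one per range.
func groupByNode(reqs []Request) map[NodeID][]Request {
	byNode := make(map[NodeID][]Request)
	for _, req := range reqs {
		n := leaseholder(req.Range)
		byNode[n] = append(byNode[n], req)
	}
	return byNode
}

func main() {
	var reqs []Request
	for r := RangeID(0); r < 10; r++ {
		reqs = append(reqs, Request{Range: r, Key: fmt.Sprintf("k%d", r)})
	}
	for node, batch := range groupByNode(reqs) {
		// One RPC per node carrying all of that node's sub-batches.
		fmt.Printf("node %d: %d requests in a single RPC\n", node, len(batch))
	}
}
```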

I'm assigning @nvanbenschoten for initial thoughts.

@jordanlewis added the C-performance and A-admission-control labels Feb 15, 2019
@jordanlewis
Member Author

@ajwerner might also be interested in this, as it has to do with admission control. We shouldn't make a ton of concurrent RPCs to a remote node when we don't know how loaded that node is. The remote node should have much more control over what it does with a set of incoming requests.

@ajwerner
Contributor

This seems like a valuable and straightforward change. The distsender already has all the information it needs to split up the batch this way. I'd be happy to take a stab at it. Do you have good workloads in mind that hit this limit?

@jordanlewis
Member Author

All workloads that use index or lookup joins on tables with reasonable numbers of ranges will hit this limit.

@ajwerner
Contributor

After some reflection, it's not clear that the concurrency itself has a huge overhead. The big issue is that hitting the concurrency limit is extremely expensive from a latency perspective (especially in a geo-distributed setting 😱). Sure, we're trusting gRPC to be efficient at dealing with lots of requests and we're spawning lots of goroutines on the sender, but it's not clear what that costs in practice. Have you tried bumping defaultSenderConcurrency for the workloads you've been running and measuring its impact?
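For intuition on where that latency comes from, here's a toy sketch (not the actual DistSender code; the constant here just plays the role of defaultSenderConcurrency): with a semaphore-bounded async sender, sub-batches beyond the limit wait for a slot, so a batch that fans out to more ranges than the limit pays extra round trips.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Illustrative only: a fixed sender-side concurrency limit.
const senderConcurrency = 4

// sendAll dispatches one "RPC" per sub-batch, but never more than
// senderConcurrency at a time. Sub-batches beyond the limit wait for a
// slot, which is where the extra latency comes from.
func sendAll(numSubBatches int, rtt time.Duration) time.Duration {
	start := time.Now()
	sem := make(chan struct{}, senderConcurrency)
	var wg sync.WaitGroup
	for i := 0; i < numSubBatches; i++ {
		wg.Add(1)
		sem <- struct{}{} // blocks once the limit is reached
		go func() {
			defer wg.Done()
			time.Sleep(rtt) // stand-in for a network round trip
			<-sem
		}()
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	rtt := 50 * time.Millisecond
	// 4 sub-batches fit under the limit: roughly 1 RTT total.
	fmt.Println("within limit:", sendAll(4, rtt))
	// 10 sub-batches need ceil(10/4) = 3 waves: roughly 3 RTTs total.
	fmt.Println("over limit:  ", sendAll(10, rtt))
}
```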

@ajwerner
Contributor

I'm still going to try to type up the experiment, but I worry it will have some unintended consequences and will further muddle some already pretty gnarly and complex code.

@github-actions

github-actions bot commented Jun 5, 2021

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
5 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!
