Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

multitenant: demo startup performance is poor with multiple tenants and simulated latencies #76305

Closed
ajstorm opened this issue Feb 9, 2022 · 3 comments · Fixed by #92231
Closed
Labels
A-multiregion Related to multi-region A-multitenancy Related to multi-tenancy C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

@ajstorm
Copy link
Collaborator

ajstorm commented Feb 9, 2022

cockroach demo --global --nodes 9 --empty takes several minutes to startup. When multi-tenancy is disabled, it takes only a few seconds. It's not clear what's causing this poor performance, but a quick peek into the logs shows that each tenant takes between 20 and 60 seconds to start.

I220209 13:16:17.066637 1 cli/democluster/demo_cluster.go:389 ⋮ [start-demo-cluster,phase=8] 2880  starting tenant nodes
I220209 13:16:17.066657 1 cli/democluster/demo_cluster.go:394 ⋮ [start-demo-cluster,phase=8] 2881  starting tenant node 0
I220209 13:16:38.293483 1 cli/democluster/demo_cluster.go:394 ⋮ [start-demo-cluster,phase=8,nsql1] 6010  starting tenant node 1
I220209 13:16:57.371877 1 cli/democluster/demo_cluster.go:394 ⋮ [start-demo-cluster,phase=8,nsql1] 8435  starting tenant node 2
I220209 13:17:15.858256 1 cli/democluster/demo_cluster.go:394 ⋮ [start-demo-cluster,phase=8,nsql1] 10585  starting tenant node 3
I220209 13:18:01.189788 1 cli/democluster/demo_cluster.go:394 ⋮ [start-demo-cluster,phase=8,nsql1] 13284  starting tenant node 4
I220209 13:18:52.356343 1 cli/democluster/demo_cluster.go:394 ⋮ [start-demo-cluster,phase=8,nsql1] 15015  starting tenant node 5
I220209 13:19:31.760119 1 cli/democluster/demo_cluster.go:394 ⋮ [start-demo-cluster,phase=8,nsql1] 17341  starting tenant node 6
I220209 13:20:20.155604 1 cli/democluster/demo_cluster.go:394 ⋮ [start-demo-cluster,phase=8,nsql1] 20181  starting tenant node 7
I220209 13:21:19.373402 1 cli/democluster/demo_cluster.go:394 ⋮ [start-demo-cluster,phase=8,nsql1] 23542  starting tenant node 8

More investigation is required.

Epic: CRDB-18596

Jira issue: CRDB-13070

@ajstorm ajstorm added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-multitenancy Related to multi-tenancy A-multiregion Related to multi-region labels Feb 9, 2022
@ajstorm
Copy link
Collaborator Author

ajstorm commented Feb 11, 2022

After doing a bit more digging, and chatting with @ajwerner, this is expected. At least one reason why we're seeing poor performance here is because we're performing startup migration for each of the new SQL servers that are being created. These startup migrations block the SQL server from starting up until they're completed. Additionally, it's a known issue (#73813) that these startup migrations are expensive, especially in cases where the cluster is widely distributed.

This is an issue that we'll likely need to address before we can enable multi-region Serverless.

@andy-kimball @vy-ton FYI.

@vy-ton
Copy link
Contributor

vy-ton commented Feb 15, 2022

What do these startup migrations do? And why do they need to run on a brand new cluster?

@ajwerner
Copy link
Contributor

@vy-ton see #73813

ajwerner added a commit to ajwerner/cockroach that referenced this issue Nov 21, 2022
Cluster creation and tenant setup is chatty. That's an okay thing: we don't
really care about cluster creation being that slow in general. In the case of
demo when we want to simulate latency and use tenants, it was particularly
painful. Starting the 9 tenants would take many minutes. This patch alleviates
this problem by keeping latency between the simulated nodes low until just
before we pass control to the user.

Fixes cockroachdb#76305

Release note (cli change): `cockroach demo --global` will now start up more
quickly. The latency which will be injected is not injected until after the
initial cluster is set up internally.
craig bot pushed a commit that referenced this issue Nov 23, 2022
91462: kvserver: Fix performance regression due to new call to collectSpansRead r=KaiSun314 a=KaiSun314

Fixes: #91374
Fixes: #91723

When we incorporated the use of response data in the load-based splitter, we called collectSpansRead, which is allocation heavy and computationally expensive, resulting in a performance regression.

To address this, this patch performs 3 optimizations:
1. Remove the call to collectSpansRead; instead, add a custom function to iterate over the batch of requests / responses and calculate the true spans
2. Instead of constructing a *spanset.SpanSet and finding the union of spans (which uses O(batch_size) memory), we directly compute the union of spans while iterating over the batch resulting in only O(1) memory used
3. Lazily compute the union of true spans only when it is truly needed i.e. we are under heavy load (e.g. >2500QPS) and a load-based splitter has been initialized

Cherry-picking this commit to the commit right before we incorporated response data in the load-based splitter (068845f) and running
```
~/benchdiff/benchdiff --old=068845ff72315f8b64f0e930c17c48f078203bc4 --new=abf61ce75c47e16bc39ed0e714f2e46f1d97eb7c --count=20 --post-checkout='./dev generate go' --run='KV/././rows=1$$' ./pkg/sql/tests
```
the output is:
<img width="909" alt="More Efficient Response Data Benchdiff" src="https://user-images.githubusercontent.com/28762332/200678800-d75240f9-a275-40cf-85f2-201aadca0355.png">

Release note: None

92231: cli,democluster: defer simulated latency until after cluster setup r=ajwerner a=ajwerner

### democluster,serverutils/regionlatency,rpc: extract code for simulating latency

We'll want to leverage these helpers in some tests to measure behavior under
simulated latency.

### cli,democluster: defer simulated latency until after cluster setup

Cluster creation and tenant setup is chatty. That's an okay thing: we don't
really care about cluster creation being that slow in general. In the case of
demo when we want to simulate latency and use tenants, it was particularly
painful. Starting the 9 tenants would take many minutes. This patch alleviates
this problem by keeping latency between the simulated nodes low until just
before we pass control to the user.

Fixes #76305

Release note (cli change): `cockroach demo --global` will now start up more
quickly. The latency which will be injected is not injected until after the
initial cluster is set up internally.

Co-authored-by: Kai Sun <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
@craig craig bot closed this as completed in 137b27b Nov 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-multiregion Related to multi-region A-multitenancy Related to multi-tenancy C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants