server: option to wait for default_target_cluster on connect #113687
Conversation
I recommend going commit-by-commit during the review since none of the steps are too complicated even though the overall diff ended up a bit large.
Pretty sure that the test failed because we are slow enough during stress that the 10s timeout isn't enough. I'll dig into this but I don't expect it to substantially change the shape of this.
Seems good to me.
Reviewed 3 of 3 files at r1, 3 of 3 files at r2, 3 of 3 files at r3, 8 of 8 files at r4, 3 of 3 files at r5, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @msbutler, and @stevendanna)
pkg/multitenant/tenant_config.go
line 46 at r2 (raw file):
) // WaitForClusterStart, if enabled, instructs the tenant controller to
What's the benefit of having two settings rather than a single duration setting in which 0 means "disabled"? Also, why not enable it by default?
pkg/multitenant/tenant_config.go
line 47 at r2 (raw file):
// WaitForClusterStart, if enabled, instructs the tenant controller to
// wait up to WaitForClusterStartTimeout for the defuault virtual
nit: s/defuault/default/
pkg/multitenant/tenant_config.go
line 56 at r2 (raw file):
) // WaitForClusterStartTimeout is the amoutn of time the the tenant
nit: s/amoutn/amount/ and s/the the/the/
pkg/server/server_controller_accessors.go
line 21 at r4 (raw file):
// getServer retrieves a reference to the current server for the given
// tenant name.
nit: mention what the returned channel is about.
A fun exploration of server code!
sqlRunner.Exec(t, "CREATE TENANT hello") | ||
sqlRunner.Exec(t, "ALTER VIRTUAL CLUSTER hello START SERVICE SHARED") | ||
sqlRunner.Exec(t, "SET CLUSTER SETTING server.controller.default_target_cluster = 'hello'") | ||
require.NoError(t, tryConnect()) |
I assume this test would occasionally pass without your patch, right?
Yes, but only rarely.
if err == nil {
break
return s, nil
}
we should only wait if the err matches "server for tenant %q not ready", correct? If so, I feel like this error string should have its own mark.
No, we want to retry on both errors. Both the existence of the tenant and the startup is async from the perspective of the server controller because we learn about tenants via a rangefeed.
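To make the retry behavior concrete, here is a minimal sketch, not the PR's code: getServer, the change channel, and the sqlServer type are stand-ins, and a stdlib timer substitutes for the timeutil timer used in the diff.

import (
	"context"
	"time"
)

type sqlServer struct{} // placeholder for the real tenant SQL server handle

// waitForTenantServer retries getServer until it succeeds, the timeout
// elapses, or the caller gives up. Both "tenant not found" and "tenant not
// ready" are retried, since the controller learns about tenants
// asynchronously and either condition can resolve a moment later.
func waitForTenantServer(
	ctx context.Context,
	name string,
	timeout time.Duration,
	getServer func(name string) (*sqlServer, <-chan struct{}, error),
) (*sqlServer, error) {
	deadline := time.NewTimer(timeout)
	defer deadline.Stop()
	for {
		s, changed, err := getServer(name)
		if err == nil {
			return s, nil
		}
		select {
		case <-changed: // controller state changed; try again
		case <-deadline.C:
			return nil, err // report the last error once the budget is spent
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
}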
t := timeutil.NewTimer()
defer t.Stop()
t.Reset(multitenant.WaitForClusterStartTimeout.Get(&c.st.SV))
for {
naive question: did you consider using tenantWaiter.DoChan() to orchestrate this dance instead? At a high level, its purpose seems to match the use case here, but there may be details I'm missing.
I agree that it seems a bit better. Specifically so that cancellation on one inbound connection doesn't affect other connections in the same flight.
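For reference, a rough sketch of the DoChan pattern, written against golang.org/x/sync/singleflight for illustration; CockroachDB carries its own singleflight package, and the key scheme and wait function here are invented.

import (
	"context"

	"golang.org/x/sync/singleflight"
)

var tenantWaiter singleflight.Group

// waitShared deduplicates concurrent waits for the same tenant name. The
// shared wait() keeps running even if this particular caller's context is
// canceled, so other connections in the same flight still get its result.
func waitShared(
	ctx context.Context, name string, wait func() (interface{}, error),
) (interface{}, error) {
	resC := tenantWaiter.DoChan(name, wait)
	select {
	case res := <-resC:
		return res.Val, res.Err
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

The benefit mentioned above falls out of this shape: cancelling one caller's ctx abandons only that caller's receive, while the shared wait continues for the rest of the flight.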
Force-pushed from e42a77c to e981f23.
Thanks for the reviews. Should be ready for another look.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @msbutler, and @yuzefovich)
pkg/multitenant/tenant_config.go
line 46 at r2 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
What's the benefit of having two settings rather than a single duration setting in which 0 means "disabled"? Also, why not enable it by default?
I agree, one option would be better. Done. I've changed it to enabled by default, but depending on how long it takes me to get this in perhaps we backport it disabled.
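As an aside, a minimal sketch of the single-setting shape being discussed, assuming the pkg/settings API of this era; the setting name, default, and helper below are hypothetical and are not the actual change.

import (
	"time"

	"github.com/cockroachdb/cockroach/pkg/settings"
)

// Hypothetical single duration setting; a zero value disables waiting.
var defaultClusterWaitTimeout = settings.RegisterDurationSetting(
	settings.SystemOnly,
	"server.controller.default_target_cluster.wait_timeout", // invented name
	"how long new connections wait for the default virtual cluster to have an "+
		"active SQL server; 0 disables waiting",
	10*time.Second, // invented default
)

// waitBudget returns the configured wait and whether waiting is enabled.
func waitBudget(sv *settings.Values) (time.Duration, bool) {
	d := defaultClusterWaitTimeout.Get(sv)
	return d, d > 0
}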
pkg/multitenant/tenant_config.go
line 47 at r2 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: s/defuault/default/
Thanks. Done.
pkg/multitenant/tenant_config.go
line 56 at r2 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: s/amoutn/amount/ and s/the the/the/
Thanks. Done.
Force-pushed from e981f23 to 7112668.
Reviewed 14 of 14 files at r6, 3 of 3 files at r7, 3 of 3 files at r8, 8 of 8 files at r9, 3 of 3 files at r10, 3 of 3 files at r11, 1 of 1 files at r12, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @abarganier, @msbutler, and @stevendanna)
pkg/multitenant/tenant_config.go
line 48 at r11 (raw file):
// WaitForClusterStartTimeout is the amount of time the tenant
// controller will wait for the default virtual cluster to have an
// active SQL server, if WaitForClusterStart is true.
nit: WaitForClusterStart is no more.
Virtual cluster startup is asynchronous. As a result, it can be hard to orchestrate bringing a virtual cluster up since the client needs to retry until the cluster is available.

This makes it a little easier by optionally allowing connections to the default_target_cluster to be held until the cluster is online. As a result, a user who does:

ALTER VIRTUAL CLUSTER app START SERVICE SHARED
SET CLUSTER SETTING server.controller.default_target_cluster = 'app'

Should be able to immediately connect to the same node without having to handle retries while the virtual cluster starts up.

Note that in future PRs we can extend this to wait for _any_ tenant. The restriction on only waiting on the default_target_cluster is that we need to thread a little state into the server controller in order to avoid pitfalls such as holding a lot of connections open for longer than needed for a tenant that will never become available because it doesn't exist.

Informs cockroachdb#111637

Release note: None
Using singleflight avoids having a large number of connections polling for the tenant to start up.

Release note: none
Rather than polling, this uses a channel to notify us when the set of tenants has changed or when the tenant's state has changed.

Release note: None
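A minimal sketch of the notify-on-change idea (invented names, not the actual server controller code): the owner of the map closes and replaces a channel on every mutation, so waiters can select on it instead of polling.

import "sync"

type sqlServer struct{} // placeholder

type serverSet struct {
	mu      sync.Mutex
	servers map[string]*sqlServer
	changed chan struct{} // closed and replaced on every mutation
}

func newServerSet() *serverSet {
	return &serverSet{
		servers: make(map[string]*sqlServer),
		changed: make(chan struct{}),
	}
}

// get returns the named server (if any) plus a channel that is closed the
// next time the set changes, letting callers wait for updates without polling.
func (s *serverSet) get(name string) (*sqlServer, <-chan struct{}) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.servers[name], s.changed
}

func (s *serverSet) put(name string, srv *sqlServer) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.servers[name] = srv
	close(s.changed)                // wake anyone waiting for a change
	s.changed = make(chan struct{}) // arm for the next change
}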
We call getServer on every new connection. It seems prudent that we wouldn't want concurrent connections to contend over a mutex when most of the time nothing is writing to the data protected by this mutex.

Release note: None
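For illustration only, one common read-mostly pattern that avoids reader locking entirely; this is not necessarily what the PR does. Readers perform a single atomic load of an immutable map, and the rare writer copies, modifies, and swaps (atomic.Pointer requires Go 1.19+).

import "sync/atomic"

type sqlServer struct{} // placeholder

type serverMap map[string]*sqlServer

type servers struct {
	cur atomic.Pointer[serverMap]
}

// get is a single atomic load; concurrent connections never contend.
func (s *servers) get(name string) (*sqlServer, bool) {
	m := s.cur.Load()
	if m == nil {
		return nil, false
	}
	srv, ok := (*m)[name]
	return srv, ok
}

// put copies the current map, applies the change, and swaps it in. Writes
// are rare here, so the copy cost is acceptable.
func (s *servers) put(name string, srv *sqlServer) {
	for {
		old := s.cur.Load()
		next := make(serverMap)
		if old != nil {
			for k, v := range *old {
				next[k] = v
			}
		}
		next[name] = srv
		if s.cur.CompareAndSwap(old, &next) {
			return
		}
	}
}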
Force-pushed from 7112668 to 6c3e892.
We can use a zero-wait time to indicate no waiting, saving a cluster setting.

Epic: none

Release note: None
Release note: None
Force-pushed from 6c3e892 to 9a7c8b3.
bors r=yuzefovich
Build succeeded:
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.

error setting reviewers, but backport branch blathers/backport-release-23.2-113687 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/113933/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository. []

Backport to branch 23.2.x failed. See errors above.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Virtual cluster startup is asynchronous. As a result, it can be hard to
orchestrate bringing a virtual cluster up since the client needs to
retry until the cluster is available.
This makes it a little easier by optionally allowing connections to
the default_target_cluster to be held until the cluster is online.
As a result, a user who does:

ALTER VIRTUAL CLUSTER app START SERVICE SHARED
SET CLUSTER SETTING server.controller.default_target_cluster = 'app'
Should be able to immediately connect to the same node without having
to handle retries while the virtual cluster starts up.
Note that we can extend this to wait for any
tenant. The restriction on only waiting on the default_target_cluster
is that we need to thread a little state into the server controller in
order to avoid pitfalls such as holding a lot of connections open for
longer than needed for a tenant that will never become available
because it doesn't exist.
In a future PR, we will set this option by default in the replication-source and replication-target
configuration profiles.
Built on #113666
Informs #111637
Release note: None