server: option to wait for default_target_cluster on connect #113687

stevendanna · 2023-11-02T17:20:48Z

Virtual cluster startup is asynchronous. As a result, it can be hard to
orchestrate bringing a virtual cluster up since the client needs to
retry until the cluster is available.

This makes it a little easier by optionally allowing connections to
the default_target_cluster to be held until the cluster is online.

As a result, a user who does:

ALTER VIRTUAL CLUSTER app START SERVICE SHARED
SET CLUSTER SETTING server.controller.default_target_cluster = 'app'

should be able to connect to the same node immediately, without having
to handle retries while the virtual cluster starts up.

Note that in future PRs we can extend this to wait for any tenant.
For now we only wait on the default_target_cluster, because waiting
on arbitrary tenants requires threading a little more state into the
server controller to avoid pitfalls such as holding many connections
open longer than needed for a tenant that will never become available
because it doesn't exist.

In a future PR, we will set this option by default in the replication-source and replication-target
configuration profiles.

Built on #113666

Informs #111637

Release note: None

cockroach-teamcity · 2023-11-02T17:21:05Z

This change is

stevendanna · 2023-11-02T17:21:52Z

I recommend going commit-by-commit during the review since none of the steps are too complicated even though the overall diff ended up a bit large.

stevendanna · 2023-11-02T17:45:07Z

Pretty sure that the test failed because we are slow enough during stress that the 10s timeout isn't enough. I'll dig into this but I don't expect it to substantially change the shape of this.

yuzefovich

Seems good to me.

Reviewed 3 of 3 files at r1, 3 of 3 files at r2, 3 of 3 files at r3, 8 of 8 files at r4, 3 of 3 files at r5, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @msbutler, and @stevendanna)

pkg/multitenant/tenant_config.go line 46 at r2 (raw file):

)

// WaitForClusterStart, if enabled, instructs the tenant controller to

What's the benefit of having two settings rather than a single duration setting in which 0 means "disabled"? Also, why not enable it by default?

pkg/multitenant/tenant_config.go line 47 at r2 (raw file):

// WaitForClusterStart, if enabled, instructs the tenant controller to
// wait up to WaitForClusterStartTimeout for the defuault virtual

nit: s/defuault/default/.

pkg/multitenant/tenant_config.go line 56 at r2 (raw file):

)

// WaitForClusterStartTimeout is the amoutn of time the the tenant

nit: s/amoutn/amount/ and s/the the/the/.

pkg/server/server_controller_accessors.go line 21 at r4 (raw file):

// getServer retrieves a reference to the current server for the given
// tenant name.

nit: mention what the returned channel is about.

msbutler

A fun exploration of server code!

msbutler · 2023-11-03T16:49:35Z

pkg/server/server_controller_test.go

+	sqlRunner.Exec(t, "CREATE TENANT hello")
+	sqlRunner.Exec(t, "ALTER VIRTUAL CLUSTER hello START SERVICE SHARED")
+	sqlRunner.Exec(t, "SET CLUSTER SETTING server.controller.default_target_cluster = 'hello'")
+	require.NoError(t, tryConnect())


I assume this test would occasionally pass without your patch, right?

Yes, but only rarely.

msbutler · 2023-11-03T17:10:35Z

pkg/server/server_controller_sql.go

 			if err == nil {
-				break
+				return s, nil
+			}


We should only wait if the err matches "server for tenant %q not ready", correct? If so, I feel like this error string should have its own mark.

No, we want to retry on both errors. Both the existence of the tenant and its startup are async from the perspective of the server controller because we learn about tenants via a rangefeed.

msbutler · 2023-11-03T17:15:58Z

pkg/server/server_controller_sql.go

+		t := timeutil.NewTimer()
+		defer t.Stop()
+		t.Reset(multitenant.WaitForClusterStartTimeout.Get(&c.st.SV))
+		for {


naive question: did you consider using tenantWaiter.DoChan() to orchestrate this dance instead? At a high level, its purpose seems to match the use case here, but there may be details I'm missing.

I agree that it seems a bit better. Specifically so that cancellation on one inbound connection doesn't affect other connections in the same flight.

stevendanna

Thanks for the reviews. Should be ready for another look.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @msbutler, and @yuzefovich)

pkg/multitenant/tenant_config.go line 46 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

What's the benefit of having two settings rather than a single duration setting in which 0 means "disabled"? Also, why not enable it by default?

I agree, one option would be better. Done. I've changed it to enabled by default, but depending on how long it takes me to get this in, perhaps we backport it disabled.

pkg/multitenant/tenant_config.go line 47 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: s/defuault/default/.

Thanks. Done.

pkg/multitenant/tenant_config.go line 56 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: s/amoutn/amount/ and s/the the/the/.

Thanks. Done.

yuzefovich

Reviewed 14 of 14 files at r6, 3 of 3 files at r7, 3 of 3 files at r8, 8 of 8 files at r9, 3 of 3 files at r10, 3 of 3 files at r11, 1 of 1 files at r12, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @abarganier, @msbutler, and @stevendanna)

pkg/multitenant/tenant_config.go line 48 at r11 (raw file):

// WaitForClusterStartTimeout is the amount of time the tenant
// controller will wait for the default virtual cluster to have an
// active SQL server, if WaitForClusterStart is true.

nit: WaitForClusterStart is no more.

stevendanna · 2023-11-07T13:51:56Z

bors r=yuzefovich

craig · 2023-11-07T14:32:40Z

Build succeeded:

Bazel Essential CI (Cockroach)

blathers-crl · 2023-11-07T14:33:00Z

Encountered an error creating backports. Some common things that can go wrong:

The backport branch might have already existed.
There was a merge conflict.
The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.

error setting reviewers, but backport branch blathers/backport-release-23.2-113687 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/113933/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository. []

Backport to branch 23.2.x failed. See errors above.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

stevendanna requested review from a team as code owners November 2, 2023 17:20

stevendanna requested a review from a team November 2, 2023 17:20

stevendanna requested a review from a team as a code owner November 2, 2023 17:20

stevendanna requested review from abarganier and yuzefovich and removed request for a team November 2, 2023 17:20

stevendanna requested a review from msbutler November 2, 2023 17:21

stevendanna added the backport-23.2.x label ("Flags PRs that need to be backported to 23.2.") Nov 2, 2023

yuzefovich reviewed Nov 2, 2023

msbutler reviewed Nov 3, 2023

stevendanna force-pushed the wait-for-tenant-servers branch from e42a77c to e981f23 Compare November 6, 2023 13:14

stevendanna commented Nov 6, 2023

stevendanna force-pushed the wait-for-tenant-servers branch from e981f23 to 7112668 Compare November 6, 2023 13:51

yuzefovich requested a review from msbutler November 6, 2023 15:39

yuzefovich approved these changes Nov 6, 2023

abarganier removed their request for review November 6, 2023 15:45

stevendanna added 4 commits November 7, 2023 09:23

server: use singleflight when waiting on sql server

e9e64d3

Using singleflight avoids having a large number of connections polling for the tenant to start up. Release note: none

server: avoid polling when waiting for tenants

bacf54e

Rather than polling, this uses a channel to notify us when the set of tenants has changed or when the tenant's state has changed. Release note: None

server: use RWMutex in server ochestrator

a5f46a1

We call getServer on every new connection. It seems prudent that we wouldn't want concurrent connections to contend over a mutex when most of the time nothing is writing to the data protected by this mutex. Release note: None

stevendanna force-pushed the wait-for-tenant-servers branch from 7112668 to 6c3e892 Compare November 7, 2023 09:24

stevendanna added 2 commits November 7, 2023 09:27

server: use a single setting for tenant waiting

a98d66e

We can use a zero-wait time to indicate no waiting, saving a cluster setting. Epic: none Release note: None

server: use singleflight.DoChan to avoid shared cancellation

9a7c8b3

Release note: None

stevendanna force-pushed the wait-for-tenant-servers branch from 6c3e892 to 9a7c8b3 Compare November 7, 2023 09:28

craig bot merged commit 85e4a51 into cockroachdb:master Nov 7, 2023
3 checks passed

blathers-crl bot mentioned this pull request Nov 7, 2023

release-23.2: server: option to wait for default_target_cluster on connect #113933

Merged
