Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: option to wait for default_target_cluster on connect #113687

Merged
merged 6 commits into from
Nov 7, 2023

Conversation

stevendanna
Copy link
Collaborator

Virtual cluster startup is asynchronous. As a result, it can be hard to
orchestrate bringing a virtual cluster up since the client needs to
retry until the cluster is available.

This makes it a little easier by optionally allowing connections to
the default_target_cluster to be held until the cluster is online.

As a result, a user who does:

ALTER VIRTUAL CLUSTER app START SERVICE SHARED
SET CLUSTER SETTING server.controller.default_target_cluster = 'app'

Should be able to immediately connect to the same node without having
to handle retries while the virtual cluster starts up.

Note that we can extend this to wait for any
tenant. The restriction on only waiting on the default_target_cluster
is that we need to thread a little state into the server controller in
order to avoid pitfalls such as holding a lot of connections open for
longer than needed for a tenant that will never become available
because it doesn't exist.

In a future PR, we will set this option by default in the replication-source and replication-target
configuration profiles.

Built on #113666

Informs #111637

Release note: None

@stevendanna stevendanna requested review from a team as code owners November 2, 2023 17:20
@stevendanna stevendanna requested a review from a team November 2, 2023 17:20
@stevendanna stevendanna requested a review from a team as a code owner November 2, 2023 17:20
@stevendanna stevendanna requested review from abarganier and yuzefovich and removed request for a team November 2, 2023 17:20
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@stevendanna stevendanna requested a review from msbutler November 2, 2023 17:21
@stevendanna
Copy link
Collaborator Author

I recommend going commit-by-commit during the review since none of the steps are too complicated even though the overall diff ended up a bit large.

@stevendanna stevendanna added the backport-23.2.x Flags PRs that need to be backported to 23.2. label Nov 2, 2023
@stevendanna
Copy link
Collaborator Author

Pretty sure that the test failed because we are slow enough during stress that the 10s timeout isn't enough. I'll dig into this but I don't expect it to substantially change the shape of this.

Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems good to me.

Reviewed 3 of 3 files at r1, 3 of 3 files at r2, 3 of 3 files at r3, 8 of 8 files at r4, 3 of 3 files at r5, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @msbutler, and @stevendanna)


pkg/multitenant/tenant_config.go line 46 at r2 (raw file):

)

// WaitForClusterStart, if enabled, instructs the tenant controller to

What's the benefit of having two settings rather than a single duration setting in which 0 means "disabled"? Also, why not enable it by default?


pkg/multitenant/tenant_config.go line 47 at r2 (raw file):

// WaitForClusterStart, if enabled, instructs the tenant controller to
// wait up to WaitForClusterStartTimeout for the defuault virtual

nit: s/defuault/default/.


pkg/multitenant/tenant_config.go line 56 at r2 (raw file):

)

// WaitForClusterStartTimeout is the amoutn of time the the tenant

nit: s/amoutn/amount/ and s/the the/the/.


pkg/server/server_controller_accessors.go line 21 at r4 (raw file):

// getServer retrieves a reference to the current server for the given
// tenant name.

nit: mention what the returned channel is about.

Copy link
Collaborator

@msbutler msbutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A fun exploration of server code!

sqlRunner.Exec(t, "CREATE TENANT hello")
sqlRunner.Exec(t, "ALTER VIRTUAL CLUSTER hello START SERVICE SHARED")
sqlRunner.Exec(t, "SET CLUSTER SETTING server.controller.default_target_cluster = 'hello'")
require.NoError(t, tryConnect())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this test would occasionally pass without your patch, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but only rarely.

if err == nil {
break
return s, nil
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should only wait if the err matches server for tenant %q not ready, correct? If so, I feel like this error string should have its own mark.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we want to retry on both errors. Both the existence of the tenant and the startup is async from the perspective of the server controller because we learn about tenants via a rangefeed.

t := timeutil.NewTimer()
defer t.Stop()
t.Reset(multitenant.WaitForClusterStartTimeout.Get(&c.st.SV))
for {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naive question: did you consider using tenantWaiter.DoChan() to orchestrate this dance instead? At a high level, its purpose seems to match the use case here, but there may be details I'm missing.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it seems a bit better. Specifically so that cancellation on one inbound connection doesn't affect other connections in the same flight.

@stevendanna stevendanna force-pushed the wait-for-tenant-servers branch from e42a77c to e981f23 Compare November 6, 2023 13:14
Copy link
Collaborator Author

@stevendanna stevendanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the reviews. Should be ready for another look.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier, @msbutler, and @yuzefovich)


pkg/multitenant/tenant_config.go line 46 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

What's the benefit of having two settings rather than a single duration setting in which 0 means "disabled"? Also, why not enable it by default?

I agree, one option would be better. Done. I've changed it to enabled by default, but depending on how long it takes me to get this in perhaps we backport it disabled.


pkg/multitenant/tenant_config.go line 47 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: s/defuault/default/.

Thanks. Done.


pkg/multitenant/tenant_config.go line 56 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: s/amoutn/amount/ and s/the the/the/.

Thanks. Done.

@stevendanna stevendanna force-pushed the wait-for-tenant-servers branch from e981f23 to 7112668 Compare November 6, 2023 13:51
@yuzefovich yuzefovich requested a review from msbutler November 6, 2023 15:39
Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 14 of 14 files at r6, 3 of 3 files at r7, 3 of 3 files at r8, 8 of 8 files at r9, 3 of 3 files at r10, 3 of 3 files at r11, 1 of 1 files at r12, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @abarganier, @msbutler, and @stevendanna)


pkg/multitenant/tenant_config.go line 48 at r11 (raw file):

// WaitForClusterStartTimeout is the amount of time the tenant
// controller will wait for the default virtual cluster to have an
// active SQL server, if WaitForClusterStart is true.

nit: WaitForClusterStart is no more.

@abarganier abarganier removed their request for review November 6, 2023 15:45
Virtual cluster startup is asynchronous. As a result, it can be hard to
orchestrate bringing a virtual cluster up since the client needs to
retry until the cluster is available.

This makes it a little easier by optionally allowing connections to
the default_target_cluster to be held until the cluster is online.

As a result, a user who does:

    ALTER VIRTUAL CLUSTER app START SERVICE SHARED
    SET CLUSTER SETTING server.controller.default_target_cluster = 'app'

Should be able to immediately connect to the same node without having
to handle retries while the virtual cluster starts up.

Note that in future PRs we can extend this to wait for _any_
tenant. The restriction on only waiting on the default_target_cluster
is that we need to thread a little state into the server controller in
order to avoid pitfalls such as holding a lot of connections open for
longer than needed for a tenant that will never become available
because it doesn't exist.

Informs cockroachdb#111637

Release note: None
Using singleflight avoids having a large number of connections polling
for the tenant to start up.

Release note: none
Rather than polling, this uses a channel to notify us when the set of
tenants has changed or when the tenant's state has changed.

Release note: None
We call getServer on every new connection. It seem prudent that we
wouldn't want concurrent connections to contend over a mutex when most
of the time nothing is writing to the data protected by this mutex.

Release note: None
@stevendanna stevendanna force-pushed the wait-for-tenant-servers branch from 7112668 to 6c3e892 Compare November 7, 2023 09:24
We can use a zero-wait time to indicate no waiting, saving a cluster
setting.

Epic: none

Release note: None
@stevendanna stevendanna force-pushed the wait-for-tenant-servers branch from 6c3e892 to 9a7c8b3 Compare November 7, 2023 09:28
@stevendanna
Copy link
Collaborator Author

bors r=yuzefovich

@craig
Copy link
Contributor

craig bot commented Nov 7, 2023

Build succeeded:

@craig craig bot merged commit 85e4a51 into cockroachdb:master Nov 7, 2023
3 checks passed
Copy link

blathers-crl bot commented Nov 7, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error setting reviewers, but backport branch blathers/backport-release-23.2-113687 is ready: POST https://api.github.com/repos/cockroachdb/cockroach/pulls/113933/requested_reviewers: 422 Reviews may only be requested from collaborators. One or more of the teams you specified is not a collaborator of the cockroachdb/cockroach repository. []

Backport to branch 23.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-23.2.x Flags PRs that need to be backported to 23.2.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants