server: wait for active configuration profiles during startup #113666

Merged: 1 commit merged into cockroachdb:master from wait-for-config-profiles on Nov 6, 2023

Conversation

@stevendanna (Collaborator) commented Nov 2, 2023

Previously, we did not wait for the active configuration profiles to be applied at startup. This produced confusing behaviour: the behaviour of the cluster could change substantially a few moments after startup.

For instance, the replication-source configuration profile runs:

SET CLUSTER SETTING server.controller.default_target_cluster = 'application'

If this is delayed until after we start accepting connections, then for a few moments new connections will go to the system tenant before they start going to the application tenant.

Here, we narrow the window of confusion by waiting for the configuration profiles to be complete during the preStart sequence in SQL.
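
Conceptually, the wait amounts to something like the sketch below. It is
standard-library only; the Provider stand-in, the polling loop, and the
interval are illustrative assumptions, and only the ActiveEnvironments
method mirrors the real provider interface:

    package autoconfigwait

    import (
        "context"
        "time"
    )

    // Provider is a stand-in for the auto-config provider; the real method
    // returns []autoconfigpb.EnvironmentID rather than []string.
    type Provider interface {
        ActiveEnvironments() []string
    }

    // waitForAutoConfig blocks until no auto-configuration environments remain
    // active, i.e. all configuration profile tasks have been applied, or until
    // the startup context is cancelled.
    func waitForAutoConfig(ctx context.Context, p Provider) error {
        ticker := time.NewTicker(100 * time.Millisecond) // illustrative interval
        defer ticker.Stop()
        for {
            if len(p.ActiveEnvironments()) == 0 {
                return nil // nothing pending, or auto-configuration already complete
            }
            select {
            case <-ctx.Done():
                return ctx.Err()
            case <-ticker.C:
            }
        }
    }

In the server, a wait of this kind runs during preStart, before SQL
connections are accepted, which is what narrows the window described above.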

Note that this doesn't solve startup coordination. There are still at least two problems:

  1. Async server startup means the user may still get an error for a few moments until the server has fully started.

  2. Async settings propagation means that the default_target_cluster setting can still take effect with a delay.

Informs #111637

Release note: None

@stevendanna requested review from a team as code owners on November 2, 2023 13:15
@cockroach-teamcity (Member)

This change is Reviewable

@stevendanna added the backport-23.2.x label (Flags PRs that need to be backported to 23.2.) on Nov 2, 2023
@yuzefovich (Member) left a comment

Nice! I only have a few nits :lgtm:

Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @adityamaru and @stevendanna)


pkg/server/server_sql.go line 1774 at r1 (raw file):

	envs := s.execCfg.AutoConfigProvider.ActiveEnvironments()
	if len(envs) == 0 {
		log.Infof(ctx, "auto-configuration environments not set or already complete")

nit: preStart is called in both system and application tenants. Do we have any auto config changes in the system tenant? If not, should we consider omitting this log message for system tenants then?
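
One possible shape of that suggestion, continuing the snippet above; the
ForSystemTenant check is an assumption about how the system tenant would be
detected here, not necessarily what the code would use:

	envs := s.execCfg.AutoConfigProvider.ActiveEnvironments()
	if len(envs) == 0 {
		// Hypothetical: only log in application tenants, since the system
		// tenant is not expected to run auto-config environments.
		if !s.execCfg.Codec.ForSystemTenant() {
			log.Infof(ctx, "auto-configuration environments not set or already complete")
		}
	}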


pkg/server/autoconfig/auto_config_test.go line 109 at r1 (raw file):

}

func (p *testProvider) ActiveEnvironments() []autoconfigpb.EnvironmentID {

nit: should we adjust the contract on Provider.ActiveEnvironments to explicitly mention that nil (or zero length slice) is returned once there are no more tasks left to run?
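
For instance, the contract could be spelled out in the interface's doc
comment along these lines (a wording sketch only, assuming the existing
autoconfigpb import; other Provider methods elided):

	type Provider interface {
		// ActiveEnvironments returns the IDs of the auto-config environments
		// that still have tasks left to run. Once there are no more tasks, it
		// returns nil (or a zero-length slice), so callers may treat an empty
		// result as "auto-configuration complete".
		ActiveEnvironments() []autoconfigpb.EnvironmentID
	}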


pkg/server/autoconfig/auto_config_test.go line 348 at r1 (raw file):

				Provider: provider,
			},
			// Server: &server.TestingKnobs{

nit: leftover.

@stevendanna force-pushed the wait-for-config-profiles branch from a4ebf87 to 0969935 on November 6, 2023 12:46
@stevendanna (Collaborator, Author)

bors r=yuzefovich

@craig (Contributor) commented Nov 6, 2023

Build failed:

@stevendanna (Collaborator, Author)

bors retry

@craig (Contributor) commented Nov 6, 2023

Build succeeded:

@craig bot merged commit 56cf731 into cockroachdb:master on Nov 6, 2023
3 checks passed
craig bot pushed a commit that referenced this pull request Nov 7, 2023
113687: server: option to wait for default_target_cluster on connect r=yuzefovich a=stevendanna

Virtual cluster startup is asynchronous. As a result, it can be hard to
orchestrate bringing a virtual cluster up since the client needs to
retry until the cluster is available.

This makes it a little easier by optionally allowing connections to
the default_target_cluster to be held until the cluster is online.

As a result, a user who does:

    ALTER VIRTUAL CLUSTER app START SERVICE SHARED
    SET CLUSTER SETTING server.controller.default_target_cluster = 'app'

should be able to connect immediately to the same node without
having to handle retries while the virtual cluster starts up.

Note that we could extend this to wait for _any_ tenant. We restrict
the wait to the default_target_cluster for now because waiting on
arbitrary tenants would require threading a little more state into
the server controller to avoid pitfalls such as holding connections
open longer than needed for a tenant that will never become
available because it doesn't exist.
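
A rough sketch of the connection-hold idea is below. All of the names
(controller, started, maybeWaitForTenant) are hypothetical stand-ins for
server-controller internals, and the signalling mechanism is an assumption,
not this PR's actual implementation:

    package connwait

    import "context"

    // controller is a stand-in for the server controller. started is closed
    // once the default target cluster's SQL server is online.
    type controller struct {
        defaultTargetCluster string
        started              chan struct{}
    }

    // maybeWaitForTenant holds the caller until the requested tenant is
    // online, but only when it is the default target cluster; connections to
    // other tenants are not held, so we never wait on a tenant that may never
    // become available.
    func (c *controller) maybeWaitForTenant(ctx context.Context, name string) error {
        if name != c.defaultTargetCluster {
            return nil // no waiting for non-default tenants
        }
        select {
        case <-c.started:
            return nil
        case <-ctx.Done():
            return ctx.Err()
        }
    }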

In a future PR, we will set this option by default in the replication-source and replication-target
configuration profiles.

Built on #113666

Informs #111637

Release note: None

Co-authored-by: Steven Danna <[email protected]>