Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachprod: add virtual cluster support to the monitor #111064

Merged

Conversation

renatolabs
Copy link
Contributor

@renatolabs renatolabs commented Sep 21, 2023

This PR makes the roachprod monitor aware of multi-tenancy. This means
that the monitor is able to identify scenarios where multiple tenants
are running on the same VM, and monitor the lifecycle of tenant
processes just like it does for non multi-tenant deployments.

Under the covers, this is achieved by leveraging the
"ROACHPROD_VIRTUAL_CLUSTER" environment variable that is
set when starting cockroach. The monitor then looks for processes
that have that environment variable set, and parses it to identify the
tenant name and instance corresponding to that cockroach process.

This change should go unnoticed by all current monitor callers: tests
that monitor cockroach processes will continue to be notified of
events just like they did before. The only thing that changes is that,
if a test uses roachprod to start and stop tenants, these events will
be available through the monitor.

This PR also adds a stop-tenant command that uses the same
ROACHPROD_TENANT environment variable to allow the caller to
stop a specific tenant process.

Epic: none

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@renatolabs renatolabs force-pushed the rc/roachprod-multitenant-monitor branch 12 times, most recently from 0349e34 to 9cbeda5 Compare September 27, 2023 17:33
@renatolabs renatolabs marked this pull request as ready for review September 27, 2023 19:42
@renatolabs renatolabs requested a review from a team as a code owner September 27, 2023 19:42
@renatolabs renatolabs requested review from herkolategan and smg260 and removed request for a team September 27, 2023 19:42
@renatolabs
Copy link
Contributor Author

Open for review. A few runs so far:

Copy link
Collaborator

@herkolategan herkolategan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! Minor nits about tenant vs. virtual cluster naming and a few other things.
:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @renatolabs and @smg260)


-- commits line 38 at r1:
Nit: Should this now be virtual cluster support?


-- commits line 43 at r1:
Nit: Should this now be non virtual cluster deployments?


pkg/cmd/roachprod/main.go line 561 at r1 (raw file):

		}

		// Always pick a random available port when starting tenants. We

Nit: virtual clusters.


pkg/roachprod/multitenant.go line 84 at r1 (raw file):

// StopServiceForVirtualCluster stops SQL instance processes on the virtualCluster given.
func StopServiceForVirtualCluster(
	ctx context.Context, l *logger.Logger, tenantCluster string, stopOpts StopOpts,

Nit: Do we want to update this to virtual cluster as well?


pkg/roachprod/multitenant.go line 99 at r1 (raw file):

// tenant with ID given.
//
// TODO(herko): Allow users to pass in a tenant name.

Nit: Few mentions of tenant here again. As stated in the TODO we'll probably want to be able to pass virtual cluster name in the future directly once processes can start with a virtual cluster name instead of a virtual cluster id.


pkg/roachprod/roachprod.go line 716 at r1 (raw file):

	MaxWait int

	// Options that only apply to StopTenant

Nit: StopVirtualCluster?


pkg/roachprod/install/cluster_synced.go line 403 at r1 (raw file):

// It sends a signal to all processes that have been started with ROACHPROD env
// var and optionally waits until the processes stop. If the virtualClusterLabel
// is not empty, then only tenant processes with a matching label are stopped.

Nit: sql processes.


pkg/roachprod/install/cluster_synced.go line 444 at r1 (raw file):

// for output/logging. If wait is true, the command will wait for the processes
// to exit, up to maxWait seconds.
// TODO(herko): This command does not support virtual clusters yet.

I suppose we can remove this comment now?


pkg/roachprod/install/cluster_synced.go line 647 at r1 (raw file):

	virtualClusterDesc := func(name string, instance int) string {
		if name == SystemInterfaceName {
			return "system tenant"

Nit: Wondering if we should call this system interface?


pkg/roachprod/install/cluster_synced.go line 746 at r1 (raw file):

			)
			if err := errors.CombineErrors(err, result.Err); err != nil {
				err := errors.Wrap(err, "failed to list tenants")

Nit: virtual clusters.


pkg/roachprod/install/cluster_synced.go line 869 at r1 (raw file):

{{ end }}

# make sure all tenant monitors quit when this script exits. In

Nit: sql process / virtual cluster monitors?


pkg/roachprod/install/cluster_synced.go line 939 at r1 (raw file):

						panic(fmt.Errorf("invalid output from monitor: %q", line))
					}
					// Tenant name and instance are the first fields of every

Nit: virtual cluster name.


pkg/roachprod/install/cluster_synced.go line 942 at r1 (raw file):

					// event type.
					name, instanceStr := parts[0], parts[1]
					instance, _ := strconv.Atoi(instanceStr)

Nit: Should we maybe panic like above in the rare case that this string is not an integer?


pkg/roachprod/install/cluster_synced.go line 949 at r1 (raw file):

						}})
					case runningMsg:
						pid := parts[3]

Nit: Should we add a len < 4 panic here as above?


pkg/roachprod/install/cluster_synced.go line 954 at r1 (raw file):

						}})
					case deadMsg:
						exitCode := parts[3]

Nit: Should we add a len < 4 panic here as above?


pkg/roachprod/install/cockroach.go line 602 at r1 (raw file):

}

// TenantInfoFromLabel takes as parameter a tenant label produced with

Nit: VirtualClusterInfoFromLabel ... virtual cluster label.


pkg/roachprod/install/cockroach_test.go line 27 at r1 (raw file):

	}{
		{
			name:               "empty tenant name",

Nit: empty virtual cluster name.


pkg/roachprod/install/cockroach_test.go line 32 at r1 (raw file):

		},
		{
			name:               "system tenant name",

Nit: System interface name?


pkg/roachprod/install/cockroach_test.go line 37 at r1 (raw file):

		},
		{
			name:               "simple app tenant name",

Nit: simple app virtual cluster name.


pkg/roachprod/install/cockroach_test.go line 43 at r1 (raw file):

		},
		{
			name:               "tenant name with hyphens",

Nit: virtual cluster name with hyphens.


pkg/roachprod/install/cockroach_test.go line 58 at r1 (raw file):

			require.NoError(t, err)

			expectedTenantName := tc.virtualClusterName

Nit: expectedVirtualClusterName.

@renatolabs renatolabs force-pushed the rc/roachprod-multitenant-monitor branch 2 times, most recently from de188ba to c8d7695 Compare October 2, 2023 19:40
Copy link
Contributor Author

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @herkolategan and @smg260)


-- commits line 38 at r1:

Previously, herkolategan (Herko Lategan) wrote…

Nit: Should this now be virtual cluster support?

Done.


-- commits line 43 at r1:

Previously, herkolategan (Herko Lategan) wrote…

Nit: Should this now be non virtual cluster deployments?

Done.


pkg/cmd/roachprod/main.go line 561 at r1 (raw file):

Previously, herkolategan (Herko Lategan) wrote…

Nit: virtual clusters.

Done.


pkg/roachprod/multitenant.go line 84 at r1 (raw file):

Previously, herkolategan (Herko Lategan) wrote…

Nit: Do we want to update this to virtual cluster as well?

Done.


pkg/roachprod/multitenant.go line 99 at r1 (raw file):

Previously, herkolategan (Herko Lategan) wrote…

Nit: Few mentions of tenant here again. As stated in the TODO we'll probably want to be able to pass virtual cluster name in the future directly once processes can start with a virtual cluster name instead of a virtual cluster id.

Done.


pkg/roachprod/roachprod.go line 716 at r1 (raw file):

Previously, herkolategan (Herko Lategan) wrote…

Nit: StopVirtualCluster?

Done.


pkg/roachprod/install/cluster_synced.go line 403 at r1 (raw file):

Previously, herkolategan (Herko Lategan) wrote…

Nit: sql processes.

Done.


pkg/roachprod/install/cluster_synced.go line 444 at r1 (raw file):

Previously, herkolategan (Herko Lategan) wrote…

I suppose we can remove this comment now?

Done.


pkg/roachprod/install/cluster_synced.go line 647 at r1 (raw file):

Previously, herkolategan (Herko Lategan) wrote…

Nit: Wondering if we should call this system interface?

Done.


pkg/roachprod/install/cluster_synced.go line 746 at r1 (raw file):

Previously, herkolategan (Herko Lategan) wrote…

Nit: virtual clusters.

Done.


pkg/roachprod/install/cluster_synced.go line 869 at r1 (raw file):

Previously, herkolategan (Herko Lategan) wrote…

Nit: sql process / virtual cluster monitors?

Done.


pkg/roachprod/install/cluster_synced.go line 939 at r1 (raw file):

Previously, herkolategan (Herko Lategan) wrote…

Nit: virtual cluster name.

Done.


pkg/roachprod/install/cluster_synced.go line 942 at r1 (raw file):

Previously, herkolategan (Herko Lategan) wrote…

Nit: Should we maybe panic like above in the rare case that this string is not an integer?

Done.


pkg/roachprod/install/cluster_synced.go line 949 at r1 (raw file):

Previously, herkolategan (Herko Lategan) wrote…

Nit: Should we add a len < 4 panic here as above?

Done.


pkg/roachprod/install/cluster_synced.go line 954 at r1 (raw file):

Previously, herkolategan (Herko Lategan) wrote…

Nit: Should we add a len < 4 panic here as above?

Done.


pkg/roachprod/install/cockroach.go line 602 at r1 (raw file):

Previously, herkolategan (Herko Lategan) wrote…

Nit: VirtualClusterInfoFromLabel ... virtual cluster label.

Done.


pkg/roachprod/install/cockroach_test.go line 27 at r1 (raw file):

Previously, herkolategan (Herko Lategan) wrote…

Nit: empty virtual cluster name.

Done.


pkg/roachprod/install/cockroach_test.go line 32 at r1 (raw file):

Previously, herkolategan (Herko Lategan) wrote…

Nit: System interface name?

Done.


pkg/roachprod/install/cockroach_test.go line 37 at r1 (raw file):

Previously, herkolategan (Herko Lategan) wrote…

Nit: simple app virtual cluster name.

Done.


pkg/roachprod/install/cockroach_test.go line 43 at r1 (raw file):

Previously, herkolategan (Herko Lategan) wrote…

Nit: virtual cluster name with hyphens.

Done.


pkg/roachprod/install/cockroach_test.go line 58 at r1 (raw file):

Previously, herkolategan (Herko Lategan) wrote…

Nit: expectedVirtualClusterName.

Done.

@renatolabs
Copy link
Contributor Author

Rebased and addressed comments. Running another subset of tests just to be sure. If everything is well, plan to merge this tomorrow:

https://teamcity.cockroachdb.com/viewQueued.html?itemId=12021678&tab=queuedBuildOverviewTab

@renatolabs
Copy link
Contributor Author

Build finished without any unexpected failures. Merging. TFTR!

bors r=herkolategan

@renatolabs renatolabs changed the title roachprod: add multi-tenant support to the monitor roachprod: add virtual cluster support to the monitor Oct 3, 2023
@craig
Copy link
Contributor

craig bot commented Oct 3, 2023

Build failed:

@renatolabs
Copy link
Contributor Author

bors r-

@craig
Copy link
Contributor

craig bot commented Oct 5, 2023

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Oct 5, 2023

Build failed:

The help string for this subcommand already mentioned support for the
`tag` parameter; however, that was not implemented.

In this commit, we update the `startInstanceAsSeparateProcessCmd`
struct to accep the `--tag` flag and also apply a small fix to the
help string (app and storage nodes no longer need to be distinct).

Epic: none

Release note: None
This adds a `roachprod stop-sql` command. It is similar to
`roachprod stop` in the sense that it takes a similar set of flags:
the signal to be sent to the processes, whether to wait for the
process to finish, and for how long. However, one crucial difference
is that `roachprod stop` stops *all* processes started by
roachprod (cockroach or otherwise), whereas `stop-sql` was designed
specifically for stopping tenant processes (SQL instances).

To achieve that, we set a `ROACHPROD_VIRTUAL_CLUSTER` environment
variable to the corresponding cockroach process when starting it. This
label is then used to find the correct process to stop when requested.

We also make use of the label to name the systemctl unit when starting
cockroach processes on remote clusters. This allows multiple tenants
to be co-located on the same VM.

Epic: none

Release note: None
This makes the roachprod monitor aware of virtual clusters. This means
that the monitor is able to identify scenarios where multiple
instances are running on the same VM, and monitor the lifecycle of sql
instance processes just like it does for non virtual cluster
deployments.

Under the covers, this is achieved by leveraging the
"ROACHPROD_VIRTUAL_CLUSTER" environment variable that is set when
starting cockroach. The monitor then looks for processes that have
that environment variable set, and parses it to identify the virtual
cluster name and instance corresponding to that cockroach process.

This change should go unnoticed by all current monitor callers: tests
that monitor cockroach processes will continue to be notified of
events just like they did before. The only thing that changes is that,
if a test uses roachprod to start and stop sql instances, these events
will be available through the monitor.

Epic: none

Release note: None
@renatolabs renatolabs force-pushed the rc/roachprod-multitenant-monitor branch from a348e68 to fc4ff31 Compare October 5, 2023 15:07
@renatolabs
Copy link
Contributor Author

bors retry

@craig
Copy link
Contributor

craig bot commented Oct 5, 2023

Build succeeded:

@craig craig bot merged commit 563aa77 into cockroachdb:master Oct 5, 2023
3 checks passed
@renatolabs renatolabs deleted the rc/roachprod-multitenant-monitor branch October 5, 2023 18:59
renatolabs added a commit to renatolabs/cockroach that referenced this pull request Oct 6, 2023
After cockroachdb#111064, we started naming the systemd unit for cockroach
processes differently, since we now might have multiple cockroach
processes in the same VM in the context of SQL server processes and
cluster virtualization.

That change broke a few tests that directly referenced the
`cockroach.service` unit. In this change, we create a util function
that is reponsible for encapsulating that name and replace direct
references to the systemd unit name with calls to this new function.

Fixes: cockroachdb#111902

Release note: None
craig bot pushed a commit that referenced this pull request Oct 6, 2023
111915: roachtest: remove direct references to cockroach.service from tests r=herkolategan a=renatolabs

After #111064, we started naming the systemd unit for cockroach processes differently, since we now might have multiple cockroach processes in the same VM in the context of SQL server processes and cluster virtualization.

That change broke a few tests that directly referenced the `cockroach.service` unit. In this change, we create a util function that is reponsible for encapsulating that name and replace direct references to the systemd unit name with calls to this new function.

Fixes: #111902

Release note: None

111937: kv: add conflicting txn_meta to error message r=nvanbenschoten a=aadityasondhi

This patch adds back the conflicting txn_meta to the error message printed out from a `TransactionRetryError` if available.

Fixes #110689.

Release note: None

Co-authored-by: Renato Costa <[email protected]>
Co-authored-by: Aaditya Sondhi <[email protected]>
renatolabs added a commit to renatolabs/cockroach that referenced this pull request Oct 19, 2023
This call was added during work on cockroachdb#111064 as an attempt to fix issues
during development. It should not be needed for correctness. It also
makes backporting that work to 23.1 not possible, since the
`WaitDelay` API was introduced in Go 1.20.

Epic: none

Release note: None
craig bot pushed a commit that referenced this pull request Oct 20, 2023
112714: roachprod: remove `WaitDelay` call in `localSession` r=herkolategan a=renatolabs

This call was added during work on #111064 as an attempt to fix issues during development. It should not be needed for correctness. It also makes backporting that work to 23.1 not possible, since the `WaitDelay` API was introduced in Go 1.20.

Epic: none

Release note: None

Co-authored-by: Renato Costa <[email protected]>
blathers-crl bot pushed a commit that referenced this pull request Oct 20, 2023
This call was added during work on #111064 as an attempt to fix issues
during development. It should not be needed for correctness. It also
makes backporting that work to 23.1 not possible, since the
`WaitDelay` API was introduced in Go 1.20.

Epic: none

Release note: None
renatolabs added a commit to renatolabs/cockroach that referenced this pull request Oct 20, 2023
After cockroachdb#111064, we started naming the systemd unit for cockroach
processes differently, since we now might have multiple cockroach
processes in the same VM in the context of SQL server processes and
cluster virtualization.

That change broke a few tests that directly referenced the
`cockroach.service` unit. In this change, we create a util function
that is reponsible for encapsulating that name and replace direct
references to the systemd unit name with calls to this new function.

Fixes: cockroachdb#111902

Release note: None
renatolabs added a commit to renatolabs/cockroach that referenced this pull request Oct 20, 2023
This call was added during work on cockroachdb#111064 as an attempt to fix issues
during development. It should not be needed for correctness. It also
makes backporting that work to 23.1 not possible, since the
`WaitDelay` API was introduced in Go 1.20.

Epic: none

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants