
multitenant: add SQL server startup guardrails #94973

Merged · 1 commit · Feb 1, 2023

Conversation

healthy-pod
Contributor

@healthy-pod healthy-pod commented Jan 10, 2023

This code change prevents a SQL server from starting if its
binary version is less than the tenant's active version.
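
The core of the guardrail is a plain version comparison run during SQL server startup. As a minimal sketch — the `Version` type and `checkStartupVersion` helper below are simplified stand-ins for `roachpb.Version` and the real startup hook, not the actual CockroachDB code:

```go
package main

import "fmt"

// Version mirrors the shape of roachpb.Version (major/minor only, for brevity).
type Version struct {
	Major, Minor int32
}

// Less reports whether v precedes o in version order.
func (v Version) Less(o Version) bool {
	if v.Major != o.Major {
		return v.Major < o.Major
	}
	return v.Minor < o.Minor
}

// checkStartupVersion sketches the guardrail: refuse to start the SQL
// server when its binary version is below the tenant's active version.
func checkStartupVersion(binary, tenantActive Version) error {
	if binary.Less(tenantActive) {
		return fmt.Errorf(
			"preventing SQL server from starting because its binary version (%d.%d) "+
				"is too low for the tenant active version (%d.%d)",
			binary.Major, binary.Minor, tenantActive.Major, tenantActive.Minor)
	}
	return nil
}

func main() {
	// A newer binary may serve a tenant at an older active version...
	fmt.Println(checkStartupVersion(Version{22, 2}, Version{22, 1}))
	// ...but an older binary must not serve a tenant already upgraded past it.
	fmt.Println(checkStartupVersion(Version{22, 1}, Version{22, 2}) != nil)
}
```

A newer binary serving an older tenant is fine (that is the normal upgrade path); only the reverse is refused.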

It makes RowDecoder.DecodeRow thread-safe by making it
accept a *tree.DatumAlloc and use a new alloc if it is nil.

The check runs after we run permanent upgrades because
that's when the version setting is written to the settings table
of the system tenant (and we want to apply this check to both system
and secondary tenants). To guard against future changes to where we
set the version, if GetClusterVersionFromStorage doesn't find a value for
the version setting it will return an error because this means the
version is not set yet.
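
The fail-loud lookup described above can be sketched like so — `getClusterVersion` and the map standing in for the settings table are illustrative assumptions, not the real `GetClusterVersionFromStorage` signature:

```go
package main

import (
	"errors"
	"fmt"
)

var errVersionNotSet = errors.New(
	"no version setting found in the settings table; the version has not been written yet")

// getClusterVersion sketches the behavior of GetClusterVersionFromStorage:
// look up the "version" setting and, rather than silently defaulting,
// return an error when the row is absent. The map stands in for the KV read.
func getClusterVersion(settings map[string]string) (string, error) {
	v, ok := settings["version"]
	if !ok {
		// Guards against future reordering of where the version is written:
		// a missing row means this check ran before the version was set.
		return "", errVersionNotSet
	}
	return v, nil
}

func main() {
	if _, err := getClusterVersion(map[string]string{}); err != nil {
		fmt.Println("empty settings:", err)
	}
	v, _ := getClusterVersion(map[string]string{"version": "22.2"})
	fmt.Println("found version:", v)
}
```

Returning an error (instead of assuming a default) is what keeps the check honest if the write ever moves earlier or later in startup.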

Release note: None
Epic: CRDB-10829

@healthy-pod healthy-pod requested review from knz and ajstorm January 10, 2023 03:52
@healthy-pod healthy-pod requested review from a team as code owners January 10, 2023 03:52

// Prevent a secondary tenant's server from starting if its binary version is too low
// for the current tenant cluster version.
if !s.execCfg.Codec.ForSystemTenant() {
Contributor

It would be better to structure this code to not need this codec check here.
The check should be the same for the system tenant.

Contributor

Actually based on how you structured the code, i'd say just drop the check and the remainder of the code is correct.

Contributor Author

Done

@ajstorm ajstorm requested a review from knz January 10, 2023 14:56
@ajstorm ajstorm (Collaborator) left a comment

The code change looks good, but it could use a test. The easiest place to put the test would be in multitenant-upgrade, but you may want to wait to add that until I get #94998 merged as I've reworked that test a bit.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @healthy-pod and @knz)

@healthy-pod
Contributor Author

The code change looks good, but it could use a test. The easiest place to put the test would be in multitenant-upgrade, but you may want to wait to add that until I get #94998 merged as I've reworked that test a bit.

I agree, and I had a question about this part: only the most recent binary has this code change and performs this check, so how would we go about testing it against a binary with an older version?

@healthy-pod healthy-pod changed the title multitenant: secondary tenant SQL server startup guardrails multitenant: add SQL server startup guardrails Jan 10, 2023
@knz knz (Contributor) left a comment

Reviewed 1 of 2 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @healthy-pod)

@ajstorm ajstorm (Collaborator) left a comment

@healthy-pod, as mentioned yesterday, you should be able to fake the binary version with a test like we're doing here: https://github.com/cockroachdb/cockroach/pull/94998/files#diff-6e04200b0921d3dce849044b96b198a595d76a9cfda91dcaa4b35076fd392998R128. Let me know when that's done and I'll have another look. Thanks!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @healthy-pod)

@healthy-pod healthy-pod force-pushed the startup-gr branch 2 times, most recently from b8afe42 to 1a44dc0 Compare January 16, 2023 16:28
@healthy-pod healthy-pod requested review from ajstorm and knz January 16, 2023 16:30
@healthy-pod healthy-pod force-pushed the startup-gr branch 4 times, most recently from e4d1db8 to fc07b22 Compare January 16, 2023 18:47
@healthy-pod
Contributor Author

ready for another pass :)

@ajstorm ajstorm (Collaborator) left a comment

Modulo my questions, this looks good to me. You'll probably want @ajwerner to have a look as well though, as he's a real upgrades expert. I just play one on TV.

Reviewed 1 of 5 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @healthy-pod and @knz)


pkg/server/server_sql.go line 1494 at r3 (raw file):

	if s.execCfg.Settings.Version.BinaryVersion().Less(tenantActiveVersion) {
		return errors.Newf("preventing SQL server from starting because its binary version "+
			"is too low for the tenant active version: server binary version = %v, tenant active version = %v",

nit: We might want to consider wrapping this error message with a hint (errors.WithHint()) to explain that the solution to this problem is to move to a higher tenant binary. Should be obvious to most, but maybe not all.


pkg/server/settingswatcher/settings_watcher.go line 459 at r3 (raw file):

	row, err := db.Get(context.Background(), key)
	if err != nil {
		log.Fatalf(ctx, "failed to get tenant cluster version row: %v", err)

Is there something more user-friendly we can return here? If this happens, the tenant is corrupted, right? Is there any self-service option that the user can perform? I'd imagine not, so perhaps the action they'll need to take is to open a support ticket.


pkg/ccl/serverccl/server_startup_guardrails_test.go line 62 at r3 (raw file):

			test.storageBinaryVersion,
			test.storageBinaryMinSupportedVersion,
			false, /* initializeVersion */

Hmm, I'm surprised that we're not initializing the version here. If we don't initialize the version, what does the active cluster version get set to?

@healthy-pod healthy-pod requested a review from ajwerner January 19, 2023 16:10
@healthy-pod healthy-pod (Contributor, author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @ajwerner, and @knz)


pkg/ccl/serverccl/server_startup_guardrails_test.go line 62 at r3 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Hmm, I'm surprised that we're not initializing the version here. If we don't initialize the version, what does the active cluster version get set to?

It gets set to the binary version, and if we attempt to initialize it we get: cannot initialize version to 21.1 because already set to: 21.2. Note that the binary version is 21.2 and the min supported version is 21.1. I am still trying to trace how it gets to 21.2, but I am guessing serverutils.StartServer somehow ends up initializing the version regardless.

@healthy-pod
Contributor Author

healthy-pod commented Jan 19, 2023

So the ActiveVersion in storageSettings doesn't get initialized when we call MakeTestingClusterSettingsWithVersions (because we set initializeVersion to false), but it gets initialized to 20.2/minBinary here, then gets re-initialized to 21.2/binaryVersion here. This is how it ends up at binaryVersion when StartServer returns.
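
The initialize-then-reinitialize behavior described above can be sketched as follows — the `versionSetting` type, string versions, and ordering map are simplified assumptions, not the real clusterversion code:

```go
package main

import "fmt"

// versionSetting sketches the flow: an unset version can be initialized,
// and afterwards it can only be moved forward, which is why the active
// version ends up at the binary version once StartServer returns.
type versionSetting struct{ active string }

// initialize sets the version, refusing to move it backwards.
func (v *versionSetting) initialize(to string, order map[string]int) error {
	if v.active != "" && order[to] < order[v.active] {
		return fmt.Errorf("cannot initialize version to %s because already set to: %s",
			to, v.active)
	}
	v.active = to
	return nil
}

func main() {
	order := map[string]int{"20.2": 0, "21.1": 1, "21.2": 2}
	var v versionSetting
	// StartServer first initializes to the min supported version...
	fmt.Println(v.initialize("20.2", order), v.active)
	// ...then re-initializes forward to the binary version.
	fmt.Println(v.initialize("21.2", order), v.active)
	// A later attempt to initialize backwards (e.g. to 21.1) fails.
	fmt.Println(v.initialize("21.1", order))
}
```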

@healthy-pod healthy-pod (Contributor, author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @ajwerner, and @knz)


pkg/server/server_sql.go line 1494 at r3 (raw file):

Previously, ajstorm (Adam Storm) wrote…

nit: We might want to consider wrapping this error message with a hint (errors.WithHint()) to explain that the solution to this problem is to move to a higher tenant binary. Should be obvious to most, but maybe not all.

Done


pkg/server/settingswatcher/settings_watcher.go line 459 at r3 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Is there something more user-friendly we can return here? If this happens, the tenant is corrupted, right? Is there any self-service option that the user can perform? I'd imagine not, so perhaps the action they'll need to take is to open a support ticket.

Since this is a fatal error, the logs will have something like the following, right?

F230119 22:39:04.501381 15 clusterversion/setting.go:124  [T1] 3 !This node experienced a fatal error (printed above), and as a result the
F230119 22:39:04.501381 15 clusterversion/setting.go:124  [T1] 3 !process is terminating.
F230119 22:39:04.501381 15 clusterversion/setting.go:124  [T1] 3 !
F230119 22:39:04.501381 15 clusterversion/setting.go:124  [T1] 3 !Fatal errors can occur due to faulty hardware (disks, memory, clocks) or a
F230119 22:39:04.501381 15 clusterversion/setting.go:124  [T1] 3 !problem in CockroachDB. With your help, the support team at Cockroach Labs
F230119 22:39:04.501381 15 clusterversion/setting.go:124  [T1] 3 !will try to determine the root cause, recommend next steps, and we can
F230119 22:39:04.501381 15 clusterversion/setting.go:124  [T1] 3 !improve CockroachDB based on your report.
F230119 22:39:04.501381 15 clusterversion/setting.go:124  [T1] 3 !
F230119 22:39:04.501381 15 clusterversion/setting.go:124  [T1] 3 !Please submit a crash report by following the instructions here:
F230119 22:39:04.501381 15 clusterversion/setting.go:124  [T1] 3 !
F230119 22:39:04.501381 15 clusterversion/setting.go:124  [T1] 3 !    https://github.com/cockroachdb/cockroach/issues/new/choose
F230119 22:39:04.501381 15 clusterversion/setting.go:124  [T1] 3 !
F230119 22:39:04.501381 15 clusterversion/setting.go:124  [T1] 3 !If you would rather not post publicly, please contact us directly at:
F230119 22:39:04.501381 15 clusterversion/setting.go:124  [T1] 3 !
F230119 22:39:04.501381 15 clusterversion/setting.go:124  [T1] 3 !    [email protected]

@healthy-pod healthy-pod requested a review from ajstorm January 19, 2023 23:43
@healthy-pod healthy-pod (Contributor, author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @ajwerner, and @knz)


pkg/ccl/serverccl/server_startup_guardrails_test.go line 62 at r3 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Thanks for looking into this. Your explanation resolves my question. I'm wondering though, if we want to sprinkle a few comments down below to explain how this test works. For example, it might not be immediately obvious to future readers that the tenant below gets created with TLV = SBV (I'm assuming, because it's created and assigned the storage cluster's active version at the time). The more context you add down there, the less time future engineers will need to spend understanding this code before they can modify it.

All that being said, I'm fine with this PR now, but I do think that we still need @ajwerner to have a look before you can merge.

I added more context, let me know if I need to add something else.

@ajwerner ajwerner (Contributor) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @healthy-pod, and @knz)


pkg/server/settings_cache.go line 147 at r6 (raw file):

) error {
	dec := settingswatcher.MakeRowDecoder(codec)
	var alloc *tree.DatumAlloc

nit: this should allocate an alloc or explicitly pass nil. I find declaring this variable to be confusing.


pkg/server/settingswatcher/settings_watcher.go line 48 at r6 (raw file):

	dec              RowDecoder
	storage          Storage
	allocForHandleKV tree.DatumAlloc

nit: I don't think this has to be on the struct -- it can just be captured by the closure if you wanted
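
The closure-capture suggestion might look roughly like this sketch (all names are illustrative; the real handleKV decodes roachpb.KeyValues on the settings watcher's single-threaded rangefeed loop, not strings):

```go
package main

import "fmt"

// DatumAlloc stands in for tree.DatumAlloc: cheap to reuse from one
// goroutine, but not safe to share across goroutines.
type DatumAlloc struct{ uses int }

// makeHandleKV sketches the suggestion above: the single-threaded watcher
// loop gets one long-lived alloc captured by the closure, without adding
// a field to the SettingsWatcher struct.
func makeHandleKV(decode func(kv string, a *DatumAlloc) string) func(string) string {
	var alloc DatumAlloc // captured; lives as long as the handler does
	return func(kv string) string {
		return decode(kv, &alloc)
	}
}

func main() {
	decode := func(kv string, a *DatumAlloc) string {
		a.uses++ // reusing the same alloc avoids per-row allocations
		return "decoded:" + kv
	}
	handle := makeHandleKV(decode)
	fmt.Println(handle("k1"))
	fmt.Println(handle("k2"))
}
```

This is safe precisely because handleKV is only ever invoked from one goroutine; concurrent callers would instead pass nil and get a fresh alloc per call.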

@ajstorm ajstorm requested a review from ajwerner January 23, 2023 21:45
@ajstorm ajstorm (Collaborator) left a comment

I'm good too, mod nits.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @healthy-pod, and @knz)


pkg/ccl/serverccl/server_startup_guardrails_test.go line 68 at r6 (raw file):

		// true with an assertion below. This is needed because in some test cases we want to ensure
		// the active version of this server is greater than the binary version of the tenant. By knowing
		// that the SBV is higher than TBV and SLV is equal to SBV we can be sure that SLV is higher than

Nit: Nice comment. My only concern is that SBV, TBV, SLV and TLV are not widely used terms. You might want to spell them out here instead of using acronyms.

@healthy-pod
Contributor Author

Addressed the nits.
TFTRs!

bors r+

@craig
Contributor

craig bot commented Jan 24, 2023

Build failed:

@healthy-pod
Contributor Author

The test I added here failed when #94774 was merged. I made some changes to the test and the tenant creation code to fix the test and make sure there are as few moving parts as possible.

PTAL

@healthy-pod healthy-pod force-pushed the startup-gr branch 2 times, most recently from 7ea194d to 00eb938 Compare January 31, 2023 01:52
@ajstorm ajstorm (Collaborator) left a comment

All of my comments were addressed, so I'm good now.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @healthy-pod, and @knz)


pkg/ccl/serverccl/server_startup_guardrails_test.go line 68 at r6 (raw file):

Previously, ajstorm (Adam Storm) wrote…

Nit: Nice comment. My only concern is that SBV, TBV, SLV and TLV are not widely used terms. You might want to spell them out here instead of using acronyms.

LGTM.


pkg/ccl/serverccl/server_startup_guardrails_test.go line 96 at r9 (raw file):

		)

		// The tenant will be created with an active version equal to the version corresponding to TenantLogicalVersionKey.

Extremely minor nit: try to keep comment lines to 80 characters or less.

@healthy-pod healthy-pod force-pushed the startup-gr branch 2 times, most recently from d8170d9 to 5e29d40 Compare January 31, 2023 20:51
@ajwerner ajwerner (Contributor) left a comment

Just nits, otherwise LGTM

) (clusterversion.ClusterVersion, error) {
indexPrefix := s.codec.IndexPrefix(keys.SettingsTableID, uint32(1))
key := encoding.EncodeUvarintAscending(encoding.EncodeStringAscending(indexPrefix, "version"), uint64(0))
log.Infof(ctx, "getting tenant cluster version using key: '%v'", roachpb.Key(key))
Contributor

nit: probably want to remove this or put it behind some vmodule before merging.

@@ -144,7 +144,7 @@ func initializeCachedSettings(
) error {
dec := settingswatcher.MakeRowDecoder(codec)
for _, kv := range kvs {
-	settings, val, _, err := dec.DecodeRow(kv)
+	settings, val, _, err := dec.DecodeRow(kv, nil)
Contributor

nit: name the nil

@@ -777,6 +779,10 @@ var versionsSingleton = func() keyedVersions {
// simply need to check is that the cluster has upgraded to 23.1.
var V23_1 = versionsSingleton[len(versionsSingleton)-1].Key

const (
BinaryMinSupportedVersionKey = V22_1
Contributor

@dt do you see any reason to not expose this like this?

Contributor

// BinaryMinSupportedVersion implements the Handle interface.
func (v *handleImpl) BinaryMinSupportedVersion() roachpb.Version {
return v.binaryMinSupportedVersion
}
is the way to do it I do believe

Contributor Author

fwiw, I did not end up using BinaryMinSupportedVersionKey because it's different from the tenant creation min version, but I decided to keep it, assuming we will unify them? I can remove it. We need the key specifically rather than the version because we map the key to the initial values we want from https://github.com/cockroachdb/cockroach/blob/af0beec136262540f2e777c1b52a93ad21ac7481/pkg/sql/catalog/bootstrap/previous_releases.go#L26

Contributor

ah, I see. I guess what we could do is just write a test that asserts that the key here maps to the min binary version or something like that.

Contributor Author

and in a similar test (or in the same one) ensure that when BinaryVersionKey is passed to ByKey it returns the same version as binaryVersion?
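
Such a test could be as simple as asserting the key-to-version mapping; everything below is a stand-in for clusterversion.ByKey and the real keyed versions, sketched only to show the shape of the assertion:

```go
package main

import "fmt"

// Key stands in for clusterversion.Key.
type Key int

const (
	BinaryMinSupportedVersionKey Key = iota
	BinaryVersionKey
)

// byKey stands in for clusterversion.ByKey; in the real test these would
// be looked up from the versionsSingleton.
var byKey = map[Key]string{
	BinaryMinSupportedVersionKey: "22.1",
	BinaryVersionKey:             "23.1",
}

func main() {
	// The proposed assertions: the exported keys must map back to the
	// binary version and the min supported binary version.
	binaryVersion := "23.1"
	binaryMinSupportedVersion := "22.1"
	fmt.Println(byKey[BinaryVersionKey] == binaryVersion)
	fmt.Println(byKey[BinaryMinSupportedVersionKey] == binaryMinSupportedVersion)
}
```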

@@ -236,7 +238,7 @@ func (s *SettingsWatcher) handleKV(
name, val, tombstone, err := s.dec.DecodeRow(roachpb.KeyValue{
Key: kv.Key,
Value: kv.Value,
})
Contributor

What I was saying here is that you could create a long-lived alloc because this is single-threaded. It doesn't really matter I guess.


// GetFn is a function to retrieve data from the kv store. It is implemented
// by both (*kv.DB).Get and (*kv.Txn).Get.
type GetFn = func(context.Context, interface{}) (kv.KeyValue, error)
Contributor

Given what we now know, maybe we just change this to take a *kv.Txn or some interface called Txn which just has this Get method if you don't want to depend on the concrete impl.

@healthy-pod healthy-pod force-pushed the startup-gr branch 2 times, most recently from af0beec to 42b30ea Compare January 31, 2023 22:44
@healthy-pod
Contributor Author

TFTsecondRs!
bors r=[ajstorm,ajwerner]

@craig
Contributor

craig bot commented Feb 1, 2023

Build failed (retrying...):

@craig
Contributor

craig bot commented Feb 1, 2023

Build succeeded:
