-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
multitenant: add SQL server startup guardrails #94973
Conversation
pkg/server/server_sql.go
Outdated
|
||
// Prevent a secondary tenant's server from starting if its binary version is too low | ||
// for the current tenant cluster version. | ||
if !s.execCfg.Codec.ForSystemTenant() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better to structure this code to not need this codec check here.
The check should be the same for the system tenant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually based on how you structured the code, i'd say just drop the check and the remainder of the code is correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code change looks good, but it could use a test. The easiest place to put the test would be in multitenant-upgrade, but you may want to wait to add that until I get #94998 merged as I've reworked that test a bit.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @healthy-pod and @knz)
81aae3f
to
709565a
Compare
I agree and I had a question about this part: only the most recent binary has this code change and makes this check so how would we go around testing this with some binary with an older version? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 2 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @healthy-pod)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@healthy-pod, as mentioned yesterday, you should be able to fake the binary version with a test like we're doing here: https://github.com/cockroachdb/cockroach/pull/94998/files#diff-6e04200b0921d3dce849044b96b198a595d76a9cfda91dcaa4b35076fd392998R128. Let me know when that's done and I'll have another look. Thanks!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @healthy-pod)
b8afe42
to
1a44dc0
Compare
e4d1db8
to
fc07b22
Compare
ready for another pass :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Modulo my questions, this looks good to me. You'll probably want @ajwerner to have a look as well though, as he's a real upgrades expert. I just play one on TV.
Reviewed 1 of 5 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @healthy-pod and @knz)
pkg/server/server_sql.go
line 1494 at r3 (raw file):
if s.execCfg.Settings.Version.BinaryVersion().Less(tenantActiveVersion) { return errors.Newf("preventing SQL server from starting because its binary version "+ "is too low for the tenant active version: server binary version = %v, tenant active version = %v",
nit: We might want to consider wrapping this error message with a hint (errors.WithHint()
) to explain that the solution to this problem is to move to a higher tenant binary. Should be obvious to most, but maybe not all.
pkg/server/settingswatcher/settings_watcher.go
line 459 at r3 (raw file):
row, err := db.Get(context.Background(), key) if err != nil { log.Fatalf(ctx, "failed to get tenant cluster version row: %v", err)
Is there something more user-friendly we can return here? If this happens, the tenant is corrupted, right? Is there any self-service option that the user can perform? I'd imagine not, so perhaps the action they'll need to take is to open a support ticket.
pkg/ccl/serverccl/server_startup_guardrails_test.go
line 62 at r3 (raw file):
test.storageBinaryVersion, test.storageBinaryMinSupportedVersion, false, /* initializeVersion */
Hmm, I'm surprised that we're not initializing the version here. If we don't initialize the version, what does the active cluster version get set to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @ajwerner, and @knz)
pkg/ccl/serverccl/server_startup_guardrails_test.go
line 62 at r3 (raw file):
Previously, ajstorm (Adam Storm) wrote…
Hmm, I'm surprised that we're not initializing the version here. If we don't initialize the version, what does the active cluster version get set to?
It gets set to the binary version and if we attempt to initialize it we get: cannot initialize version to 21.1 because already set to: 21.2
. Note that binary version is 21.2
and min supported
is 21.1
. I am still trying to trace how it gets to 21.2
but I am guessing serverutils.StartServer
somehow ends initializing the version regardless.
So the |
fc07b22
to
d8fdf12
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @ajwerner, and @knz)
pkg/server/server_sql.go
line 1494 at r3 (raw file):
Previously, ajstorm (Adam Storm) wrote…
nit: We might want to consider wrapping this error message with a hint (
errors.WithHint()
) to explain that the solution to this problem is to move to a higher tenant binary. Should be obvious to most, but maybe not all.
Done
pkg/server/settingswatcher/settings_watcher.go
line 459 at r3 (raw file):
Previously, ajstorm (Adam Storm) wrote…
Is there something more user-friendly we can return here? If this happens, the tenant is corrupted, right? Is there any self-service option that the user can perform? I'd imagine not, so perhaps the action they'll need to take is to open a support ticket.
Since this is a fatal error, the logs will have something like the following right?
F230119 22:39:04.501381 15 clusterversion/setting.go:124 [T1] 3 !This node experienced a fatal error (printed above), and as a result the
F230119 22:39:04.501381 15 clusterversion/setting.go:124 [T1] 3 !process is terminating.
F230119 22:39:04.501381 15 clusterversion/setting.go:124 [T1] 3 !
F230119 22:39:04.501381 15 clusterversion/setting.go:124 [T1] 3 !Fatal errors can occur due to faulty hardware (disks, memory, clocks) or a
F230119 22:39:04.501381 15 clusterversion/setting.go:124 [T1] 3 !problem in CockroachDB. With your help, the support team at Cockroach Labs
F230119 22:39:04.501381 15 clusterversion/setting.go:124 [T1] 3 !will try to determine the root cause, recommend next steps, and we can
F230119 22:39:04.501381 15 clusterversion/setting.go:124 [T1] 3 !improve CockroachDB based on your report.
F230119 22:39:04.501381 15 clusterversion/setting.go:124 [T1] 3 !
F230119 22:39:04.501381 15 clusterversion/setting.go:124 [T1] 3 !Please submit a crash report by following the instructions here:
F230119 22:39:04.501381 15 clusterversion/setting.go:124 [T1] 3 !
F230119 22:39:04.501381 15 clusterversion/setting.go:124 [T1] 3 ! https://github.com/cockroachdb/cockroach/issues/new/choose
F230119 22:39:04.501381 15 clusterversion/setting.go:124 [T1] 3 !
F230119 22:39:04.501381 15 clusterversion/setting.go:124 [T1] 3 !If you would rather not post publicly, please contact us directly at:
F230119 22:39:04.501381 15 clusterversion/setting.go:124 [T1] 3 !
F230119 22:39:04.501381 15 clusterversion/setting.go:124 [T1] 3 ! [email protected]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @ajwerner, and @knz)
pkg/ccl/serverccl/server_startup_guardrails_test.go
line 62 at r3 (raw file):
Previously, ajstorm (Adam Storm) wrote…
Thanks for looking into this. Your explanation resolves my question. I'm wondering though, if we want to sprinkle a few comments down below to explain how this test works. For example, it might not be immediately obvious to future readers that the tenant below gets created with TLV = SBV (I'm assuming, because it's created and assigned the storage cluster's active version at the time). The more context you add down there, the less time future engineers will need to spend understanding this code before they can modify it.
All that being said, I'm fine with this PR now, but I do think that we still need @ajwerner to have a look before you can merge.
I added more context, let me know if I need to add something else.
11f3fdc
to
a9cefae
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajstorm, @healthy-pod, and @knz)
pkg/server/settings_cache.go
line 147 at r6 (raw file):
) error { dec := settingswatcher.MakeRowDecoder(codec) var alloc *tree.DatumAlloc
nit: this should allocate an alloc or explicitly pass nil. I find declaring this variable to be confusing.
pkg/server/settingswatcher/settings_watcher.go
line 48 at r6 (raw file):
dec RowDecoder storage Storage allocForHandleKV tree.DatumAlloc
nit: I don't think this has to be on the struct -- it can just be captured by the closure if you wanted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm good too, mod nits.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @healthy-pod, and @knz)
pkg/ccl/serverccl/server_startup_guardrails_test.go
line 68 at r6 (raw file):
// true with an assertion below. This is needed because in some test cases we want to ensure // the active version of this server is greater than the binary version of the tenant. By knowing // that the SBV is higher than TBV and SLV is equal to SBV we can be sure that SLV is higher than
Nit: Nice comment. My only concern is that SBV, TBV, SLV and TLV are not widely used terms. You might want to spell them out here instead of using acronyms.
a9cefae
to
c282dfe
Compare
Addressed the nits. bors r+ |
Build failed: |
c282dfe
to
d7c6f0c
Compare
d7c6f0c
to
d418a4c
Compare
The test I added here failed when #94774 was merged. I made some changes to the test and the tenant creation code to fix the test and make sure there is as few moving parts as possible. PTAL |
7ea194d
to
00eb938
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of my comments were addressed, so I'm good now.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @healthy-pod, and @knz)
pkg/ccl/serverccl/server_startup_guardrails_test.go
line 68 at r6 (raw file):
Previously, ajstorm (Adam Storm) wrote…
Nit: Nice comment. My only concern is that SBV, TBV, SLV and TLV are not widely used terms. You might want to spell them out here instead of using acronyms.
LGTM.
pkg/ccl/serverccl/server_startup_guardrails_test.go
line 96 at r9 (raw file):
) // The tenant will be created with an active version equal to the version corresponding to TenantLogicalVersionKey.
Extremely minor nit: try to keep comment lines to 80 characters or less.
d8170d9
to
5e29d40
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just nits, otherwise LGTM
) (clusterversion.ClusterVersion, error) { | ||
indexPrefix := s.codec.IndexPrefix(keys.SettingsTableID, uint32(1)) | ||
key := encoding.EncodeUvarintAscending(encoding.EncodeStringAscending(indexPrefix, "version"), uint64(0)) | ||
log.Infof(ctx, "getting tenant cluster version using key: '%v'", roachpb.Key(key)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: probably want to remove this or put it behind some vmodule before merging.
pkg/server/settings_cache.go
Outdated
@@ -144,7 +144,7 @@ func initializeCachedSettings( | |||
) error { | |||
dec := settingswatcher.MakeRowDecoder(codec) | |||
for _, kv := range kvs { | |||
settings, val, _, err := dec.DecodeRow(kv) | |||
settings, val, _, err := dec.DecodeRow(kv, nil) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: name the nil
@@ -777,6 +779,10 @@ var versionsSingleton = func() keyedVersions { | |||
// simply need to check is that the cluster has upgraded to 23.1. | |||
var V23_1 = versionsSingleton[len(versionsSingleton)-1].Key | |||
|
|||
const ( | |||
BinaryMinSupportedVersionKey = V22_1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dt do you see any reason to not expose this like this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cockroach/pkg/clusterversion/clusterversion.go
Lines 241 to 244 in b258d20
// BinaryMinSupportedVersion implements the Handle interface. | |
func (v *handleImpl) BinaryMinSupportedVersion() roachpb.Version { | |
return v.binaryMinSupportedVersion | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fwiw I did not end using BinaryMinSupportedVersionKey
because it's different from the tenant creation min version but I decided to keep this, assuming we will unify them? I can remove it. We need the key specifically rather than the version because we map the key to the initial values we want from https://github.com/cockroachdb/cockroach/blob/af0beec136262540f2e777c1b52a93ad21ac7481/pkg/sql/catalog/bootstrap/previous_releases.go#L26
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, I see. I guess what we could do is just write a test that asserts that the key here maps to the min binary version or something like that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and in a similar test (or in the same one) ensure that when BinaryVersionKey is passed to ByKey
it returns the same version as binaryVersion?
@@ -236,7 +238,7 @@ func (s *SettingsWatcher) handleKV( | |||
name, val, tombstone, err := s.dec.DecodeRow(roachpb.KeyValue{ | |||
Key: kv.Key, | |||
Value: kv.Value, | |||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I was saying here is that you could create a long-lived alloc
because this is single-threaded. It doesn't really matter I guess.
|
||
// GetFn is a function to retrieve data from the kv store. It is implemented | ||
// by both (*kv.DB).Get and (*kv.Txn).Get. | ||
type GetFn = func(context.Context, interface{}) (kv.KeyValue, error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given what we now know, maybe we just change this to take a *kv.Txn
or some interface called Txn
which just has this Get
method if you don't want to depend on the concrete impl.
af0beec
to
42b30ea
Compare
This code change prevents a SQL server from starting if its binary version is less than the tenant's active version. It makes `RowDecoder.DecodeRow` thread safe by making it accept a `*tree.DatumAlloc` and use a new alloc if its nil. The check runs after we run permanent upgrades because that's when the version setting is written to the settings table of the system tenant (and we want to apply this check to both system and secondary tenants). To guard against future changes to where we set the version, if GetClusterVersionFromStorage doesn't find a value for the version setting it will return an error because this means the version is not set yet. Release note: None Epic: CRDB-10829
42b30ea
to
6093dd3
Compare
TFTsecondRs! |
Build failed (retrying...): |
Build succeeded: |
This code change prevents a SQL server from starting if its
binary version is less than the tenant's active version.
It makes
RowDecoder.DecodeRow
thread safe by making itaccept a
*tree.DatumAlloc
and use a new alloc if its nil.The check runs after we run permanent upgrades because
that's when the version setting is written to the settings table
of the system tenant (and we want to apply this check to both system
and secondary tenants). To guard against future changes to where we
set the version, if GetClusterVersionFromStorage doesn't find a value for
the version setting it will return an error because this means the
version is not set yet.
Release note: None
Epic: CRDB-10829