pkg/migration: support waiting for minimum engine version #76085

Closed

Conversation

@nicktrav nicktrav (Collaborator) commented Feb 4, 2022

As part of the MVCC Bulk Operations project, Pebble range keys will
need to be enabled on all engines of a cluster before nodes can start
using the feature to read and write SSTables that contain the range key
features (a backward-incompatible change).

Adding a cluster version is necessary, but not sufficient, to guarantee
that nodes are ready to generate, and more importantly receive, SSTables
with range key support. Specifically, there exists a race
condition where nodes are told to update their engines as part of the
existing Pebble major format update process, but there is no
coordination between the nodes. One node (a sender) may complete its
engine upgrade and enable the new SSTable features before another node
(the receiver). The latter will panic on receipt of an SSTable with the
newer features written by the former.

Add a server RPC endpoint that provides a means of waiting on a node to
update its store to a version that is compatible with a cluster version.
This endpoint is used as part of a system migration to ensure that all
nodes in a cluster are running with an engine version that is at least
compatible with a given cluster version.

Expose the table format major version on storage.Engine. This will be
used elsewhere in Cockroach (for example, SSTable generation for ingest
and backup).

Add a WaitForCompatibleEngineVersion function on the storage.Engine
interface that provides a mechanism to block until an engine is running
at a format major version that is compatible with a given cluster version.
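
For illustration only, a minimal sketch of what such a blocking call could look like, assuming a simple poll loop; the `Engine` interface and `FormatMajorVersion` type below are stand-ins, not the actual `storage.Engine` or Pebble definitions from this PR:

```go
package enginewait

import (
	"context"
	"time"
)

// FormatMajorVersion is a stand-in for Pebble's format major version type.
type FormatMajorVersion uint64

// Engine is a stand-in for the subset of storage.Engine this sketch needs.
type Engine interface {
	// FormatMajorVersion reports the engine's current format major version.
	FormatMajorVersion() FormatMajorVersion
}

// WaitForCompatibleEngineVersion blocks until the engine reports a format
// major version at or above the one required by a given cluster version,
// or until the context is canceled. The engine ratchets its version
// asynchronously, so we simply poll.
func WaitForCompatibleEngineVersion(
	ctx context.Context, eng Engine, required FormatMajorVersion,
) error {
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for {
		if eng.FormatMajorVersion() >= required {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			// Poll again on the next tick.
		}
	}
}
```

A context deadline on the caller's side would bound how long the migration waits for a lagging store.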

Expose the engine format major version as a storage.TestingKnob to
allow tests to alter the Pebble format major version.

Add a new cluster version to coordinate the upgrade of all engines in a
cluster to pebble.FormatBlockPropertyCollector (Pebble, v1), and the
system migration required for coordinating the engine upgrade to the
latest Pebble table format version.

This patch also fixes an existing issue where a node may write SSTables
with block properties as part of a backup that are then ingested by an
older node. This patch provides the infrastructure necessary for making
these "cluster-external" SSTable operations engine-aware. Nodes should
only use a table format version that other nodes in the cluster
understand.

Informs cockroachdb/pebble#1339.

Release note: None

@nicktrav nicktrav requested review from itsbilal, jbowens, sumeerbhola and a team February 4, 2022 20:55
@nicktrav nicktrav requested a review from a team as a code owner February 4, 2022 20:55

@jbowens jbowens (Collaborator) left a comment

If I understand correctly, the reason we need additional infrastructure here is that we cannot satisfy this requirement of cluster version feature gating:

// At the same time, with requests/RPCs originating at other crdb nodes, the
// initiator of the request gets to decide what's supported. A node should
// not refuse functionality on the grounds that its view of the version gate
// is as yet inactive. Consider the sender:
//
// func invokeSomeRPC(req) {
//     if (specific-version is active) {
//         // Like mentioned above, this implies that all nodes in the
//         // cluster are running binaries that can handle this new
//         // feature. We may have learned about this fact before the
//         // node on the other end. This is due to the fact that migration
//         // manager informs each node about the specific-version being
//         // activated concurrently. See BumpClusterVersion for
//         // where that happens. Still, it's safe for us to enable the new
//         // feature flags as we trust the recipient to know how to deal
//         // with it.
//         req.NewFeatureFlag = true
//     }
//     send(req)
// }
//
// And consider the recipient:
//
// func someRPC(req) {
//     if !req.NewFeatureFlag {
//         // Legacy behavior...
//     }
//     // There's no need to even check if the specific-version is active.
//     // If the flag is enabled, the specific-version must have been
//     // activated, even if we haven't yet heard about it (we will pretty
//     // soon).
// }
//
// See clusterversion.Handle.IsActive and usage of some existing versions
// below for more clues on the matter.

When it comes to low-level engine features, the receiver cannot respect request-level feature requests until its own engine is upgraded.

My understanding is still a little muddled though. In pkg/clusterversion there's:

// Handle is the interface through which callers access the active cluster
// version and this binary's version details.
type Handle interface {
	// ActiveVersion returns the cluster's current active version: the minimum
	// cluster version the caller may assume is in effect.
	//
	// ActiveVersion fatals if the cluster version setting has not been
	// initialized (through `Initialize()`).
	ActiveVersion(context.Context) ClusterVersion

	// ...
}

I'm not sure how exactly to interpret "the minimum cluster version the caller may assume is in effect." The way I thought all this worked was the cluster starts at version X. Every node applies migrations for version X+1. Then, every node calls WriteClusterVersion/(storage.Engine).SetMinVersion to persist the new active cluster version of X+1. If that's the case, we could put the ratcheting to pebble.FormatBlockPropertyCollector in a migration that precedes clusterversion.PebbleFormatVersionBlockProperties. That would ensure that a PebbleFormatVersionBlockProperties cluster version implies all stores are running the pebble.FormatBlockPropertyCollector format major version.
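
To make the ordering concrete, here's a rough sketch of the invariant that would give us; the first version key and the helper below are illustrative, not the real clusterversion definitions:

```go
package versiongates

// ClusterVersionKey is a stand-in for clusterversion's ordered version keys.
type ClusterVersionKey int

const (
	// PebbleFormatBlockPropertyCollectorRatchet (illustrative name): the
	// migration attached to this version ratchets every store to
	// pebble.FormatBlockPropertyCollector before any node can move on.
	PebbleFormatBlockPropertyCollectorRatchet ClusterVersionKey = iota

	// PebbleFormatVersionBlockProperties: because it is ordered after the
	// ratcheting version, its activation implies every store already runs
	// the new format major version.
	PebbleFormatVersionBlockProperties
)

// CanWriteBlockProperties captures the resulting invariant: checking the
// later gate alone is sufficient, since versions activate in the order
// they are declared above.
func CanWriteBlockProperties(active ClusterVersionKey) bool {
	return active >= PebbleFormatVersionBlockProperties
}
```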

Reviewed 3 of 16 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal and @sumeerbhola)

@nicktrav nicktrav (Collaborator, Author) left a comment
When it comes to low-level engine features, the receiver cannot respect request-level feature requests until its own engine is upgraded.

Yeah, this was the motivation. It was the following line that had me worried:

we will pretty soon

The word "soon". If the receiver was seeing an SSTable with a new table format, I'm not sure it's enough to just accept it without having ratcheted the store, even if the binary has the version to support it?

There was also this comment:

	// If this returns true then all nodes in the cluster will eventually see
	// this version. However, this is not atomic because version gates (for a
	// given version) are pushed through to each node concurrently. Because of
	// this, nodes should not be gating proper handling of remotely initiated
	// requests that their binary knows how to handle on this state. The
	// following example shows why this is important:

The word "eventually". Which seems particularly relevant to the backup and ingestion case. My thinking was that it's necessary that IsActive returns true at the desired version, but not sufficient, in that while a node may be running a new binary that supports the newer table version, internal to Pebble (both on disk and the fields on the *DB in memory) we're running through the code-paths for the old version, which could choke on the newer features.

If that's the case, we could put the ratcheting to pebble.FormatBlockPropertyCollector in a migration that precedes clusterversion.PebbleFormatVersionBlockProperties. That would ensure that a PebbleFormatVersionBlockProperties cluster version implies all stores are running the pebble.FormatBlockPropertyCollector format major version.

Same thing here. Isn't this only saying that the binary version is up-to-date? Isn't there still an element of asynchronicity in the way in which the store is ratcheted?

In talking with @sumeerbhola and @itsbilal, it sounded as if we needed some form of "barrier", after which point we know for certain that all stores have actually been ratcheted up. The two-phase version approach was floated, but the "eventually" / "soon" thing had me questioning it.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal and @sumeerbhola)

@nicktrav nicktrav force-pushed the nickt.engine-migration branch from 66db578 to bdbd428 on February 7, 2022 19:24
@jbowens jbowens (Collaborator) left a comment

Isn't this only saying that the binary version is up-to-date?

That's not my understanding. My understanding is once WriteClusterVersion(x) is invoked, you're guaranteed that all nodes have applied all migrations preceding x.

Isn't there still an element of asynchronicity in the way in which the store is ratcheted?

The call to WriteClusterVersion(x) is performed asynchronously, but if we perform the ratcheting in a migration that precedes x, it doesn't matter. We can always use the persisted cluster version as the indicator of what table format to use. (Note that the inverse is not true: this scheme means that we cannot read the local engine's format major version and then decide that it's okay to send that format major version's table format to other nodes. I think it's slightly cleaner to use the cluster version rather than the local engine's format major version anyway, because the cluster version is the mechanism intended to control versioning of inter-node communication more generally.)
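
As a sketch of that decision rule (the types and the gate check below are stand-ins rather than the real clusterversion or sstable APIs):

```go
package tableformat

// TableFormat is a stand-in for Pebble's SSTable format enum.
type TableFormat int

const (
	TableFormatLegacy TableFormat = iota
	TableFormatBlockProperties
)

// VersionGate is a stand-in for checking a cluster version gate, in the
// spirit of clusterversion.Handle.IsActive.
type VersionGate interface {
	IsActive(gate string) bool
}

// TableFormatForWrite picks the SSTable format for tables that may be sent
// to other nodes (ingest, backup). It keys off the cluster version gate,
// never off the local engine's format major version, because the gate only
// activates after the preceding migration has ratcheted every store.
func TableFormatForWrite(gates VersionGate) TableFormat {
	if gates.IsActive("PebbleFormatVersionBlockProperties") {
		return TableFormatBlockProperties
	}
	return TableFormatLegacy
}
```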

Tagging @tbg and @irfansharif for thoughts, because they helped me understand the existing infrastructure previously.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal, @jbowens, and @sumeerbhola)

@nicktrav nicktrav (Collaborator, Author) left a comment

I think I found what I was originally looking for (but I missed the edit in the node-additions section of the RFC, on adding the fenced versions), in the comments in (*migrationmanager.Manager).Migrate().

We loop over all nodes in the cluster first applying the fenced version, and then the "real" version. The "two-phase" approach that @sumeerbhola and I had originally talked about should work, and we don't need this extra RPC dance to wait for the engine versions.

I'm also wondering if we can get away with using the internal "fenced" version as our first version (Pebble would bump its store version when it reaches the fenced version), and the second ("real") version would actually enable the feature (at which point we know all stores have been ratcheted).

I think we can close this PR in favor of the original approach, though I'll keep it open until I get confirmation from @tbg and / or @irfansharif on the approach.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal, @jbowens, and @sumeerbhola)

@irfansharif (Contributor) commented

I'm also wondering if we can get away with using the internal "fenced" version as our first version (Pebble would bump its store version when it reaches the fenced version), and the second ("real") version would actually enable the feature (at which point we know all stores have been ratcheted).

Instead of using the "fenced" versions, which I think should really only be used for pkg/migration's internal book-keeping, could we instead introduce two cluster versions? It's entirely possible/sound to use multiple cluster versions to chain a sequence of things in order, given the migrations infra guarantees that all nodes will see vX before even attempting to move up to vX+1. The migration infra also lets you attach arbitrary checks as part of vX or vX+1, so you could reasonably ensure that everything is up to snuff. One recent example of using multiple versions can be found here:

// We need a version gate here before switching over to the span configs
// infrastructure. In a mixed-version cluster we need to wait for
// the host tenant to have fully populated `system.span_configurations`
// (read: reconciled) at least once before using it as a view for all
// split/config decisions.
_ = clusterversion.EnsureSpanConfigReconciliation
//
// We also want to ensure that the KVSubscriber on each store is at least as
// up-to-date as some full reconciliation timestamp.
_ = clusterversion.EnsureSpanConfigSubscription
//
// Without a version gate, it would be possible for a replica on a
// new-binary-server to apply the static fallback config (assuming no
// entries in `system.span_configurations`), in violation of explicit
// configs directly set by the user. Though unlikely, it's also possible for
// us to merge all ranges into a single one -- with no entries in
// system.span_configurations, the infrastructure can erroneously conclude
// that there are zero split points.
//
// We achieve all this through a three-step migration process, culminating
// in the following cluster version gate:
_ = clusterversion.EnableSpanConfigStore

I'm not fully following the specifics of the PR, but happy to jump on a call and brainstorm.
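
For what it's worth, applied to this PR the chaining could look roughly like the sketch below; the Store interface, helper name, and version names are illustrative, and the real pkg/migration registration API differs:

```go
package pebblemigration

import "context"

// Store is a stand-in for a node-local store whose engine's format major
// version can be ratcheted.
type Store interface {
	RatchetFormatMajorVersion(ctx context.Context) error
}

// ratchetAllLocalStores is what the migration attached to the first
// version (say, "PebbleFormatBlockPropertyCollector") would do on each
// node: bump every local store's format major version and only return
// once that has happened.
func ratchetAllLocalStores(ctx context.Context, stores []Store) error {
	for _, s := range stores {
		if err := s.RatchetFormatMajorVersion(ctx); err != nil {
			return err
		}
	}
	return nil
}

// The second version (say, "EnablePebbleBlockProperties") needs no
// migration of its own; it is purely a gate. By the time it is active,
// every node has run ratchetAllLocalStores, so the new SSTable features
// are safe to use for inter-node traffic.
```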

@nicktrav nicktrav (Collaborator, Author) commented Feb 8, 2022

could we instead introduce two cluster versions

Yep, we can, and shall do.

given the migrations infra guarantees that all nodes will see vX before even attempting to move up to vX+1

Great. This is the guarantee we need.

One recent example of using multiple versions can be found here:

Thank you for this! Super useful.

Closing. TFTRs!

@nicktrav nicktrav closed this Feb 8, 2022
@nicktrav nicktrav deleted the nickt.engine-migration branch February 8, 2022 15:10