
[wip,dnm] storage: introduced dedicated raft storage #16809

Closed

Conversation

irfansharif
Copy link
Contributor

@irfansharif irfansharif commented Jun 30, 2017

Implements #16361.

This is a breaking change. To see why, consider that prior to this we
stored all consensus data, in addition to all system metadata and user
level keys, in the same, single RocksDB instance. Here we introduce a
separate, dedicated instance for raft data (log entries and
HardState). Cockroach nodes simply restarting with these changes, unless
migrated properly, will fail to find the most recent raft log entries
and HardState data in the new RocksDB instance.

Also consider a cluster running mixed versions (nodes with dedicated
raft storage and nodes without): what would the communication between
nodes look like in light of proposer-evaluated KV? Currently we
propagate a `storagebase.WriteBatch` through raft containing a
serialized representation of a RocksDB write batch; this models the
changes to be made to the single underlying RocksDB instance. For log
truncation requests, where we delete log entries, and/or admin splits,
where we write the initial HardState for newly formed replicas, we need
to similarly propagate a write batch (through raft) addressing the new
RocksDB instance (if the recipient node is one with these changes) or
the original RocksDB instance (if the recipient node is one without
these changes). What if an older-version node is the raft leader and is
therefore the one upstream of raft, propagating `storagebase.WriteBatches`
with raft data changes but addressed to the original RocksDB instance?
What would rollbacks look like?

To this end we introduce two modes of operation,
transitioningRaftStorage and enabledRaftStorage (the latter is implicit
if we're not in transitioning mode). We've made it so that it is safe
to transition from an older cockroach version to
transitioningRaftStorage, from transitioningRaftStorage to enabled, and
the reverse for rollbacks. The transition from one mode to the next
takes place once all the nodes in the cluster are on the same previous
mode. The operation mode is set by a user-level flag when starting a
cockroach binary: `cockroach start --transitioning ...`

- In the old version we use a single RocksDB instance for both raft
  and user-level KV data
- In transitioningRaftStorage mode we use both RocksDB instances (the
  raft-specific one and the regular one) for raft data interoperably.
  We use this mode to facilitate rolling upgrades
- In enabled mode we use the dedicated RocksDB instance for raft data.
  Raft log entries and the HardState are stored on this instance alone

Most of this commit is careful plumbing of an extra
engine.{Engine,Batch,Reader,Writer,ReadWriter} for whenever we need to
interact with the new RocksDB instance.
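
For illustration, the transitioning-mode write rule boils down to something like the following sketch; the names, interface, and key layout are made up for the example and are not the actual code in this PR:

```go
package main

import "fmt"

// raftStorageMode mirrors the two modes described above; illustrative only.
type raftStorageMode int

const (
	transitioningRaftStorage raftStorageMode = iota // write raft data to both engines
	enabledRaftStorage                              // write raft data to the dedicated engine only
)

// kvWriter stands in for the small slice of the engine interface needed here.
type kvWriter interface {
	Put(key, value []byte) error
}

// putRaftData applies the interoperability rule: in transitioning mode raft
// data (log entries, HardState) is written to both engines so either binary
// version can find it; in enabled mode only the dedicated raft engine is used.
func putRaftData(mode raftStorageMode, base, raft kvWriter, key, value []byte) error {
	if err := raft.Put(key, value); err != nil {
		return err
	}
	if mode == transitioningRaftStorage {
		return base.Put(key, value)
	}
	return nil
}

type memEngine map[string][]byte

func (m memEngine) Put(key, value []byte) error { m[string(key)] = value; return nil }

func main() {
	base, raft := memEngine{}, memEngine{}
	_ = putRaftData(transitioningRaftStorage, base, raft, []byte("/r1/raft/log/5"), []byte("entry"))
	fmt.Println(len(base), len(raft)) // 1 1: the raft key lives in both engines while transitioning
}
```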

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@irfansharif
Copy link
Contributor Author

irfansharif commented Jun 30, 2017

note: The point above about having older versions upstream of raft is as yet unresolved; I have documented a solution, but it's not the cleanest one. Before addressing that I'd like to first know how I can test these cross-version migrations. The same is true for the offline store-level migration utility (?) described in the RFC; I'm not sure where to 'fit' this in.
edit: the failing acceptance tests tell me the answer lies in acceptance tests.
edit: Store.Start is a good place to run this migration when detecting transition across major versions.

There's no RocksDB tuning here as yet; that will follow after this.
re: Tests, so far my "testing" has been reading from the new engine and the old one and verifying they're exactly identical. Will add some unit tests, but they'll be limited in scope; between the existing raft log + snapshot tests there's little ground left to cover.

@irfansharif irfansharif force-pushed the dedicated-raft-storage branch 4 times, most recently from 8e21b33 to 34a75d0 Compare June 30, 2017 08:03
@irfansharif irfansharif force-pushed the dedicated-raft-storage branch from 34a75d0 to 1fc8479 Compare July 1, 2017 07:56
@bdarnell
Copy link
Contributor

bdarnell commented Jul 2, 2017

Yes, the acceptance tests (specifically the ones in reference_test.go) are currently our best tests of upgrade-related issues. They use a docker image that contains an old version of the binary (this should probably be refactored so we can use multiple older versions).

I wonder if it would be better to encapsulate most of this at the Engine level: instead of two Engines, we could have a new implementation of the Engine interface that wraps two other Engines and switches between them based on the keys it is given.

I had been envisioning the transition as two states instead of three: Whenever a new binary starts up, it is in the "transitioning" state, and it is still possible to roll back to the old version. When the admin changes to the "enabled" state, it is no longer possible to roll back to an older version, although it should be possible to roll back to the "transitioning" state of the new binary. (The old binary is effectively the "disabled" state). The three-state version is nice if we can do it (the fewer restrictions we place on rollbacks, the better), but I worry that it will be tricky to implement and under-tested (once you've transitioned to the "disabled" state, when does it become safe to do the downgrade?)

Store.Start is a good place to run a migration if a restart is required to make the change. We'd prefer not to do that (although it's not out of the question). Instead of a migration that copies data from one engine to the other, I wonder if it would be better to just let the truncation process happen naturally (reading from both engines in the meantime).
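
A bare-bones sketch of that wrapper idea, with a deliberately tiny stand-in interface (the real engine.Engine interface is much larger, and this is only an illustration of the suggestion, not what the PR implements):

```go
package main

import (
	"bytes"
	"fmt"
)

// store stands in for a minimal engine interface.
type store interface {
	Put(key, value []byte) error
	Get(key []byte) ([]byte, error)
}

// splitEngine wraps two stores and routes every operation by key. In
// CockroachDB terms, isRaftKey would match the HardState and raft-log
// prefixes of any range; here it's just a caller-supplied predicate.
type splitEngine struct {
	base, raft store
	isRaftKey  func(key []byte) bool
}

func (s *splitEngine) pick(key []byte) store {
	if s.isRaftKey(key) {
		return s.raft
	}
	return s.base
}

func (s *splitEngine) Put(key, value []byte) error    { return s.pick(key).Put(key, value) }
func (s *splitEngine) Get(key []byte) ([]byte, error) { return s.pick(key).Get(key) }

type mem map[string][]byte

func (m mem) Put(key, value []byte) error    { m[string(key)] = value; return nil }
func (m mem) Get(key []byte) ([]byte, error) { return m[string(key)], nil }

func main() {
	e := &splitEngine{
		base: mem{}, raft: mem{},
		isRaftKey: func(k []byte) bool { return bytes.HasPrefix(k, []byte("/raft/")) },
	}
	_ = e.Put([]byte("/raft/r1/hardstate"), []byte("hs")) // lands in the raft store
	_ = e.Put([]byte("/table/1/k"), []byte("v"))          // lands in the base store
	v, _ := e.Get([]byte("/raft/r1/hardstate"))
	fmt.Printf("%s\n", v)
}
```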


Reviewed 29 of 29 files at r1, 1 of 1 files at r2.
Review status: all files reviewed at latest revision, 8 unresolved discussions, some commit checks failed.


pkg/server/config.go, line 406 at r2 (raw file):

// CreateEngines creates Engines based on the specs in cfg.Stores.
func (cfg *Config) CreateEngines(ctx context.Context) (Engines, Engines, error) {

I'd generally prefer to return an array of (engine, raftEngine) pairs instead of two parallel slices, unless it's too painful to make that refactoring.

The comment on this method needs to be updated with the new return values.
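
For illustration, the two shapes under discussion, with placeholder types (the real code returns server.Engines, i.e. slices of engine.Engine):

```go
package example

// Engine is a placeholder for engine.Engine.
type Engine interface{}

// Parallel slices: callers must keep engines[i] and raftEngines[i]
// associated by index, which is easy to get wrong.
func createEnginesParallel() (engines, raftEngines []Engine, _ error) {
	return nil, nil, nil
}

// Paired form: the association is carried by the type and cannot drift.
type enginePair struct {
	Engine     Engine // base engine: system metadata and user-level KV data
	RaftEngine Engine // dedicated engine: raft log entries and HardState
}

func createEnginePairs() ([]enginePair, error) {
	return nil, nil
}
```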


pkg/server/config.go, line 411 at r2 (raw file):

	var raftEngines Engines
	if storage.TransitioningRaftStorage || storage.EnabledRaftStorage {

This repeated expression should perhaps be encapsulated in some global function.
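
i.e. something along these lines (sketch; the two variables are the ones introduced in the diff, the helper name is made up):

```go
// Sketch, in pkg/storage terms.
package storage

var TransitioningRaftStorage, EnabledRaftStorage bool

// DedicatedRaftStorageInUse reports whether a dedicated raft engine is in
// use at all, i.e. whether we're in transitioning or enabled mode.
func DedicatedRaftStorageInUse() bool {
	return TransitioningRaftStorage || EnabledRaftStorage
}
```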


pkg/server/server.go, line 683 at r2 (raw file):

	}
	s.stopper.AddCloser(&s.engines)
	if storage.TransitioningRaftStorage || storage.EnabledRaftStorage {

Refer to the configuration variables as little as possible. Here, for example, you should just close s.raftEngines unconditionally.

As a general rule I'd suggest creating the raft engine (and maybe batches) unconditionally, and the current transition state would only be used to A) decide which batch to write to and B) to skip operations that would impact the overall performance (like committing the write batch).


pkg/storage/client_split_test.go, line 1217 at r2 (raw file):

func TestSplitSnapshotRace_SnapshotWins(t *testing.T) {
	defer leaktest.AfterTest(t)()
	t.Skip()

FYI this test has been flaky on master recently, so it may not be your fault.


pkg/storage/client_test.go, line 1003 at r2 (raw file):

	ctx := context.TODO()
	startKey := m.findStartKeyLocked(rangeID)
	log.Infof(context.TODO(), "skey: %v", startKey)

Remember to remove this before merging.


pkg/storage/replica_raftstorage.go, line 166 at r2 (raw file):

		// Was the missing index after the last index?
		// TODO(irfansharif): Explore writing last index to raft engine.

Yes, it sounds like a good idea to move the last index to the raft engine.


pkg/storage/replica_raftstorage.go, line 180 at r2 (raw file):

	// No results, was it due to unavailability or truncation?
	// TODO(irfansharif): Explore writing truncated state to raft engine.

I'm less sure about this one but it's probably a good idea to move TruncatedState too.


pkg/storage/store.go, line 158 at r2 (raw file):

)

// TODO(irfansharif): Changing this to a cluster setting instead makes it

Yes, this should be a cluster setting (which means that the global variables will have to be changed to functions).


Comments from Reviewable

@tbg
Copy link
Member

tbg commented Jul 3, 2017

but I worry that it will be tricky to implement and under-tested (once you've transitioned to the "disabled" state, when does it become safe to do the downgrade?)

Can't we make this downgrade path safe indefinitely? When a binary starts up in transitioning, it writes from its new Raft storage back to the old one if the new one is "ahead" (and vice versa), and in normal operation maintains both. What's not clear to me is what the update would look like from a user's perspective and how they could get it wrong. Naively, one would think they'd do

  • copy the new binaries
  • run the new binaries (which automatically puts them in the transitioning state). Everything works, but it's slower because it maintains both engines. Periodically warns about the transitioning state and the need for a further step.
  • rolling restart, moving each binary to the full enabled state
  • the env variable becomes obsolete in the next release. We somehow make sure that you can't skip over the release that introduced the flag.

Perhaps instead, we should require the explicit TRANSITIONING mode flag so that the extra step removes the env var (that will soon become irrelevant) again.

Downgrades work the opposite way: set the transitioning flag, rolling restart, copy the old binaries, rolling restart. Does seem testable enough to me with a handful of acceptance tests.

Or am I completely misunderstanding how the up/downgrade process is envisioned?

I wonder if it would be better to encapsulate most of this at the Engine level: instead of two Engines, we could have a new implementation of the Engine interface that wraps two other Engines and switches between them based on the keys it is given.

Maybe my opinion will change as I actually review the code, but on an abstract level I appreciate the explicitness the as-is approach brings.

Edit: OK, reviewed the change and the impression remains.


Reviewed 29 of 29 files at r1.
Review status: all files reviewed at latest revision, 27 unresolved discussions, some commit checks failed.


pkg/server/config.go, line 526 at r1 (raw file):

	engines = nil
	raftEnginesCopy := raftEngines
	raftEngines = nil

While you're here, add a comment on this nil dance and how it ties in with the defer at the beginning of the method.


pkg/server/config.go, line 411 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This repeated expression should perhaps be encapsulated in some global function.

can't this be unconditional? If there aren't any raft engines, it would be a no-op.


pkg/server/config_test.go, line 45 at r1 (raw file):

	}
	if storage.TransitioningRaftStorage || storage.EnabledRaftStorage {
		defer raftEngines.Close()

can't this be unconditional? If there aren't any raft engines, it would be a no-op.


pkg/server/config_test.go, line 69 at r1 (raw file):

	}
	defer engines.Close()
	if storage.TransitioningRaftStorage || storage.EnabledRaftStorage {

ditto.


pkg/server/node.go, line 387 at r1 (raw file):

	n.startGossip(n.stopper)

	log.Infof(ctx, "%s: started with %v engine(s), %v raft engines and attributes %v", n, engines, raftEngines, attrs.Attrs)

s/engines/engine(s)/


pkg/server/server.go, line 683 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Refer to the configuration variables as little as possible. Here, for example, you should just close s.raftEngines unconditionally.

As a general rule I'd suggest creating the raft engine (and maybe batches) unconditionally, and the current transition state would only be used to A) decide which batch to write to and B) to skip operations that would impact the overall performance (like committing the write batch).

👍


pkg/storage/replica.go, line 4251 at r1 (raw file):

		if TransitioningRaftStorage || EnabledRaftStorage {
			// TODO(irfansharif): Is it ever the case that we have an empty

Wouldn't that be the common case? Might be missing something, but my intuition is that RaftData would only be nontrivial for special commands like ChangeReplica, Split, etc.
Clarifying comment would help.


pkg/storage/replica.go, line 4258 at r1 (raw file):

			// we're in TransitioningRaftStorage mode we should ensure that
			// data (log entries and HardState) is copied over to the new
			// engine.

This is where Ben's suggestion looks like it would be more straightforward, but for it to work we'd need to make ApplyBatchRepr do the multiplexing into two underlying engines as well, which is more trouble than it's worth. Doing this explicitly looks like the way to go: I think it's fairly straightforward (since the condition on keys should be just matching a small set of prefixes) to make a command that takes a WriteBatch and extracts from it two write batches belonging to the regular and Raft engine, respectively.
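
A sketch of that splitting step, assuming the batch has already been decoded into individual operations (decoding the RocksDB WriteBatch repr and the exact key prefixes are glossed over; names here are made up):

```go
package example

import "bytes"

// op is a stand-in for one decoded operation of a proposed write batch.
type op struct {
	Key, Value []byte
}

// splitOps partitions a proposal's operations into those destined for the
// dedicated raft engine and those for the base engine. In the real code the
// predicate would match the small set of range-ID-local prefixes
// (HardState, raft log entries).
func splitOps(ops []op, isRaftKey func([]byte) bool) (raftOps, baseOps []op) {
	for _, o := range ops {
		if isRaftKey(o.Key) {
			raftOps = append(raftOps, o)
		} else {
			baseOps = append(baseOps, o)
		}
	}
	return raftOps, baseOps
}

// Example predicate using a hypothetical prefix.
var isRaftKeyExample = func(key []byte) bool { return bytes.HasPrefix(key, []byte("/raft/")) }
```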


pkg/storage/replica.go, line 4300 at r1 (raw file):

	start := timeutil.Now()
	isLogTruncationRequest := rResult.RaftLogDelta != nil

Add a comment.


pkg/storage/replica_command.go, line 3347 at r1 (raw file):

	if TransitioningRaftStorage || EnabledRaftStorage {
		localRangeIDUnreplicatedPrefix := keys.MakeRangeIDUnreplicatedPrefix(rightRangeID)
		if _, _, _, err := engine.MVCCDeleteRange(ctx, raftBatch, nil,

You could be more specific here by (basically) using makeRaftEngineKeyRanges. Not sure if it's worth it, but it would be more explicit about what's actually being removed.


pkg/storage/replica_data_iter.go, line 56 at r1 (raw file):

// makeRaftEngineKeyRanges returns two key ranges, one for the HardState and
// one for the raft log entries associated for the given range descriptor.
func makeRaftEngineKeyRanges(d *roachpb.RangeDescriptor) []keyRange {

Only needs a RangeID.


pkg/storage/replica_raftstorage.go, line 64 at r1 (raw file):

	ctx := r.AnnotateCtx(context.TODO())
	// For uninitialized ranges, membership is unknown at this point.
	hs, err := r.mu.stateLoader.loadHardState(ctx, r.store.RaftEngine())

Moving the comment one line up seems spurious. The comment refers to the return statement (perhaps just move it there).


pkg/storage/replica_raftstorage.go, line 544 at r1 (raw file):

		}
		if TransitioningRaftStorage {
			err := engine.MVCCDelete(ctx, batch, nil, r.raftMu.stateLoader.RaftLogKey(i),

Your usual stats comment is missing.


pkg/storage/replica_raftstorage.go, line 638 at r1 (raw file):

		for _, keyRange := range makeRaftEngineKeyRanges(desc) {
			// The metadata ranges have a relatively small number of keys making usage

Looks like this comment was left accidentally.


pkg/storage/replica_raftstorage.go, line 166 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Yes, it sounds like a good idea to move the last index to the raft engine.

👍 Also seems like this change is the one to do it.


pkg/storage/replica_raftstorage.go, line 180 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I'm less sure about this one but it's probably a good idea to move TruncatedState too.

That would be good in order to facilitate faster log truncation, too. We currently shove all of that through Raft, but there's really no good reason afaict. The lease holder could just truncate its logs when it feels like it and let the followers know what to truncate to outside of Raft. If this PR were able to make the truncated index unreplicated (enough) to do this as a follow-up, that would be good; the transitioning state is a good opportunity for these kinds of things. Have to think it through, though.


pkg/storage/replica_raftstorage_test.go, line 116 at r1 (raw file):

}

func BenchmarkReplicaRaftStorage(b *testing.B) {

The benchmark name is pretty generic. Wouldn't hurt to mention you're doing serial puts; it's arguable whether this benchmark is really specific to raft storage.

But this makes me think: when in the new mode, what do all the other benchmarks run with? Will they need updates as well to run with "realistic" Raft engines, too?


pkg/storage/replica_raftstorage_test.go, line 118 at r1 (raw file):

func BenchmarkReplicaRaftStorage(b *testing.B) {
	for _, valueSize := range []int{1 << 10, 1 << 12, 1 << 14, 1 << 16, 1 << 18, 1 << 20} {
		b.Run(fmt.Sprintf("vs=%d", valueSize), func(b *testing.B) {

nit: s/vs/valSize/.


pkg/storage/store.go, line 137 at r1 (raw file):

// Consider the example where we introduced a dedicated RocksDB instance for
// raft data where the following modes are used. Briefly, the major version
// with this feature stored raft data (log entries and raft HardState) in a

nit: s/stored/stores/

s/log entries/such as log entries/ (anticipating that we might add lastIndex, and perhaps truncated state)


pkg/storage/store.go, line 178 at r1 (raw file):

//   by copying over all existing raft data (log entries + HardState) into the new
//   dedicated raft engine
// - Nodes will be restarted to run in this mode, they will be able to

Grammar is off throughout this sentence.


pkg/storage/store.go, line 1256 at r1 (raw file):

			// TODO(irfansharif): Will need to copy over hard state + log
			// entries for each range if running in transitioning mode and we
			// were on an old cockroach version before.

Or if we were on a new version before, need to copy from dedicated engine to base engine.


pkg/storage/store.go, line 2428 at r1 (raw file):

		capacity.Capacity += raftEngCapacity.Capacity
		capacity.Available += raftEngCapacity.Available

This isn't really right, but probably the most straightforward way to handle it for now. Should make sure it isn't getting lost, though. Add a TODO?


pkg/storage/store.go, line 158 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Yes, this should be a cluster setting (which means that the global variables will have to be changed to functions).

What is your plan with the cluster setting, Ben? Nodes will need to rewrite on-disk data, which is hard to do once they're running. Would the idea be that nodes store the "last" cluster setting they've seen and apply it on restart? It's tricky to guarantee that all nodes have seen the latest setting, a problem you don't have with an env var. We could provide a helper - ./cockroach migrate raft.dedicated_storage='transitioning' - which would set the env var, wait for it to have propagated, and then prompt the user to do a rolling restart, after which they run ./cockroach migrate raft.dedicated_storage='enabled' and get another set of instructions and a round of rolling restart. Or what were you envisioning? I find the lack of explicit control troubling as it increases the chance of finding yourself in unscripted situations. Then again, setting env vars may not be the best solution for some deployments.


pkg/storage/engine/rocksdb.go, line 324 at r1 (raw file):

// NewTestRocksDB allocates and returns an on-disk RocksDB instance. The
// caller must call the engine's Close method when the engine is no longer
// needed which will delete the files stored on disk.

I'm always wary of this pattern. If you pass in a dir, I think the caller should have to clean up the directory. Alternatively, the caller can pass an existing dir and the engine promises to remove all its files when it's closed (i.e. you'd just make a tempdir inside of dir and rmdir that at the end).
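
The second pattern could look roughly like this (sketch only, not the actual NewTestRocksDB):

```go
package example

import "os"

// tempStore illustrates the "tempdir inside the caller's dir" pattern: the
// caller owns parentDir, the store makes its own subdirectory and removes
// only that subdirectory on Close.
type tempStore struct {
	dir string
}

func newTempStore(parentDir string) (*tempStore, error) {
	d, err := os.MkdirTemp(parentDir, "test-store-")
	if err != nil {
		return nil, err
	}
	return &tempStore{dir: d}, nil
}

// Close removes the store's files but leaves the caller-provided parent
// directory alone.
func (s *tempStore) Close() error {
	return os.RemoveAll(s.dir)
}
```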


Comments from Reviewable

@a-robinson
Copy link
Contributor

I had been thinking along the same lines as @bdarnell, expecting two states rather than three. Some of my thoughts:

  • Having to monkey with multiple env var values in order to finish an upgrade is a pretty awkward user experience
  • Downgrade could be supported in the two-phase system by adding a reverse command that does the copy @tschottdorf mentioned about moving back from enabled into transitioning - "it writes from its new Raft storage back to the old one if the new one is ahead". That way we'd have a way of switching a node between its two states (disabled and enabled) and thus supporting downgrades.
  • I'm pretty wary of trying to maintain two raft logs at once while in the transitioning state. We're going to need a lot of error handling code to try to keep them in sync if one rocksdb instance is having problems and the other isn't. It's not quite the same as a distributed consensus problem, but it's still a consensus problem.
  • If we're able to write up logic for which keys go to which rocksdb downstream of raft without having to change the propEvalKV proto, it'd be really nice to avoid the headache of dealing with the interoperability concerns there (especially around downgrades -- it seems like users could easily mess something up in a way that could corrupt their raft state). Examining the keys and messing with the write batches downstream might cost too much performance though, for all I know.

Also, could you expand on what you mean by the below snippet from your commit message? I take it that the cluster would have to start tracking which nodes are in which modes?

Transition from one
mode to the next will take place when all the nodes in the cluster are
on the same previous mode

@irfansharif
Copy link
Contributor Author

I wonder if it would be better to encapsulate most of this at the Engine level: instead of two Engines, we could have a new implementation of the Engine interface that wraps two other Engines and switches between them based on the keys it is given

I still prefer the explicit as-is approach as opposed to making a two-pronged storage.Engine implementation. Naively I'd guess there'd be some perf overhead multiplexing writes using prefix matching for all keys.
Additionally, one thing that didn't make the cut was user-specified flags for where the raft storage engines would live (think `cockroach start --raft-stores`, akin to our existing `--stores`). I feel this would be cleaner to do with the raft engine plumbed through as is the case now (if we have m raft engines and n base engines, it'd be harder to swap things in and out if we throw two-pronged engines into the mix).

Store.Start is a good place to run a migration if a restart is required to make the change. We'd prefer not to do that (although it's not out of the question). Instead of a migration that copies data from one engine to the other, I wonder if it would be better to just let the truncation process happen naturally (reading from both engines in the meantime).

I have the single Store.Start store-level migration now and it's cleaner than I had expected.
Some issues arise with letting the transitioning process happen naturally via log truncations:

  • Now our transitions can take arbitrarily long: what if there aren't writes going through anymore and we have nothing to truncate? Changing our truncation logic for this special case is non-ideal
  • Now we have two transitioning modes: one before truncations have happened, where we're still effectively in the migration process, and one after, where we've completed the migration process (i.e. copied over raft data to the right engine). This second mode is where we maintain both engines, and once all nodes are at this stage, we're safe to run another rolling restart. I think this is more work than it needs to be when we can simply run a one-time store-level migration

As for the remaining migration related concerns, here's what I have distilled it to:

  • Removed the DISABLED mode; I was effectively using it as a feature flag when in fact it is essentially the older running version. It's now two states instead of three: we have transitioning as before, but enabled is implicit; if it's not transitioning, it's enabled.
  • Introduced an explicit --transitioning flag (documentation TODO) which runs the cockroach binary in transitioning mode. This is false by default, so we run the (faster) enabled mode unless it's explicitly specified.

The rolling restart process (for upgrades or version rollbacks) is now two steps and looks as follows:

  • We have an existing cluster running an old cockroach version
  • We perform a rolling restart with the new binaries, except that they're started with an additional --transitioning flag
  • Once all nodes are running the new binary with the --transitioning flag set, we'll perform another rolling restart without this flag

Gauging from a few kubernetes docs on rolling version changes, this seems to be the best thing to do for most deployments. It's scriptable too, as we can immediately run the second rolling restart after the success of the first (which indicates we have all nodes running in transitioning mode). The same applies to version rollbacks.
As for the next version upgrade (think 1.2), this 2-phase approach makes for an easier story (for us) if we structure it so that a user intending to move from 1.0 to 1.2 steps through all the intermediary major versions first. That makes way for the eventual deprecation of this flag as it pertains to the raft engine changes: we simply assume the previous version is always one with the raft data in the dedicated engine.

Also, could you expand on what you mean by the below snippet from your commit message? I take it that the cluster would have to start tracking which nodes are in which modes?

The explicit decision I've made here is to punt the global view of the cluster (what versions are all the nodes in the cluster running?) to the operator.

I'm pretty wary of trying to maintain two raft logs at once while in the transitioning state. We're going to need a lot of error handling code to try to keep them in sync if one rocksdb instance is having problems and the other isn't. It's not quite the same as a distributed consensus problem, but it's still a consensus problem.

Could you perhaps give an instance of how this could come about in practice? Will help me better understand this.


Review status: 2 of 34 files reviewed at latest revision, 27 unresolved discussions, some commit checks failed.


pkg/server/config.go, line 526 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

While you're here, add a comment on this nil dance and how it ties in with the defer at the beginning of the method.

Done.


pkg/server/config.go, line 406 at r2 (raw file):

I'd generally prefer to return an array of (engine, raftEngine) pairs instead of two parallel slices, unless it's too painful to make that refactoring.

I did try this, but it's very pervasive, so I skipped it for now.

The comment on this method needs to be updated with the new return values.
Done.


pkg/server/config.go, line 411 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

can't this be unconditional? If there aren't any raft engines, it would be a no-op.

Removed the DisabledRaftStorage mode, making this unconditional. It's not a no-op; we'll always have raft engines going forward.


pkg/server/config_test.go, line 45 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

can't this be unconditional? If there aren't any raft engines, it would be a no-op.

Removed the DisabledRaftStorage mode, making this unconditional. It's not a no-op; we'll always have raft engines going forward.


pkg/server/config_test.go, line 69 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

ditto.

Ditto.


pkg/server/node.go, line 387 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

s/engines/engine(s)/

Done.


pkg/server/server.go, line 683 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

👍

Done.

Removing DisabledRaftStorage mode made this unconditional. This makes way for this pattern throughout:

  • Create raft engines unconditionally
  • transitioningRaftStorage mode is our signal to write raft data (log entries, HardState and raft last index) to both engines (for interoperability)

We can't skip operations for performance here; we will need to commit the write batches on both engines in transitioning mode.


pkg/storage/client_test.go, line 1003 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Remember to remove this before merging.

woops, Done.


pkg/storage/replica.go, line 4251 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Wouldn't that be the common case? Might be missing something, but my intuition is that RaftData would only be nontrivial for special commands like ChangeReplica, Split, etc.
Clarifying comment would help.

It is the common case, yes, but Engine.NewBatch().Repr() is not nil; it's a zeroed byte slice. It's nil only when the upstream-of-raft WriteBatch was constructed by an older cockroach version, which is now accounted for. Also added a clarifying comment, PTAL.


pkg/storage/replica.go, line 4258 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

This is where Ben's suggestion looks like it would be more straightforward, but for it to work we'd need to make ApplyBatchRepr do the multiplexing into two underlying engines as well, which is more trouble than it's worth. Doing this explicitly looks like the way to go: I think it's fairly straightforward (since the condition on keys should be just matching a small set of prefixes) to make a command that takes a WriteBatch and extracts from it two write batches belonging to the regular and Raft engine, respectively.

I still prefer the explicit as-is approach as opposed to making a two-pronged storage.Engine implementation, reasons discussed in the top level comment.


pkg/storage/replica.go, line 4300 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Add a comment.

Welp, this was far too subtle to leave unexplained. Done.


pkg/storage/replica_command.go, line 3347 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

You could be more specific here by using (basically) makeRaftEngineKeyRanges). Not sure if it's worth it, but would be more explicit about what's actually being removed.

No, that's a good idea; this was a relic from before makeRaftEngineKeyRanges. It should be used here as well.


pkg/storage/replica_data_iter.go, line 56 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Only needs a RangeID.

Done, turns out I needed it as well.


pkg/storage/replica_raftstorage.go, line 64 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Moving the comment one line up seems spurious. The comment refers to the return statement (perhaps just move it there).

Done.


pkg/storage/replica_raftstorage.go, line 544 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Your usual stats comment is missing.

Done.


pkg/storage/replica_raftstorage.go, line 638 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Looks like this comment was left accidentally.

No, actually it's a repetition of the same comment above; it applies here given our usage of ClearIterRange as opposed to ClearRange. I've merged the two.


pkg/storage/replica_raftstorage.go, line 166 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

👍 Also seems like this change is the one to do it.

Done, was refreshingly easy to do so.


pkg/storage/replica_raftstorage.go, line 180 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

That would be good in order to facilitate faster log truncation, too. We currently shove all of that through Raft, but there's really no good reason afaict. The lease holder could just truncate its logs when it feels like it and let the followers know what to truncate to outside of Raft. If this PR were able to make the truncated index unreplicated (enough) to do this as a follow-up, that would be good; the transitioning state is a good opportunity for these kinds of things. Have to think it through, though.

Yup, I would like to keep this in a separate PR and possibly refactor evalTruncateLog, as mentioned in the RFC, in the same fell swoop.


pkg/storage/replica_raftstorage_test.go, line 116 at r1 (raw file):
Ok, this benchmark was not doing what I intended: even if we were using the same engine, we were doing extra work in creating a separate batch when only one would do. Also, in TRANSITIONING mode we were writing the same (raft data) keys twice to the same engine; really it should be tested only in ENABLED mode. To this end I pulled out a variant of this into the prior commit to get the overall perf changes.

Renamed to BenchmarkSerialPuts.

But this makes me think: when in the new mode, what do all the other benchmarks run with? Will they need updates as well to run with "realistic" Raft engines, too?
At the time of writing this is the only benchmark with an on-disk (this is what you meant by 'realistic', yes?) RocksDB instance. Everything else falls back to an in-memory implementation of the new engine, as before.


pkg/storage/replica_raftstorage_test.go, line 118 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

nit: s/vs/valSize/.

Done.


pkg/storage/store.go, line 137 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

nit: s/stored/stores/

s/log entries/such as log entries/ (anticipating that we might add lastIndex, and perhaps truncated state)

Done.


pkg/storage/store.go, line 178 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Grammar is off throughout this sentence.

doh, Done.


pkg/storage/store.go, line 1256 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Or if we were on a new version before, need to copy from dedicated engine to base engine.

hmm, PTAL here. I tried a couple of ideas to detect whether we're indeed upgrading or downgrading, but it got too hacky; I settled for a dumber implementation instead.

Update: ugh, this isn't completely correct; I haven't had a chance to update it. One thing I have not addressed is the cleanup of the /raft sub-dir if the migration process fails midway and new nodes running with --transitioning are reverted to the old binary. Naively this is a non-issue, but if this isn't cleaned up and the user tries the migration process again after a while, we have stale data in the raft engine. We need a way to detect whether we're indeed upgrading or downgrading, as we can have raft data on both engines (as in the case described above), but one overwrites the other. I think @spencerkimball's working on this now.


pkg/storage/store.go, line 2428 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

This isn't really right, but probably the most straightforward way to handle it for now. Should make sure it isn't getting lost, though. Add a TODO?

Done.


pkg/storage/store.go, line 158 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

What is your plan with the cluster setting, Ben? Nodes will need to rewrite on-disk data, which is hard to do once they're running. Would the idea be that nodes store the "last" cluster setting they've seen and apply it on restart? It's tricky to guarantee that all nodes have seen the latest setting, a problem you don't have with an env var. We could provide a helper - ./cockroach migrate raft.dedicated_storage='transitioning' - which would set the env var, wait for it to have propagated, and then prompt the user to do a rolling restart, after which they run ./cockroach migrate raft.dedicated_storage='enabled' and get another set of instructions and a round of rolling restart. Or what were you envisioning? I find the lack of explicit control troubling as it increases the chance of finding yourself in unscripted situations. Then again, setting env vars may not be the best solution for some deployments.

continuing this discussion up above.


pkg/storage/client_split_test.go, line 1217 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

FYI this test has been flaky on master recently, so it may not be your fault.

Removed in lieu of #16893.


Comments from Reviewable

BenchmarkSerialPuts benchmarks write performance for different payload
sizes on an on-disk RocksDB instance.
Implements cockroachdb#16361.

This is a breaking change. To see why, consider that prior to this we
stored all consensus data, in addition to all system metadata and user
level keys, in the same, single RocksDB instance. Here we introduce a
separate, dedicated instance for raft data (log entries and
HardState). Cockroach nodes simply restarting with these changes, unless
migrated properly, will fail to find the most recent raft log entries
and HardState data in the new RocksDB instance.

Also consider a cluster running mixed versions (nodes with dedicated
raft storage and nodes without): what would the communication between
nodes look like in light of proposer-evaluated KV? Currently we
propagate a storagebase.WriteBatch through raft containing a
serialized representation of a RocksDB write batch; this models the
changes to be made to the single underlying RocksDB instance. For log
truncation requests, where we delete log entries, and/or admin splits,
where we write the initial HardState for newly formed replicas, we need
to similarly propagate a write batch (through raft) addressing the new
RocksDB instance (if the recipient node is one with these changes) or
the original RocksDB instance (if the recipient node is one without
these changes). What if an older-version node is the raft leader and is
therefore the one upstream of raft, propagating storagebase.WriteBatches
with raft data changes but addressed to the original RocksDB instance?
What would rollbacks look like?

To this end we introduce two modes of operation,
transitioningRaftStorage and enabledRaftStorage (the latter is implicit
if we're not in transitioning mode). We've made it so that it is safe
to transition from an older cockroach version to
transitioningRaftStorage, from transitioningRaftStorage to enabled, and
the reverse for rollbacks. The transition from one mode to the next
takes place once all the nodes in the cluster are on the same previous
mode. The operation mode is set by an env var:
COCKROACH_DEDICATED_RAFT_STORAGE={DISABLED,TRANSITIONING,ENABLED}

- In the old version we use a single RocksDB instance for both raft
  and user-level KV data
- In transitioningRaftStorage mode we use both RocksDB instances (the
  raft-specific one and the regular one) for raft data interoperably.
  We use this mode to facilitate rolling upgrades
- In enabled mode we use the dedicated RocksDB instance for raft data.
  Raft log entries and the HardState are stored on this instance alone

Most of this commit is careful plumbing of an extra
engine.{Engine,Batch,Reader,Writer,ReadWriter} for whenever we need to
interact with the new RocksDB instance.
- Include lastIndex in raft engine
- Address backwards compatibility and preserve safety for clusters running
  multiple versions
- Introduce --transitioning flag
- Address review comments
@irfansharif irfansharif force-pushed the dedicated-raft-storage branch from f21ad2f to 20a3981 Compare July 10, 2017 17:56
@tbg
Copy link
Member

tbg commented Jul 10, 2017

Your change is effective at avoiding that people run clusters in the slow transitioning mode for extended periods of time. It's however not effective at making sure people don't screw up the upgrade (by forgetting to set the flag). Can we make it so that the new version won't start without --transitioning unless it has been started with it before? Should be easy to achieve via a marker write. Similarly, we should guard against someone skipping a minor version (or more). That is, each release writes a marker file and the subsequent one won't start unless the marker file is in a valid state. Perhaps we have something like this ready to use already. No need to do anything here until you've gotten more eyes on the change as-is.
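
A minimal sketch of the marker idea (file name and semantics are made up; nothing here is what the PR implements):

```go
package example

import (
	"fmt"
	"os"
	"path/filepath"
)

// checkTransitionMarker enforces the ordering sketched above: a store can
// only be opened in enabled mode if a previous run in transitioning mode
// left a marker file behind.
func checkTransitionMarker(storeDir string, transitioning bool) error {
	marker := filepath.Join(storeDir, "raft-storage-transitioned")
	if transitioning {
		// Record that this store has been through the transitioning phase.
		return os.WriteFile(marker, []byte("ok"), 0644)
	}
	if _, err := os.Stat(marker); os.IsNotExist(err) {
		return fmt.Errorf("store %s was never run with --transitioning; refusing to start in enabled mode", storeDir)
	} else if err != nil {
		return err
	}
	return nil
}
```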


Reviewed 30 of 34 files at r3, 2 of 3 files at r4.
Review status: all files reviewed at latest revision, 26 unresolved discussions, some commit checks failed.


pkg/server/config.go, line 406 at r2 (raw file):

Previously, irfansharif (irfan sharif) wrote…

I'd generally prefer to return an array of (engine, raftEngine) pairs instead of two parallel slices, unless it's too painful to make that refactoring.

I did try this and it's very pervasive, so skipped for now.

The comment on this method needs to be updated with the new return values.
Done.

In the meantime, named return values (used only for the names) could help here:

(raftEngines, dataEngines Engines, err error)


pkg/server/config.go, line 239 at r4 (raw file):

	PIDFile string

	// We define two modes of operation during migrations across subsequent major

We're talking about 1.1->1.2, which is a minor update, not major.


pkg/server/config.go, line 522 at r4 (raw file):

			engSize := (9 * sizeInBytes) / 10
			eng := engine.NewInMem(spec.Attributes, engSize)
			// TODO(irfansharif): For now we specify initialize the raft

This comment belongs above engSize := ....


pkg/server/server_test.go, line 440 at r4 (raw file):

	defer leaktest.AfterTest(t)()

	t.Skip("TODO(irfansharif): Remove this, local hostname problem")

reminder and also perhaps https://stackoverflow.com/questions/40027067/cannot-resolve-local-hostname-after-upgrading-to-macos-sierra helps.


pkg/storage/replica.go, line 3190 at r4 (raw file):

		// All of the entries are appended to distinct keys, returning a new
		// last index.
		thinEntries, err := r.maybeSideloadEntriesRaftMuLocked(ctx, rd.Entries)

Hope that rebase wasn't too obnoxious.


pkg/storage/replica.go, line 4397 at r4 (raw file):

		// subset of commands that modify raft data through raft (such as log
		// entries truncated via TruncateLogRequests). Naively one would assume
		// that in these cases WriteBatch.RaftData == nil but is in fact a

s/is in/it is in/


pkg/storage/replica.go, line 4406 at r4 (raw file):

		// key ranges, not just the delta introduced in the current WriteBatch.
		// We do this as we do not expect to be in transitioningRaftStorage
		// mode for very long.

We may not want users to run in this mode for very long, but they may have to stay in it for a while until they can take on the second round of restarts, and so we need to keep a certain performance baseline here. The code below looks expensive and quadratic: for each log entry we copy basically the whole log.


pkg/storage/replica.go, line 4443 at r4 (raw file):

			if err := raftBatch.ApplyBatchRepr(writeBatch.RaftData, false); err != nil {
				return enginepb.MVCCStats{}, roachpb.NewError(NewReplicaCorruptionError(
					errors.Wrap(err, "unable to apply WriteBatch.RaftData to raft eng")))

s/eng/raft engine/


pkg/storage/replica.go, line 4448 at r4 (raw file):

				if err := batch.ApplyBatchRepr(writeBatch.RaftData, false); err != nil {
					return enginepb.MVCCStats{}, roachpb.NewError(NewReplicaCorruptionError(
						errors.Wrap(err, "unable to apply WriteBatch.RaftData to eng")))

s/eng/data engine/


pkg/storage/replica_data_iter.go, line 60 at r4 (raw file):

// - the raft log entries associated for the given range descriptor
func makeRaftEngineKeyRanges(rangeID roachpb.RangeID) []keyRange {
	hskey := keys.RaftHardStateKey(rangeID)

while you're here, consider satisfying my OCD: hsKey, rlpKey, ...


pkg/storage/store.go, line 178 at r1 (raw file):
How about

Nodes restarted in this mode will be able to ...


pkg/storage/store.go, line 2428 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Done.

nit: empty comment line between comment and TODO.


pkg/storage/store.go, line 143 at r4 (raw file):

//   WriteBatch.Data) in addition to the new instance (see WriteBatch.RaftData)
// - Once all the nodes are running in this mode, each can be independently
//   set to run in the EnabledRaftStorage mode and thus operating optimally.

s/operating/operate/


pkg/storage/store.go, line 1207 at r4 (raw file):

	s.startedAt = now.WallTime

	s.mu.Lock()

Where's the Unlock()?


pkg/storage/store.go, line 1209 at r4 (raw file):

	s.mu.Lock()

	if err := s.runStoreLevelMigrationsLocked(ctx); err != nil {

Passing transitioningRaftStorage around from the config as much as we can would be better. Are you not doing that because of the tests? You could use a boolean singleton which is used in tests to influence which mode they're in.


pkg/storage/store.go, line 4024 at r4 (raw file):

// runStoreLevelMigrationsLocked runs store level migrations for the
// dedicated raft engine changes.

The method's name is more general than the description suggests.


pkg/storage/store.go, line 4046 at r4 (raw file):

		// to is the engine we'll be writing raft data to, from is the engine
		// we'll be writing raft data from.
		to, from := raftBatch, batch

Would just eliminate raftBatch and batch.


pkg/storage/store.go, line 4051 at r4 (raw file):

				func(desc roachpb.RangeDescriptor) (bool, error) {
					// Arbitrarily we check to see if the HardState is present
					// in the 'to' engine, and if so we swap with 'from'.

But they could both have a HardState, right? Also, should nuke the Raft state in the base engine if not transitioning (maybe after that I could believe that you only need to do this simple swap here).


Comments from Reviewable

@bdarnell
Copy link
Contributor

if we have m raft engines and n base engines

That's a good point I hadn't thought of before. I guess there's no reason to require a 1:1 mapping of these engines.

I agree with @tschottdorf that requiring the flag to be added atomically with the new binary rollout seems error prone. I think new nodes should start in "transitioning" mode until they get the signal that it's safe to use the new mode. I don't like using a command-line change and a rolling restart (which is still at least a little disruptive) to implement this signal. I think it should be something gossiped instead (probably a "min version" cluster setting that can be shared along with other migrations, as we discussed with @spencerkimball and @a-robinson this morning).

One advantage of the rolling restart is that it spreads the change out over time, whereas the cluster setting takes effect more or less immediately. We may want to have nodes add a random delay from the time they get the min-version gossip before it takes effect.
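
The jitter could be as simple as this sketch (the duration and names are arbitrary):

```go
package example

import (
	"math/rand"
	"time"
)

// applyWithJitter delays acting on a newly gossiped min-version setting by a
// random amount (up to maxJitter, which must be > 0) so the whole cluster
// doesn't flip modes at the same instant.
func applyWithJitter(maxJitter time.Duration, apply func()) *time.Timer {
	delay := time.Duration(rand.Int63n(int64(maxJitter)))
	return time.AfterFunc(delay, apply)
}
```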


Reviewed 19 of 34 files at r3.
Review status: all files reviewed at latest revision, 24 unresolved discussions, some commit checks failed.


Comments from Reviewable

@tbg
Copy link
Member

tbg commented Jul 11, 2017

The idea of gossipping a min version can work (not in a failsafe way, of course). I'm not convinced that all of that should be done as a side effect of this PR though. Can we decouple this? Make an issue, discuss a bit, remodel what's here with the proposed general mechanism.


Review status: all files reviewed at latest revision, 24 unresolved discussions, some commit checks failed.


Comments from Reviewable

@irfansharif
Copy link
Contributor Author

Can we decouple this? Make an issue, discuss a bit, remodel what's here with the proposed general mechanism.

#16977.

Copy link
Contributor

@a-robinson a-robinson left a comment


I'm pretty wary of trying to maintain two raft logs at once while in the transitioning state. We're going to need a lot of error handling code to try to keep them in sync if one rocksdb instance is having problems and the other isn't. It's not quite the same as a distributed consensus problem, but it's still a consensus problem.

Could you perhaps give an instance of how this could come about in practice? Will help me better understand this.

If you're writing to two different engines, what if one of the writes fails (potentially repeatedly, if the disk is in really bad shape) and the other succeeds? Or the process crashes between the two writes? Now your two raft logs/states are out of sync and you need a way to reconcile them. In this context that could probably just mean trusting the engine whose raft log has more recent entries, but it requires a bunch of very sensitive extra code.

@tbg
Copy link
Member

tbg commented Jul 11, 2017

If you're writing to two different engines, what if one of the writes fails (potentially repeatedly, if the disk is in really bad shape) and the other succeeds? Or the process crashes between the two writes? Now your two raft logs/states are out of sync and you need a way to reconcile them. In this context that could probably just mean trusting the engine whose raft log has more recent entries, but it requires a bunch of very sensitive extra code.

I think it's clear that this mode is fragile, but perhaps the "sensitive code" is the one you need for startup in transitioning mode anyway -- you need to harmonize both storages.

BTW, I think we have enough commentary that this is a dicey problem -- what I haven't seen is an alternate proposal. I don't have one.

@a-robinson
Copy link
Contributor

I'd been assuming that the alternative was a hard switch from one engine to the other, with a single copy step (that could be done in either direction for upgrades/downgrades).
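
A sketch of what that single copy step could look like, with placeholder engine and key-predicate types; src and dst are swapped by the caller, so the same code serves upgrades and downgrades:

```go
package example

// kv is a stand-in for an engine that exposes ordered iteration and writes.
type kv interface {
	Scan(fn func(key, value []byte) error) error
	Put(key, value []byte) error
	Delete(key []byte) error
}

// migrateRaftData copies every raft-engine key (HardState, raft log and,
// with this PR, last index) from src to dst and then removes it from src.
func migrateRaftData(src, dst kv, isRaftKey func(key []byte) bool) error {
	var moved [][]byte
	if err := src.Scan(func(key, value []byte) error {
		if !isRaftKey(key) {
			return nil
		}
		if err := dst.Put(key, value); err != nil {
			return err
		}
		moved = append(moved, append([]byte(nil), key...))
		return nil
	}); err != nil {
		return err
	}
	// Delete after the scan completes so we never mutate src mid-iteration.
	for _, key := range moved {
		if err := src.Delete(key); err != nil {
			return err
		}
	}
	return nil
}
```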

@tbg
Copy link
Member

tbg commented Jul 11, 2017

So when upgrading from 1.0 to 1.1, the 1.1 version would use the old engines only (transitioning), then when you enable transitioning mode (as per RFC), it hard copies to the new ones, deletes from the old ones, and uses only the new ones. When restarted without transitioning (however that is signaled), it does it the opposite way.

That seems straightforward and better, thanks for spelling it out for me!

@irfansharif
Copy link
Contributor Author

irfansharif commented Jul 11, 2017

The single hard switch step for upgrades (if I understand correctly) looks as
follows:

  • A first phase rolling restart: the new binaries use the base engines only, as
    before.
  • When signalled (TBD, gated on rfc: version migration for backwards incompatible functionality #16977), hard copy raft data to the new engine
    and delete it from the base, use only the new one going forward.

We do the opposite for rollbacks in that we use only the raft engine in the
first phase, and when signalled (again, TBD) hard copy the raft data over to
the base engine and use just that going forward.

This approach unfortunately does not work for the way things are currently
because of the way we handle log truncations (among other things). Log
truncations happen upstream of raft and go through the proposer evaluated
machinery.

Let's take a single log truncation request: the WriteBatch we populate and
send through raft lists out all the log entries to delete (this is
prohibitively expensive as is, all the more reason to move it downstream of
raft). Consider what would happen if we have a replica located on a node in
the second phase (in that it's operating exclusively on the new raft engine)
with its raft leader being a replica located on a node in the first phase
(single base engine). The WriteBatch received downstream is addressed to the
single base engine yet has changes to be applied to the raft engine. We'd
have to unpack this WriteBatch and match keys to determine what writes go
where. Given that we use the WriteBatch for all requests, this key matching
would need to take place for all requests. To repeat the obvious, we cannot
ask the node upstream to separately encode raft batch changes as it's only
operating on the single engine. This may be workable
(ReplicatedEvalResult.RaftLogDelta != nil can be used downstream of raft to
detect whether log truncations took place, and then match keys and address
the two engines as appropriate) but is still non-ideal. As for other 'raft
data' that's written upstream of raft, the same applies to lastIndex and the
HardState written out when setting up the RHS of splits.

A better approach would be to move log truncations downstream of raft; there
are far fewer interoperability concerns in doing so, in addition to the
performance gain to be had in not plumbing log truncations through raft.
The same needs to be done for writing out the initial HardState during splits
(not just to avoid plumbing raft engine changes through raft, but also for
correctness reasons). The interactions with TruncatedState are slightly
unclear (to me at least), but replicas could consult it and delete the log
entries after the fact (TruncatedState doesn't mean the entries have
actually been deleted, but they will be).

(excuse the verbosity, primarily documenting this for myself).

tbg added a commit to tbg/cockroach that referenced this pull request Jul 11, 2017
Sending log truncations through Raft is inefficient: the Raft log is not itself
part of the replicated state. Instead, we only replicate the TruncatedState and,
as a side effect, ClearRange() the affected key range.

This is an individual performance optimization whose impact we should measure;
anecdotally it always looked like we were doing a lot of work for truncations
during a write-heavy workload; this should alleviate this somewhat.

It also removes one migration concern for cockroachdb#16809, see
cockroachdb#16809 (comment).
@petermattis
Copy link
Collaborator

Review status: all files reviewed at latest revision, 25 unresolved discussions, some commit checks failed.


pkg/server/config.go, line 267 at r4 (raw file):

	//
	// TransitioningMode is the switch to indicate that we're running in
	// transitioning mode.

I haven't fully grokked or read this PR, but a few thoughts (which you might already have had, apologies but it seemed better to write these down than to assume I'll find time to read the PR soon):

  • A node which has started storing data in the Raft-log specific engine should bump storage/engine/version.go:versionCurrent for the normal engine. This will prevent accidental downgrades to a binary which doesn't support a separate Raft-log engine.
  • Can we transition from the single engine to separate Raft-log engine on a per-Replica basis? I'm imagining something where for each replica we maintain a per-replica key which indicates that the separate Raft-log engine is used. The absence of the key indicates that the single engine is to be used. While transitioning, we have a background process which picks up a replica, locks it appropriately and moves the Raft state. A downgrade mechanism could run that in reverse.
  • I was thinking that usage of the separate Raft-log engine would default to true for new clusters. For existing clusters we would have to explicitly enable it via a cluster setting. A future release could default to enabling the setting by default during upgrade.

Comments from Reviewable

@bdarnell
Copy link
Contributor

Review status: all files reviewed at latest revision, 23 unresolved discussions, some commit checks failed.


pkg/storage/replica.go, line 4406 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

We may not want users to run in this mode for very long, but they may have to stay in it for a while until they can take on the second round of restarts, and so we need to keep a certain performance baseline here. The code below looks expensive and quadratic: for each log entry we copy basically the whole log.

Yeah, double-writing like this is expensive and introduces consistency concerns between the two engines. I think we need to use double reads instead: The setting controls which engine raft values are written to, and when we read, we read from both and combine them. (The cost of this can be mitigated by a few fields on the Replica to cache the range of log indexes contained in each engine).
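
A minimal sketch of this double-read approach, assuming hypothetical types (`logReader`, `entry`, `cachedBounds` are stand-ins for whatever the Replica would actually hold):

```go
package doublereadsketch

// entry is a stand-in for a raft log entry.
type entry struct {
	Index uint64
	Data  []byte
}

// logReader is the hypothetical read side of an engine: it returns whichever
// entries in [lo, hi) it holds.
type logReader interface {
	Entries(lo, hi uint64) []entry
}

// cachedBounds is the per-Replica cache suggested above: the half-open index
// interval known to be present in an engine. Zero bounds mean "unknown".
type cachedBounds struct{ lo, hi uint64 }

func (b cachedBounds) covers(lo, hi uint64) bool {
	return b.lo != 0 && lo >= b.lo && hi <= b.hi
}

// readEntries serves [lo, hi) from a single engine when the cache says that
// engine has the whole span, and otherwise reads both and merges, letting the
// dedicated raft engine win on any overlapping index.
func readEntries(base, raft logReader, baseBounds, raftBounds cachedBounds, lo, hi uint64) []entry {
	if raftBounds.covers(lo, hi) {
		return raft.Entries(lo, hi)
	}
	if baseBounds.covers(lo, hi) {
		return base.Entries(lo, hi)
	}
	byIndex := make(map[uint64]entry)
	for _, e := range base.Entries(lo, hi) {
		byIndex[e.Index] = e
	}
	for _, e := range raft.Entries(lo, hi) {
		byIndex[e.Index] = e
	}
	out := make([]entry, 0, hi-lo)
	for i := lo; i < hi; i++ {
		if e, ok := byIndex[i]; ok {
			out = append(out, e)
		}
	}
	return out
}
```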



tbg added a commit to tbg/cockroach that referenced this pull request Jul 17, 2017
Since the move to proposer-evaluated KV, we were potentially clobbering the
HardState on splits as we accidentally moved HardState synthesis upstream of
Raft as well. This change moves it downstream again.

Though not strictly necessary, writing lastIndex was moved as well. This is
cosmetic, though it aids @irfansharif's PR cockroachdb#16809, which moves lastIndex to
the Raft engine. After this PR, neither HardState nor last index keys are
added to the WriteBatch, so that pre-cockroachdb#16993 `TruncateLog` is the only remaining
command that does so (and it, too, won't keep doing that for long).

Note that there is no migration concern.

Fixes cockroachdb#16749.
tbg added a commit to tbg/cockroach that referenced this pull request Jul 17, 2017
Sending log truncations through Raft is inefficient: the Raft log is not itself
part of the replicated state. Instead, we only replicate the TruncatedState and,
as a side effect, ClearRange() the affected key range.

This is an individual performance optimization whose impact we should measure;
anecdotally it always looked like we were doing a lot of work for truncations
during a write-heavy workload; this should alleviate that somewhat. As
explained above, the change isn't made for performance at this point, though.

It also removes one migration concern for cockroachdb#16809, see
cockroachdb#16809 (comment).

We'll need to migrate this. It's straightforward with the in-flight PR cockroachdb#16977.

- we're moving logic downstream of Raft. However, we can easily migrate it
  upstream again, without a real migration, though I don't think that's going
  to happen.
- the big upshot is hopefully a large reduction in complexity for
  @irfansharif's PR: log truncation is one of the odd cases that requires
  a RaftWriteBatch. cockroachdb#16749 is the only other one, and there the (correct)
  solution also involves going downstream of Raft for a Raft-related write. So,
  after solving both of those, I think RaftWriteBatch can go? cc @irfansharif
- as @petermattis pointed out, after @irfansharif's change, we should be able
  to not sync the base engine on truncation changes but do it only as we
  actually clear the log entries (which can be delayed as we see fit). So for
  1000 log truncations across many ranges, we'll only have to sync once if
  that's how we set it up.
tbg added a commit to tbg/cockroach that referenced this pull request Jul 17, 2017
Since the move to proposer-evaluated KV, we were potentially clobbering the
HardState on splits as we accidentally moved HardState synthesis upstream of
Raft as well. This change moves it downstream again.

Though not strictly necessary, writing lastIndex was moved as well. This is
cosmetic, though it aids @irfansharif's PR cockroachdb#16809, which moves lastIndex to
the Raft engine. After this PR, neither HardState nor last index keys are
added to the WriteBatch, so that pre-cockroachdb#16993 `TruncateLog` is the only remaining
command that does so (and it, too, won't keep doing that for long).

Migration concerns: a lease holder running the new version will propose splits
that don't propose the HardState to Raft. A follower running the old version
will not write the HardState downstream of Raft. In combination, the HardState
would never get written, and would thus be incompatible with the
TruncatedState. Thus, while 1.0 might be around, we're still sending the
potentially dangerous HardState.

Fixes cockroachdb#16749.
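For illustration only, a rough sketch of what synthesizing the HardState downstream of Raft might look like; the types and rules here are simplifying assumptions, not the actual implementation:

```go
package hardstatesketch

import "errors"

// hardState mirrors the fields of raft's HardState; a local copy keeps the
// sketch self-contained.
type hardState struct {
	Term   uint64
	Vote   uint64
	Commit uint64
}

// synthesizeHardState is a rough guess at what each replica would write
// locally for a split's new right-hand-side range: the Commit index must
// cover what the replica has already applied, and a vote already recorded
// must not be thrown away. The real rules are more subtle than this.
func synthesizeHardState(existing hardState, appliedIndex, term uint64) (hardState, error) {
	if existing.Commit > appliedIndex {
		// The on-disk HardState claims more committed entries than the replica
		// has applied; that shouldn't happen for a freshly split range.
		return hardState{}, errors.New("existing HardState commit is ahead of the applied index")
	}
	hs := hardState{
		Term:   term,
		Vote:   existing.Vote, // preserve a vote that was already cast
		Commit: appliedIndex,
	}
	if existing.Term > hs.Term {
		hs.Term = existing.Term // never move the term backwards
	}
	return hs, nil
}
```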
@knz
Contributor

knz commented Aug 25, 2017

reminder: don't forget to rebase if you plan to continue work on this, and pick up the new RFC sections + naming convention.

@irfansharif irfansharif deleted the dedicated-raft-storage branch October 22, 2018 22:14