[wip,dnm] storage: introduced dedicated raft storage #16809
Conversation
irfansharif
commented
Jun 30, 2017
edited
note: The point above about having older versions upstream of raft is as yet unresolved; I have documented a solution. There's no RocksDB tuning here as yet, will follow on after this.
8e21b33 to 34a75d0
34a75d0 to 1fc8479
Yes, the acceptance tests (specifically the ones in

I wonder if it would be better to encapsulate most of this at the Engine level: instead of two Engines, we could have a new implementation of the Engine interface that wraps two other Engines and switches between them based on the keys it is given. (A rough sketch of this idea follows at the end of this comment.)

I had been envisioning the transition as two states instead of three: whenever a new binary starts up, it is in the "transitioning" state, and it is still possible to roll back to the old version. When the admin changes to the "enabled" state, it is no longer possible to roll back to an older version, although it should be possible to roll back to the "transitioning" state of the new binary. (The old binary is effectively the "disabled" state.) The three-state version is nice if we can do it (the fewer restrictions we place on rollbacks, the better), but I worry that it will be tricky to implement and under-tested (once you've transitioned to the "disabled" state, when does it become safe to do the downgrade?).

Store.Start is a good place to run a migration if a restart is required to make the change. We'd prefer not to do that (although it's not out of the question). Instead of a migration that copies data from one engine to the other, I wonder if it would be better to just let the truncation process happen naturally (reading from both engines in the meantime).

Reviewed 29 of 29 files at r1, 1 of 1 files at r2.

pkg/server/config.go, line 406 at r2 (raw file):
I'd generally prefer to return an array of (engine, raftEngine) pairs instead of two parallel slices, unless it's too painful to make that refactoring. The comment on this method needs to be updated with the new return values. pkg/server/config.go, line 411 at r2 (raw file):
This repeated expression should perhaps be encapsulated in some global function. pkg/server/server.go, line 683 at r2 (raw file):
Refer to the configuration variables as little as possible. Here, for example, you should just close As a general rule I'd suggest creating the raft engine (and maybe batches) unconditionally, and the current transition state would only be used to A) decide which batch to write to and B) to skip operations that would impact the overall performance (like committing the write batch). pkg/storage/client_split_test.go, line 1217 at r2 (raw file):
FYI this test has been flaky on master recently, so it may not be your fault. pkg/storage/client_test.go, line 1003 at r2 (raw file):
Remember to remove this before merging. pkg/storage/replica_raftstorage.go, line 166 at r2 (raw file):
Yes, it sounds like a good idea to move the last index to the raft engine. pkg/storage/replica_raftstorage.go, line 180 at r2 (raw file):
I'm less sure about this one but it's probably a good idea to move TruncatedState too. pkg/storage/store.go, line 158 at r2 (raw file):
Yes, this should be a cluster setting (which means that the global variables will have to be changed to functions). Comments from Reviewable |
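For illustration, a minimal sketch of the Engine-wrapper idea suggested above: a single Engine implementation that owns two engines and routes each operation by key. The interface, key layout, and prefix check are placeholders for this sketch, not CockroachDB's actual engine.Engine API.

```go
package enginesketch

import "bytes"

// kvEngine is a stand-in for the real Engine interface.
type kvEngine interface {
	Put(key, value []byte) error
	Get(key []byte) ([]byte, error)
}

// raftKeyPrefix is a hypothetical prefix marking Raft log / HardState keys.
var raftKeyPrefix = []byte("/raftlog/")

// splitEngine wraps two engines and picks one per key, so callers keep
// seeing a single Engine.
type splitEngine struct {
	base kvEngine // user-level and system KV data
	raft kvEngine // Raft log entries and HardState
}

func (s *splitEngine) pick(key []byte) kvEngine {
	if bytes.HasPrefix(key, raftKeyPrefix) {
		return s.raft
	}
	return s.base
}

func (s *splitEngine) Put(key, value []byte) error    { return s.pick(key).Put(key, value) }
func (s *splitEngine) Get(key []byte) ([]byte, error) { return s.pick(key).Get(key) }
```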
Can't we make this downgrade path safe indefinitely? When a binary starts up in
Perhaps instead, we should require the explicit TRANSITIONING mode flag so that the extra step removes the env var (that will soon become irrelevant) again. Downgrades work the opposite - set transitioning flag, rolling restart, copy old binaries, rolling restart. Does seem testable enough to me with a handful of acceptance tests. Or am I completely misunderstanding how the up/downgrade process is envisioned?
Maybe my opinion will change as I actually review the code, but on an abstract level I appreciate the explicitness the as-is approach brings. Edit: OK, reviewed the change and the impression remains.

Reviewed 29 of 29 files at r1.

pkg/server/config.go, line 526 at r1 (raw file):
While you're here, add a comment on this pkg/server/config.go, line 411 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
can't this be unconditional? If there aren't any raft engines, it would be a no-op. pkg/server/config_test.go, line 45 at r1 (raw file):
can't this be unconditional? If there aren't any raft engines, it would be a no-op. pkg/server/config_test.go, line 69 at r1 (raw file):
ditto. pkg/server/node.go, line 387 at r1 (raw file):
s/engines/engine(s)/ pkg/server/server.go, line 683 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
👍 pkg/storage/replica.go, line 4251 at r1 (raw file):
Wouldn't that be the common case? Might be missing something, but my intuition is that RaftData would only be nontrivial for special commands like ChangeReplica, Split, etc. pkg/storage/replica.go, line 4258 at r1 (raw file):
This is where Ben's suggestion looks like it would be more straightforward, but for it to work we'd need to make pkg/storage/replica.go, line 4300 at r1 (raw file):
Add a comment. pkg/storage/replica_command.go, line 3347 at r1 (raw file):
You could be more specific here by using (basically) pkg/storage/replica_data_iter.go, line 56 at r1 (raw file):
Only needs a pkg/storage/replica_raftstorage.go, line 64 at r1 (raw file):
Moving the comment one line up seems spurious. The comment refers to the return statement (perhaps just move it there). pkg/storage/replica_raftstorage.go, line 544 at r1 (raw file):
Your usual stats comment is missing. pkg/storage/replica_raftstorage.go, line 638 at r1 (raw file):
Looks like this comment was left accidentally. pkg/storage/replica_raftstorage.go, line 166 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
👍 Also seems like this change is the one to do it. pkg/storage/replica_raftstorage.go, line 180 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
That would be good in order to facilitate faster log truncation, too. We currently shove all of that through Raft, but there's really no good reason afaict. The lease holder could just truncate its logs when it feels like it and let the followers know what to truncate to outside of Raft. If this PR were able to make the truncated index unreplicated (enough) to do this as a follow-up, that would be good; the transitioning state is a good opportunity for these kinds of things. Have to think it through, though. pkg/storage/replica_raftstorage_test.go, line 116 at r1 (raw file):
The benchmark name is pretty generic. Wouldn't hurt to mention you're doing serial puts; it's arguable whether this benchmark is really specific to raft storage. But this makes me think: when in the new mode, what do all the other benchmarks run with? Will they need updates as well to run with "realistic" Raft engines, too? pkg/storage/replica_raftstorage_test.go, line 118 at r1 (raw file):
nit: pkg/storage/store.go, line 137 at r1 (raw file):
nit: s/stored/stores/ s/log entries/such as log entries/ (anticipating that we might add lastIndex, and perhaps truncated state) pkg/storage/store.go, line 178 at r1 (raw file):
Grammar is off throughout this sentence. pkg/storage/store.go, line 1256 at r1 (raw file):
Or if we were on a new version before, need to copy from dedicated engine to base engine. pkg/storage/store.go, line 2428 at r1 (raw file):
This isn't really right, but probably the most straightforward way to handle it for now. Should make sure it isn't getting lost, though. Add a TODO? pkg/storage/store.go, line 158 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
What is your plan with the cluster setting, Ben? Nodes will need to rewrite on-disk data which is hard to do once they're running. Would the idea be that nodes store the "last" cluster setting they've seen and apply it on restart? It's tricky to guarantee that all nodes have seen the latest setting, a problem you don't have with an env var. We could provide a helper - pkg/storage/engine/rocksdb.go, line 324 at r1 (raw file):
I'm always wary of this pattern. If you pass in a Comments from Reviewable |
I had been thinking along the same lines as @bdarnell, expecting two states rather than three. Some of my thoughts:
Also, could you expand on what you mean by the below snippet from your commit message? I take it that the cluster would have to start tracking which nodes are in which modes?
1fc8479 to f21ad2f
I still prefer the explicit as-is approach as opposed to making a two-pronged
I have the single
As for the remaining migration-related concerns, here's what I have distilled it to:
The rolling restarts (upgrades or version rollbacks) are now a two-step process that looks as follows:
Judging from a few Kubernetes docs on rolling version changes, this seems to be the best approach for most deployments. It's scriptable too, as we can immediately run the second rolling restart after the first succeeds (indicating all nodes are running in transitioning mode). The same applies to version rollbacks.
The explicit decision I've made here is to punt off the global view of the cluster (what versions are all the nodes in the cluster running?) to the operator.
Could you perhaps give an instance of how this could come about in practice? Will help me better understand this.

Review status: 2 of 34 files reviewed at latest revision, 27 unresolved discussions, some commit checks failed.

pkg/server/config.go, line 526 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/server/config.go, line 406 at r2 (raw file):
I did try this and it's very pervasive, so skipped for now.
pkg/server/config.go, line 411 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Removed pkg/server/config_test.go, line 45 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Removed pkg/server/config_test.go, line 69 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Ditto. pkg/server/node.go, line 387 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/server/server.go, line 683 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. Removing
We can't skip operations for performance here; we will need to commit the write batches on both engines in transitioning mode. pkg/storage/client_test.go, line 1003 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
woops, Done. pkg/storage/replica.go, line 4251 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
It is the common case, yes, but pkg/storage/replica.go, line 4258 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
I still prefer the explicit as-is approach as opposed to making a two-pronged pkg/storage/replica.go, line 4300 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Welp, this was far too subtle for me to have left unexplained. Done. pkg/storage/replica_command.go, line 3347 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
no that's a good idea, this was a relic of pre pkg/storage/replica_data_iter.go, line 56 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done, turns out I needed it as well. pkg/storage/replica_raftstorage.go, line 64 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/replica_raftstorage.go, line 544 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/replica_raftstorage.go, line 638 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
no actually, though a repetition of the same comment above it applies here given our usage of pkg/storage/replica_raftstorage.go, line 166 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done, was refreshingly easy to do so. pkg/storage/replica_raftstorage.go, line 180 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
yup, would like to keep this in a separate PR and possibly refactor pkg/storage/replica_raftstorage_test.go, line 116 at r1 (raw file): Renamed to
pkg/storage/replica_raftstorage_test.go, line 118 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/store.go, line 137 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/store.go, line 178 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
doh, Done. pkg/storage/store.go, line 1256 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
hmm, PTAL here. I tried a couple of ideas here to detect if we're indeed upgrading or downgrading, but it got too hacky; settled for a dumber implementation instead. Update: ugh, this isn't completely correct, haven't had a chance to update this. One thing I have not addressed is the clean up of the pkg/storage/store.go, line 2428 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. pkg/storage/store.go, line 158 at r2 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
continuing this discussion up above. pkg/storage/client_split_test.go, line 1217 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Removed in lieu of #16893. Comments from Reviewable |
BenchmarkSerialPuts benchmarks write performance for different payload sizes on an on-disk RocksDB instance.
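A rough sketch of the shape such a benchmark could take, with one sub-benchmark per payload size and keys written serially. The put helper below is a no-op stand-in so the example stays self-contained; the real benchmark writes to an on-disk RocksDB instance.

```go
package storage

import (
	"encoding/binary"
	"fmt"
	"testing"
)

// put is a stand-in for a single write against the engine under test.
func put(key, value []byte) error { return nil }

func BenchmarkSerialPuts(b *testing.B) {
	for _, valueSize := range []int{1 << 10, 1 << 14, 1 << 18} {
		b.Run(fmt.Sprintf("valueSize=%d", valueSize), func(b *testing.B) {
			value := make([]byte, valueSize)
			key := make([]byte, 8)
			b.SetBytes(int64(valueSize))
			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				// Serial puts: one monotonically increasing key per iteration.
				binary.BigEndian.PutUint64(key, uint64(i))
				if err := put(key, value); err != nil {
					b.Fatal(err)
				}
			}
		})
	}
}
```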
Implements cockroachdb#16361.

This is a breaking change. To see why, consider that prior to this we stored all consensus data, in addition to all system metadata and user-level keys, in the same, single RocksDB instance. Here we introduce a separate, dedicated instance for raft data (log entries and HardState). Cockroach nodes simply restarting with these changes, unless migrated properly, will fail to find the most recent raft log entries and HardState data in the new RocksDB instance.

Also consider a cluster running mixed versions (nodes with dedicated raft storage and nodes without): what would the communication between nodes look like in light of proposer-evaluated KV? Currently we propagate a storagebase.WriteBatch through raft containing a serialized representation of a RocksDB write batch; this models the changes to be made to the single underlying RocksDB instance. For log truncation requests where we delete log entries and/or admin splits where we write initial HardState for newly formed replicas, we need to similarly propagate a write batch (through raft) addressing the new RocksDB instance (if the recipient node is one with these changes) or the original RocksDB instance (if the recipient node is one without these changes). What if an older version node is the raft leader and is therefore the one upstream of raft, propagating storagebase.WriteBatches with raft data changes but addressed to the original RocksDB instance? What would rollbacks look like?

To this end we introduce three modes of operation: the old single-instance mode, transitioningRaftStorage, and enabledRaftStorage (the latter is implicit if we're not in transitioning mode). We've made it so that it is safe to transition from an older cockroach version to transitioningRaftStorage, from transitioningRaftStorage to enabled, and the reverse for rollbacks. Transitions from one mode to the next take place only when all the nodes in the cluster are on the same previous mode. The operation mode is set by an env var COCKROACH_DEDICATED_RAFT_STORAGE={DISABLED,TRANSITIONING,ENABLED}:

- In the old version we use a single RocksDB instance for both raft and user-level KV data
- In transitioningRaftStorage mode we use both RocksDB instances for raft data interoperably, the raft-specific and the regular instance. We use this mode to facilitate rolling upgrades
- In enabled mode we use the dedicated RocksDB instance for raft data. Raft log entries and the HardState are stored on this instance alone

Most of this commit is careful plumbing of an extra engine.{Engine,Batch,Reader,Writer,ReadWriter} for whenever we need to interact with the new RocksDB instance.
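A minimal sketch of how the operation mode could be derived from the COCKROACH_DEDICATED_RAFT_STORAGE variable described above. The function name, the constant names, and the choice of default when the variable is unset are assumptions for this sketch, not the PR's actual code.

```go
package storage

import (
	"fmt"
	"os"
)

type raftStorageMode int

const (
	disabledRaftStorage     raftStorageMode = iota // single RocksDB instance (old behavior)
	transitioningRaftStorage                       // both instances used interoperably
	enabledRaftStorage                             // dedicated instance holds raft data alone
)

// raftStorageModeFromEnv maps the env var to a mode. Treating an unset
// variable as DISABLED is an assumption made for this sketch.
func raftStorageModeFromEnv() (raftStorageMode, error) {
	switch v := os.Getenv("COCKROACH_DEDICATED_RAFT_STORAGE"); v {
	case "", "DISABLED":
		return disabledRaftStorage, nil
	case "TRANSITIONING":
		return transitioningRaftStorage, nil
	case "ENABLED":
		return enabledRaftStorage, nil
	default:
		return 0, fmt.Errorf("invalid COCKROACH_DEDICATED_RAFT_STORAGE value: %q", v)
	}
}
```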
- Include lastIndex in raft engine
- Address backwards compatibility and preserve safety for clusters running multiple versions
- Introduce --transitioning flag
- Address review comments
f21ad2f to 20a3981
Your change is effective at keeping people from running clusters in the slow transitioning mode for extended periods of time. However, it's not effective at making sure people don't screw up the upgrade (by forgetting to set the flag). Can we make it so that the new version won't start without

Reviewed 30 of 34 files at r3, 2 of 3 files at r4.

pkg/server/config.go, line 406 at r2 (raw file): Previously, irfansharif (irfan sharif) wrote…
In the meantime, named return values (used only for the names) could help here:
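A sketch of what that could look like: the two parallel slices stay for now, but the named return values document which slice is which. The types and the method name below are placeholders, not the PR's actual signature.

```go
package server

// Engine and Config stand in for the real types.
type Engine interface{}
type Config struct{}

// createEngines keeps the parallel-slice return for now; the named return
// values exist purely so the signature documents which slice holds the base
// engines and which holds the dedicated Raft engines.
func (cfg Config) createEngines() (engines, raftEngines []Engine, _ error) {
	// Construct both slices, keeping engines[i] paired with raftEngines[i].
	return engines, raftEngines, nil
}
```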
pkg/server/config.go, line 239 at r4 (raw file):
We're talking about 1.1->1.2, which is a minor update, not major. pkg/server/config.go, line 522 at r4 (raw file):
This comment belongs above pkg/server/server_test.go, line 440 at r4 (raw file):
reminder and also perhaps https://stackoverflow.com/questions/40027067/cannot-resolve-local-hostname-after-upgrading-to-macos-sierra helps. pkg/storage/replica.go, line 3190 at r4 (raw file):
Hope that rebase wasn't too obnoxious. pkg/storage/replica.go, line 4397 at r4 (raw file):
s/is in/it is in/ pkg/storage/replica.go, line 4406 at r4 (raw file):
We may not want users to run in this mode for very long, but they may have to stay in it for a while until they can take on the second round of restarts, and so we need to keep a certain performance baseline here. The code below looks expensive and quadratic: for each log entry we copy basically the whole log. pkg/storage/replica.go, line 4443 at r4 (raw file):
s/eng/raft engine/ pkg/storage/replica.go, line 4448 at r4 (raw file):
s/eng/data engine/ pkg/storage/replica_data_iter.go, line 60 at r4 (raw file):
while you're here, consider satisfying my OCD: pkg/storage/store.go, line 178 at r1 (raw file):
pkg/storage/store.go, line 2428 at r1 (raw file): Previously, irfansharif (irfan sharif) wrote…
nit: empty comment line between comment and TODO. pkg/storage/store.go, line 143 at r4 (raw file):
s/operating/operate/ pkg/storage/store.go, line 1207 at r4 (raw file):
Where's the pkg/storage/store.go, line 1209 at r4 (raw file):
Passing pkg/storage/store.go, line 4024 at r4 (raw file):
The method's name is more general than the description suggests. pkg/storage/store.go, line 4046 at r4 (raw file):
Would just eliminate pkg/storage/store.go, line 4051 at r4 (raw file):
But they could both have a Comments from Reviewable |
That's a good point I hadn't thought of before. I guess there's no reason to require a 1:1 mapping of these engines.

I agree with @tschottdorf that requiring the flag to be added atomically with the new binary rollout seems error-prone. I think new nodes should start in "transitioning" mode until they get the signal that it's safe to use the new mode. I don't like using a command-line change and a rolling restart (which is still at least a little disruptive) to implement this signal. I think it should be something gossiped instead (probably a "min version" cluster setting that can be shared along with other migrations, as we discussed with @spencerkimball and @a-robinson this morning).

One advantage of the rolling restart is that it spreads the change out over time, whereas the cluster setting takes effect more or less immediately. We may want to have nodes add a random delay from the time they get the min-version gossip before it takes effect.

Reviewed 19 of 34 files at r3.

Comments from Reviewable
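A tiny sketch of the random-delay idea above: once a node learns from gossip that the cluster-wide minimum version permits the new mode, it waits a random interval before flipping over, so the switch doesn't hit every node at the same instant. The callback name is hypothetical.

```go
package storage

import (
	"math/rand"
	"time"
)

// onMinVersionGossip would be invoked when gossip reports that every node
// runs a version that understands the dedicated Raft engine. Rather than
// switching immediately, schedule the switch after a random delay.
func onMinVersionGossip(enableDedicatedRaftStorage func()) {
	// Spread the transition out over up to a minute across the cluster.
	delay := time.Duration(rand.Int63n(int64(time.Minute)))
	time.AfterFunc(delay, enableDedicatedRaftStorage)
}
```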
The idea of gossiping a min version can work (not in a failsafe way, of course). I'm not convinced that all of that should be done as a side effect of this PR, though. Can we decouple this? Make an issue, discuss a bit, remodel what's here with the proposed general mechanism.

Review status: all files reviewed at latest revision, 24 unresolved discussions, some commit checks failed.

Comments from Reviewable
I'm pretty wary of trying to maintain two raft logs at once while in the transitioning state. We're going to need a lot of error handling code to try to keep them in sync if one rocksdb instance is having problems and the other isn't. It's not quite the same as a distributed consensus problem, but it's still a consensus problem.
Could you perhaps give an instance of how this could come about in practice? Will help me better understand this.
If you're writing to two different engines, what if one of the writes fails (potentially repeatedly, if the disk is in really bad shape) and the other succeeds? Or the process crashes between the two writes? Now your two raft logs/states are out of sync and you need a way to reconcile them. In this context that could probably just mean trusting the engine whose raft log has more recent entries, but it requires a bunch of very sensitive extra code.
I think it's clear that this mode is fragile, but perhaps the "sensitive code" is the one you need for startup in transitioning mode anyway -- you need to harmonize both storages. BTW, I think we have enough commentary that this is a dicey problem -- what I haven't seen is an alternate proposal. I don't have one. |
I'd been assuming that the alternative was a hard switch from one engine to the other, with a single copy step (that could be done in either direction for upgrades/downgrades). |
So when upgrading from 1.0 to 1.1, the 1.1 version would use the old engines only (transitioning), then when you enable transitioning mode (as per RFC), it hard copies to the new ones, deletes from the old ones, and uses only the new ones. When restarted without transitioning (however that is signaled), it does it the opposite way. That seems straightforward and better, thanks for spelling it out for me! |
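A sketch of the single copy step described above, assuming a hypothetical minimal engine interface. The real key ranges, Engine API, and crash-safety handling would need much more care; the direction of the copy flips between upgrades and downgrades.

```go
package storage

type kvPair struct{ key, value []byte }

// migrEngine is a hypothetical minimal interface for this sketch.
type migrEngine interface {
	ScanRaftState() ([]kvPair, error) // all Raft log and HardState keys
	Put(key, value []byte) error
	Clear(key []byte) error
	Sync() error
}

// migrateRaftState copies all Raft state from src to dst and then removes it
// from src, so that exactly one engine holds the Raft data afterwards. Run
// with (base, raft) for an upgrade and (raft, base) for a downgrade.
func migrateRaftState(src, dst migrEngine) error {
	kvs, err := src.ScanRaftState()
	if err != nil {
		return err
	}
	for _, e := range kvs {
		if err := dst.Put(e.key, e.value); err != nil {
			return err
		}
	}
	// Make the copy durable before deleting the originals.
	if err := dst.Sync(); err != nil {
		return err
	}
	for _, e := range kvs {
		if err := src.Clear(e.key); err != nil {
			return err
		}
	}
	return nil
}
```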
The single hard switch step for upgrades (if I understand correctly) looks as
This approach unfortunately does not work for the way things are currently
Let's take a single log truncation request, the
A better approach would be to move log truncations downstream of raft, there
(excuse the verbosity, primarily documenting this for myself).
Sending log truncations through Raft is inefficient: the Raft log is not itself part of the replicated state. Instead, we only replicate the TruncatedState and, as a side effect, ClearRange() the affected key range. (This is an individual performance optimization whose impact we should measure; anecdotally it always looked like we were doing a lot of work for truncations during a write-heavy workload; this should alleviate that somewhat.) It also removes one migration concern for cockroachdb#16809, see cockroachdb#16809 (comment).
Review status: all files reviewed at latest revision, 25 unresolved discussions, some commit checks failed. pkg/server/config.go, line 267 at r4 (raw file):
I haven't fully grokked or read this PR, but a few thoughts (which you might already have had, apologies but it seemed better to write these down than to assume I'll find time to read the PR soon):
Comments from Reviewable |
Review status: all files reviewed at latest revision, 23 unresolved discussions, some commit checks failed. pkg/storage/replica.go, line 4406 at r4 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Yeah, double-writing like this is expensive and introduces consistency concerns between the two engines. I think we need to use double reads instead: The setting controls which engine raft values are written to, and when we read, we read from both and combine them. (The cost of this can be mitigated by a few fields on the Replica to cache the range of log indexes contained in each engine). Comments from Reviewable |
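A sketch of how the cached index ranges mentioned above could keep double reads cheap: the replica remembers which contiguous range of log indexes each engine holds and only falls back to reading both when that doesn't decide it. Field and method names are illustrative, not the PR's actual ones.

```go
package storage

// logBounds tracks the inclusive range of Raft log indexes an engine holds;
// a zero first index means the engine holds no entries for this replica.
type logBounds struct {
	first, last uint64
}

func (b logBounds) contains(index uint64) bool {
	return b.first != 0 && index >= b.first && index <= b.last
}

// replicaLogCache would live on the Replica and be updated on writes and
// truncations.
type replicaLogCache struct {
	baseBounds logBounds // entries still living in the base engine
	raftBounds logBounds // entries written to the dedicated Raft engine
}

// enginesForEntry reports which engine(s) to consult for a log index,
// falling back to reading both only when the cached bounds don't decide it.
func (c *replicaLogCache) enginesForEntry(index uint64) (readRaft, readBase bool) {
	switch {
	case c.raftBounds.contains(index):
		return true, false
	case c.baseBounds.contains(index):
		return false, true
	default:
		return true, true // unknown: read both and combine
	}
}
```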
Since the move to proposer-evaluated KV, we were potentially clobbering the HardState on splits as we accidentally moved HardState synthesis upstream of Raft as well. This change moves it downstream again. Though not strictly necessary, writing lastIndex was moved as well. This is cosmetic, though it aids @irfansharif's PR cockroachdb#16809, which moves lastIndex to the Raft engine. After this PR, neither HardState nor last index keys are added to the WriteBatch, so that pre-cockroachdb#16993 `TruncateLog` is the only remaining command that does so (and it, too, won't keep doing that for long). Note that there is no migration concern. Fixes cockroachdb#16749.
Sending log truncations through Raft is inefficient: the Raft log is not itself part of the replicated state. Instead, we only replicate the TruncatedState and, as a side effect, ClearRange() the affected key range. This is an individual performance optimization whose impact we should measure; anecdotally it always looked like we were doing a lot of work for truncations during a write-heavy workload; this should alleviate this somewhat. As explained above, the change isn't made for performance at this point, though. It also removes one migration concern for cockroachdb#16809, see cockroachdb#16809 (comment).

We'll need to migrate this. It's straightforward with the in-flight PR cockroachdb#16977.

- we're moving logic downstream of Raft. However, we can easily migrate it upstream again, without a real migration, though I don't think that's going to happen.
- the big upshot is hopefully a large reduction in complexity for @irfansharif's PR: log truncation is one of the odd cases that requires a RaftWriteBatch. cockroachdb#16749 is the only other one, and there the (correct) solution also involves going downstream of Raft for a Raft-related write. So, after solving both of those, I think RaftWriteBatch can go? cc @irfansharif
- as @petermattis pointed out, after @irfansharif's change, we should be able to not sync the base engine on truncation changes but do it only as we actually clear the log entries (which can be delayed as we see fit). So for 1000 log truncations across many ranges, we'll only have to sync once if that's how we set it up.
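A sketch of what applying a truncation downstream of Raft could look like under the scheme described above: the replicated TruncatedState advances, while clearing the actual log entries becomes a local, possibly deferred side effect on the Raft engine. All names and the deferred-sync handling are assumptions for illustration.

```go
package storage

// truncEngine is a hypothetical minimal interface for this sketch.
type truncEngine interface {
	ClearRange(start, end uint64) error // clear log entries in [start, end)
}

// applyTruncatedState is what a replica might do downstream of Raft: record
// the new truncated index in its replicated state and clear the entries as a
// local side effect, without sending the log deletions through Raft.
func applyTruncatedState(raftEng truncEngine, oldTruncatedIndex, newTruncatedIndex uint64) error {
	if newTruncatedIndex <= oldTruncatedIndex {
		return nil // nothing to do
	}
	// Clearing can be batched or deferred so many truncations share one sync.
	return raftEng.ClearRange(oldTruncatedIndex+1, newTruncatedIndex+1)
}
```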
Since the move to proposer-evaluated KV, we were potentially clobbering the HardState on splits as we accidentally moved HardState synthesis upstream of Raft as well. This change moves it downstream again. Though not strictly necessary, writing lastIndex was moved as well. This is cosmetic, though it aids @irfansharif's PR cockroachdb#16809, which moves lastIndex to the Raft engine. After this PR, neither HardState nor last index keys are added to the WriteBatch, so that pre-cockroachdb#16993 `TruncateLog` is the only remaining command that does so (and it, too, won't keep doing that for long). Migration concerns: a lease holder running the new version will propose splits that don't propose the HardState to Raft. A follower running the old version will not write the HardState downstream of Raft. In combination, the HardState would never get written, and would thus be incompatible with the TruncatedState. Thus, while 1.0 might be around, we're still sending the potentially dangerous HardState. Fixes cockroachdb#16749.
reminder: don't forget to rebase if you plan to continue work on this, and pick up the new RFC sections + naming convention. |