storage: don't send log truncations through Raft #16993
Conversation
Doesn't this add a migration concern? In a mixed-version cluster, a server with this PR will not generate a Raft command that truncates the Raft log for a 1.0.x server.

pkg/storage/replica_proposal.go, line 762 at r1 (raw file):
We definitely don't want to use
Does it? The below-Raft side effect is optional, and above Raft we can do what we want. The only thing that comes to mind is that when you run a mixed cluster for a long time, replicas running the old version may not actually remove their log entries.
That was my concern. That seems like it could be problematic. Maybe I'm not thinking it through properly.
Could be problematic, but my point is rather that it's easy to migrate this change - the "new version" just has to emit old-style truncations as long as it thinks there are still old nodes around. Pretty simple compared to other migrations on the table right now, and should be easy to cover.

pkg/storage/replica_proposal.go, line 762 at r1 (raw file): Previously, petermattis (Peter Mattis) wrote…
Done.
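To make the migration approach Tobias describes above concrete, here is a minimal sketch, assuming a simple "are all nodes new enough?" signal; none of these identifiers are from the PR itself:

```go
package storage

// truncationPlan describes how a TruncateLog request is carried out. The
// names here are illustrative stand-ins, not identifiers from this PR.
type truncationPlan struct {
	// deleteEntriesInProposedBatch mirrors the pre-PR behavior: the raft log
	// deletions travel through Raft as part of the replicated WriteBatch.
	deleteEntriesInProposedBatch bool
	// truncatedStateOnly mirrors the new behavior: only the TruncatedState is
	// replicated and each replica clears its own log entries as a side effect.
	truncatedStateOnly bool
}

// planTruncation keeps emitting old-style truncations as long as the cluster
// may still contain nodes (e.g. 1.0.x) that don't apply the new side effect.
func planTruncation(allNodesApplySideEffect bool) truncationPlan {
	if !allNodesApplySideEffect {
		// Mixed-version cluster: propose the deletions through Raft so that
		// old-version followers actually remove their log entries.
		return truncationPlan{deleteEntriesInProposedBatch: true}
	}
	return truncationPlan{truncatedStateOnly: true}
}
```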
Branch updated from 22b7835 to 2fac16a (Compare)
Reviewed 2 of 2 files at r2. pkg/storage/replica_command.go, line 1908 at r2 (raw file):
"could not compute ..." is more in line with our convention, I think. pkg/storage/replica_proposal.go, line 762 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Would be good to add a comment explaining why this doesn't pkg/storage/replica_proposal.go, line 764 at r2 (raw file):
extract below, or wrap in a function (instead of the scope) and defer it pkg/storage/replica_proposal.go, line 776 at r2 (raw file):
this isn't a real scope, right? could do without it.
pkg/storage/replica_command.go, line 1908 at r2 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done. pkg/storage/replica_proposal.go, line 762 at r1 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done. pkg/storage/replica_proposal.go, line 764 at r2 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done. pkg/storage/replica_proposal.go, line 776 at r2 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done.
pkg/storage/replica_proposal.go, line 762 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
(Should have clarified: I added a TODO for Peter to add a comment...)
Reviewed 2 of 2 files at r3.
pkg/storage/replica_proposal.go, line 764 at r3 (raw file):
Range deletion tombstones add a per-query overhead. The RocksDB folks recommend limiting the number that are used. Currently we only use them when deleting an entire range. Adding such tombstones for Raft log truncation is likely to result in an excessive number. I should really investigate how the range deletion tombstones are indexed within sstables to understand this better.
pkg/storage/replica_proposal.go, line 764 at r3 (raw file): Previously, petermattis (Peter Mattis) wrote…
Interested in understanding this better, too. The tombstones we make here are essentially
In addition to Peter's concern about migrating to this change from 1.0, this change moves logic downstream of raft. This will generally make future migrations harder. Maybe it's justifiable in this case, but we'll need to see what the performance numbers are like (or maybe we want to do this for other reasons, like the way it would interact with the migration needed for @irfansharif's change).
One nice bit this change would enable is that we could defer the actual deletion of Raft log entries in order to avoid needing to sync the normal (non-Raft log) engine.
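Purely as an illustration of the deferral Peter describes (none of these types exist in the PR; the engine interaction is reduced to callbacks), log deletions could be queued per range and flushed in batches so that many truncations share a single sync:

```go
package storage

import "sync"

// pendingTruncation records a raft log prefix that may now be deleted.
type pendingTruncation struct {
	rangeID    int64
	truncIndex uint64 // entries at or below this index are obsolete
}

// logGC batches deferred raft-log deletions so the engine is synced once per
// flush rather than once per truncation. All names are illustrative.
type logGC struct {
	mu      sync.Mutex
	pending []pendingTruncation
}

// enqueue records a truncation whose entries will be cleared later.
func (g *logGC) enqueue(t pendingTruncation) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.pending = append(g.pending, t)
}

// flush clears every queued log prefix and syncs the engine once at the end.
func (g *logGC) flush(clear func(pendingTruncation) error, syncEngine func() error) error {
	g.mu.Lock()
	work := g.pending
	g.pending = nil
	g.mu.Unlock()

	for _, t := range work {
		if err := clear(t); err != nil {
			return err
		}
	}
	// A single sync covers all truncations in this batch.
	return syncEngine()
}
```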
I'll play with some performance tests. We'll see what happens; the original motivation for whipping this up was to facilitate Irfan's migration. Will update, probably next week.
pkg/storage/replica_proposal.go, line 764 at r3 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Range tombstones are stored in a per-sstable "meta" block that is loaded when the sstable is opened. I'm not seeing any code which merges adjacent range tombstones together, though I did run across code which "collapses deletions". That could be it. When retrieving a key, all of the tombstones contained in each sstable encountered during retrieval are added to a
pkg/storage/replica_proposal.go (outdated):

```go
keys.RaftLogKey(r.RangeID, newTruncState.Index).PrefixEnd(),
)
iter := r.store.engine.NewIterator(false /* !prefix */)
if err := r.store.engine.ClearIterRange(iter, start, end); err != nil {
```
This generates per-key tombstones. Why not use ClearRange instead?
nm...i just read the discussion.
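For context on the trade-off in the exchange above: ClearIterRange deletes each raft log key individually (one RocksDB point tombstone per entry), while ClearRange writes a single range deletion tombstone covering the whole span. A schematic sketch, with the engine reduced to just the two calls under discussion (signatures are simplified, not the real engine API):

```go
package storage

// engine captures only the two deletion flavors under discussion; the real
// engine interface is larger and its signatures differ (this is a sketch).
type engine interface {
	// ClearIterRange deletes each key in the span individually, producing one
	// RocksDB point tombstone per raft log entry.
	ClearIterRange(start, end []byte) error
	// ClearRange writes a single RocksDB range deletion tombstone covering
	// the whole span.
	ClearRange(start, end []byte) error
}

// clearRaftLogSpan deletes raft log entries in [start, end). Range tombstones
// keep the tombstone count constant per truncation, but carry the per-query
// overhead discussed above, which is why per-key deletion is used here.
func clearRaftLogSpan(eng engine, start, end []byte, useRangeTombstone bool) error {
	if useRangeTombstone {
		return eng.ClearRange(start, end)
	}
	return eng.ClearIterRange(start, end)
}
```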
pkg/storage/replica_proposal.go, line 764 at r3 (raw file): Previously, petermattis (Peter Mattis) wrote…
As an additional datapoint for future archaeologists, SCATTER at one point created 1000s of these tombstones, and it really brought RocksDB to its knees: #16249 (comment)
Ok, I take that back. I don't actually want to play with performance tests just yet. Instead, I first want to discuss the merits of this PR under the assumption that performance stays pretty much the same for now. The relevant points here:
For performance, I think it'll be more obvious that it's a pro when the last bullet comes in. Right now, as the change is written, we save only a few on-the-wire bytes. That's nice, but not really something to write home about, and it's hard to demonstrate it, too. My suggestion is the following:
@bdarnell @petermattis Thoughts? @irfansharif please confirm that the above plan would really simplify your PR as I imagine.
SGTM. Reviewed 2 of 2 files at r4.

pkg/storage/replica_proposal.go, line 762 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Looks like you haven't pushed that TODO.
As an aside re: delaying actual log entry deletions, this is what I was alluding to here (comment).
Yup (
Yup, this batching and lazy actual deletion will be crucial. I'll be posting to #16624 shortly with some recent numbers, but log truncations (and by extension, base engine syncs) happen far too often right now to bring about the desired speedups.
Yup.
Cool. PR for #16749 is about to be posted, too.
SGTM too.
Since the move to proposer-evaluated KV, we were potentially clobbering the HardState on splits as we accidentally moved HardState synthesis upstream of Raft as well. This change moves it downstream again.

Though not strictly necessary, writing lastIndex was moved as well. This is cosmetic, though it aids @irfansharif's PR cockroachdb#16809, which moves lastIndex to the Raft engine.

After this PR, neither HardState nor last index keys are added to the WriteBatch, so that pre-cockroachdb#16993 `TruncateLog` is the only remaining command that does so (and it, too, won't keep doing that for long). Note that there is no migration concern.

Fixes cockroachdb#16749.
PTAL: I rebased on top of #17068.

pkg/storage/replica_proposal.go, line 762 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Added a comment.
pkg/storage/replica_command.go, line 1893 at r5 (raw file):
Was there a reason you changed

pkg/storage/replica_command.go, line 1896 at r5 (raw file):
@spencerkimball Per my comment in #16977, I think the ergonomics here are somewhat important. I'd like this to look something like a cluster setting:
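(Peter's concrete snippet is truncated above. Purely to illustrate the general shape of a cluster-setting-style gate, with entirely made-up names and a stand-in for the real settings machinery, it might look roughly like this; this is not his suggestion:)

```go
package storage

// boolSetting is a stand-in for a cluster setting handle; the real settings
// machinery lives elsewhere and looks different.
type boolSetting struct{ v bool }

// Get returns the current value of the setting.
func (s *boolSetting) Get() bool { return s.v }

// truncateLogBelowRaft would gate whether TruncateLog clears log entries as a
// downstream-of-Raft side effect (new behavior) or proposes the deletions
// through Raft (old behavior). Hypothetical name.
var truncateLogBelowRaft = &boolSetting{v: true}

// proposeLogDeletionsThroughRaft reports whether the old-style truncation
// should be emitted.
func proposeLogDeletionsThroughRaft() bool {
	return !truncateLogBelowRaft.Get()
}
```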
pkg/storage/replica_proposal.go, line 773 at r5 (raw file):
Closing
TFTR! Are you OK'ing the first commit (

pkg/storage/replica_command.go, line 1893 at r5 (raw file): Previously, petermattis (Peter Mattis) wrote…
No good reason, changing it back. pkg/storage/replica_proposal.go, line 773 at r5 (raw file): Previously, petermattis (Peter Mattis) wrote…
Done.
Yes,
Sending log truncations through Raft is inefficient: the Raft log is not itself part of the replicated state. Instead, we only replicate the TruncatedState and, as a side effect, ClearRange() the affected key range.

This is an individual performance optimization whose impact we should measure; anecdotally it always looked like we were doing a lot of work for truncations during a write-heavy workload, and this should alleviate that somewhat. As explained above, though, the change isn't made for performance at this point. It also removes one migration concern for cockroachdb#16809, see cockroachdb#16809 (comment).

- We'll need to migrate this. It's straightforward with the in-flight PR cockroachdb#16977.
- We're moving logic downstream of Raft. However, we can easily migrate it upstream again, without a real migration, though I don't think that's going to happen.
- The big upshot is hopefully a large reduction in complexity for @irfansharif's PR: log truncation is one of the odd cases that requires a RaftWriteBatch. cockroachdb#16749 is the only other one, and there the (correct) solution also involves going downstream of Raft for a Raft-related write. So, after solving both of those, I think RaftWriteBatch can go? cc @irfansharif
- As @petermattis pointed out, after @irfansharif's change, we should be able to not sync the base engine on truncation changes but do it only as we actually clear the log entries (which can be delayed as we see fit). So for 1000 log truncations across many ranges, we'll only have to sync once if that's how we set it up.
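To make the shape of the change easier to follow, here is a minimal sketch of the split between what gets replicated and what each replica does locally; the types and helpers are stand-ins, not the actual replica code:

```go
package storage

// truncatedState is the only piece that gets replicated; the raft log bytes
// themselves are not part of the replicated state. Stand-in types throughout.
type truncatedState struct {
	Index uint64 // entries at or below this index are truncated
	Term  uint64
}

// evalResult stands in for the result attached to a proposal above Raft.
type evalResult struct {
	TruncatedState *truncatedState
}

// evalTruncateLog runs above Raft: it records only the new TruncatedState
// rather than putting the log deletions into the replicated WriteBatch.
func evalTruncateLog(index, term uint64) evalResult {
	return evalResult{TruncatedState: &truncatedState{Index: index, Term: term}}
}

// applyTruncateLog runs on each replica below Raft: persist the new
// TruncatedState, then clear the now-obsolete log entries locally as a side
// effect (the ClearRange() mentioned above).
func applyTruncateLog(res evalResult,
	persist func(truncatedState) error,
	clearLogThrough func(index uint64) error) error {
	if res.TruncatedState == nil {
		return nil // not a truncation
	}
	if err := persist(*res.TruncatedState); err != nil {
		return err
	}
	return clearLogThrough(res.TruncatedState.Index)
}
```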
Since the move to proposer-evaluated KV, we were potentially clobbering the HardState on splits as we accidentally moved HardState synthesis upstream of Raft as well. This change moves it downstream again.

Though not strictly necessary, writing lastIndex was moved as well. This is cosmetic, though it aids @irfansharif's PR cockroachdb#16809, which moves lastIndex to the Raft engine.

After this PR, neither HardState nor last index keys are added to the WriteBatch, so that pre-cockroachdb#16993 `TruncateLog` is the only remaining command that does so (and it, too, won't keep doing that for long).

Migration concerns: a lease holder running the new version will propose splits that don't propose the HardState to Raft. A follower running the old version will not write the HardState downstream of Raft. In combination, the HardState would never get written, and would thus be incompatible with the TruncatedState. Thus, while 1.0 might be around, we're still sending the potentially dangerous HardState.

Fixes cockroachdb#16749.
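As a rough illustration of the clobbering hazard described above (purely schematic; raftpb.HardState is reduced to a plain struct and the synthesis step to a merge), writing the right-hand side's HardState below Raft means each replica must avoid regressing state it already recorded:

```go
package storage

// raftHardState is a simplified stand-in for raftpb.HardState.
type raftHardState struct {
	Term, Vote, Commit uint64
}

// writeRHSHardState sketches the downstream-of-Raft step: when a replica
// applies a split, it writes the right-hand side's HardState locally rather
// than receiving it in the replicated WriteBatch. The merge below captures
// the safety property at stake: never clobber a Term, Vote, or Commit the
// RHS replica already recorded. Illustrative only.
func writeRHSHardState(existing *raftHardState, synthesized raftHardState,
	write func(raftHardState) error) error {
	if existing != nil {
		if existing.Term > synthesized.Term {
			synthesized.Term = existing.Term
		}
		if existing.Vote != 0 {
			synthesized.Vote = existing.Vote
		}
		if existing.Commit > synthesized.Commit {
			synthesized.Commit = existing.Commit
		}
	}
	return write(synthesized)
}
```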
Raft log truncations currently perform two steps (there may be others, but for the sake of this discussion, let's consider only these two):

1. Above raft, they compute the stats of all raft log entries up to the truncation entry.
2. Beneath raft, they use ClearIterRange to clear all raft log entries up to the truncation entry.

In both steps, operations are performed on all entries up to the truncation entry, and in both steps these operations start from entry 0. A comment added in cockroachdb#16993 gives some idea as to why:

> // We start at index zero because it's always possible that a previous
> // truncation did not clean up entries made obsolete by the previous
> // truncation.

My current understanding is that this case where a Raft log has been truncated but its entries not cleaned up is only possible if a node crashes between `applyRaftCommand` and `handleEvalResultRaftMuLocked`. This brings up the question: why don't we truncate raft entries downstream of raft in `applyRaftCommand`? That way, the entries could be deleted atomically with the update to the `RaftTruncatedStateKey` and we wouldn't have to worry about them ever diverging or Raft entries being leaked. That seems like a trivial change, and if that was the case, would the approach here be safe? I don't see a reason why not.

For motivation on why we should explore this, I've found that when running `sysbench oltp_insert` on a fresh cluster without pre-splits to measure single-range write throughput, raft log truncation accounts for about 20% of CPU utilization. If we switch the ClearIterRange to a ClearRange downstream of raft, we improve throughput by 13% and reduce the amount of CPU that raft log truncation uses to about 5%. It's obvious why this speeds up the actual truncation itself downstream of raft. The reason why it speeds up the stats computation is less clear, but it may be allowing a RocksDB iterator to more easily skip over the deleted entry keys.

If we make the change proposed here, we improve throughput by 28% and reduce the amount of CPU that raft log truncation uses to a negligible amount (< 1%, hard to tell exactly). The reason this speeds up both the truncation and the stats computation is that it avoids iterating over RocksDB tombstones for all Raft entries that have ever existed on the range. The throughput improvements are of course exaggerated because we are isolating the throughput of a single range, but they're significant enough to warrant exploration about whether we can make this approach work.

Finally, the outsized impact of this small change naturally justifies further exploration. If we could make the change here safe (i.e. if we could depend on replica.FirstIndex() to always be a lower bound on raft log entry keys), could we make similar changes elsewhere? Are there other places where we iterate over an entire raft log keyspace and inadvertently run into all of the deletion tombstones when we could simply skip to the `replica.FirstIndex()`? At a minimum, I believe that `clearRangeData` fits this description, so there may be room to speed up snapshots and replica GC.

Release note (performance improvement): Reduce the cost of Raft log truncations and increase single-range throughput.
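A minimal sketch of the bounded truncation proposed above, with key construction and the clearing primitive passed in as callbacks; the helper itself is illustrative, not the code from the referenced PR:

```go
package storage

// clearTruncatedEntries clears raft log entries starting at the replica's
// FirstIndex instead of at entry zero, so the deletion (and any stats
// computation walking the same span) never wades through tombstones for
// entries already removed by earlier truncations. This relies on FirstIndex
// being a lower bound on the log keys that can still exist, which holds if
// entries are deleted atomically with the RaftTruncatedStateKey update.
func clearTruncatedEntries(firstIndex, truncIndex uint64,
	raftLogKey func(index uint64) []byte,
	clearRange func(start, end []byte) error) error {
	if firstIndex > truncIndex {
		return nil // nothing to clear
	}
	start := raftLogKey(firstIndex)
	// Entries at or below truncIndex are obsolete; this mirrors the
	// PrefixEnd() on the truncation index in the snippet earlier in the thread.
	end := raftLogKey(truncIndex + 1)
	return clearRange(start, end)
}
```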
Please discuss - this looks like an easy win. I just whipped this up on a whim, so I hope I'm not missing something obvious here.
--
Sending log truncations through Raft is inefficient: the Raft log is not itself part of the replicated state. Instead, we only replicate the TruncatedState and, as a side effect, ClearRange() the affected key range.

This is an individual performance optimization whose impact we should measure; anecdotally it always looked like we were doing a lot of work for truncations during a write-heavy workload, and this should alleviate that somewhat.

It also removes one migration concern for #16809, see #16809 (comment).