storage: don't send log truncations through Raft #16993
Conversation
Doesn't this add a migration concern? In a mixed-version cluster, a server with this PR will not generate a Raft command that truncates the Raft log for a 1.0.x server.

pkg/storage/replica_proposal.go, line 762 at r1 (raw file):
We definitely don't want to use
Does it? The below-Raft side effect is optional, and above Raft we can do what we want. The only thing that comes to mind is that when you run a mixed cluster for a long time, replicas running the old version may not actually remove their log entries.
That was my concern. That seems like it could be problematic. Maybe I'm not thinking it through properly.
Could be problematic, but my point is rather that it's easy to migrate this change - the "new version" just has to emit old-style truncations as long as it thinks there are still old nodes around. Pretty simple compared to other migrations on the table right now, and should be easy to cover.

pkg/storage/replica_proposal.go, line 762 at r1 (raw file): Previously, petermattis (Peter Mattis) wrote…
Done.
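To make the migration approach Tobias describes above concrete, here is a minimal sketch, assuming a simple "are all nodes new enough?" signal; none of these identifiers are from the PR itself:

```go
package storage

// truncationPlan describes how a TruncateLog request is carried out. The
// names here are illustrative stand-ins, not identifiers from this PR.
type truncationPlan struct {
	// deleteEntriesInProposedBatch mirrors the pre-PR behavior: the raft log
	// deletions travel through Raft as part of the replicated WriteBatch.
	deleteEntriesInProposedBatch bool
	// truncatedStateOnly mirrors the new behavior: only the TruncatedState is
	// replicated and each replica clears its own log entries as a side effect.
	truncatedStateOnly bool
}

// planTruncation keeps emitting old-style truncations as long as the cluster
// may still contain nodes (e.g. 1.0.x) that don't apply the new side effect.
func planTruncation(allNodesApplySideEffect bool) truncationPlan {
	if !allNodesApplySideEffect {
		// Mixed-version cluster: propose the deletions through Raft so that
		// old-version followers actually remove their log entries.
		return truncationPlan{deleteEntriesInProposedBatch: true}
	}
	return truncationPlan{truncatedStateOnly: true}
}
```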
Branch updated from 22b7835 to 2fac16a (Compare)
Reviewed 2 of 2 files at r2. pkg/storage/replica_command.go, line 1908 at r2 (raw file):
"could not compute ..." is more in line with our convention, I think. pkg/storage/replica_proposal.go, line 762 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Would be good to add a comment explaining why this doesn't pkg/storage/replica_proposal.go, line 764 at r2 (raw file):
extract below, or wrap in a function (instead of the scope) and defer it pkg/storage/replica_proposal.go, line 776 at r2 (raw file):
this isn't a real scope, right? could do without it.
pkg/storage/replica_command.go, line 1908 at r2 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done. pkg/storage/replica_proposal.go, line 762 at r1 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done. pkg/storage/replica_proposal.go, line 764 at r2 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done. pkg/storage/replica_proposal.go, line 776 at r2 (raw file): Previously, tamird (Tamir Duberstein) wrote…
Done.
pkg/storage/replica_proposal.go, line 762 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
(Should have clarified: I added a TODO for Peter to add a comment...)
Reviewed 2 of 2 files at r3.
pkg/storage/replica_proposal.go, line 764 at r3 (raw file):
Range deletion tombstones add a per-query overhead. The RocksDB folks recommend limiting the number that are used. Currently we only use them when deleting an entire range. Adding such tombstones for Raft log truncation is likely to result in an excessive number. I should really investigate how the range deletion tombstones are indexed within sstables to understand this better.
pkg/storage/replica_proposal.go, line 764 at r3 (raw file): Previously, petermattis (Peter Mattis) wrote…
Interested in understanding this better, too. The tombstones we make here are essentially
In addition to Peter's concern about migrating to this change from 1.0, this change moves logic downstream of raft. This will generally make future migrations harder. Maybe it's justifiable in this case, but we'll need to see what the performance numbers are like (or maybe we want to do this for other reasons, like the way it would interact with the migration needed for @irfansharif's change).
One nice bit this change would enable is that we could defer the actual deletion of Raft log entries in order to avoid needing to sync the normal (non-Raft log) engine.
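Purely as an illustration of the deferral Peter describes (none of these types exist in the PR; the engine interaction is reduced to callbacks), log deletions could be queued per range and flushed in batches so that many truncations share a single sync:

```go
package storage

import "sync"

// pendingTruncation records a raft log prefix that may now be deleted.
type pendingTruncation struct {
	rangeID    int64
	truncIndex uint64 // entries at or below this index are obsolete
}

// logGC batches deferred raft-log deletions so the engine is synced once per
// flush rather than once per truncation. All names are illustrative.
type logGC struct {
	mu      sync.Mutex
	pending []pendingTruncation
}

// enqueue records a truncation whose entries will be cleared later.
func (g *logGC) enqueue(t pendingTruncation) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.pending = append(g.pending, t)
}

// flush clears every queued log prefix and syncs the engine once at the end.
func (g *logGC) flush(clear func(pendingTruncation) error, syncEngine func() error) error {
	g.mu.Lock()
	work := g.pending
	g.pending = nil
	g.mu.Unlock()

	for _, t := range work {
		if err := clear(t); err != nil {
			return err
		}
	}
	// A single sync covers all truncations in this batch.
	return syncEngine()
}
```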
I'll play with some performance tests. We'll see what happens; the original motivation for whipping this up was to facilitate Irfan's migration. Will update, probably next week.
pkg/storage/replica_proposal.go, line 764 at r3 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Range tombstones are stored in a per-sstable "meta" block that is loaded when the sstable is opened. I'm not seeing any code which merges adjacent range tombstones together, though I did run across code which "collapses deletions". That could be it. When retrieving a key, all of the tombstones contained in each sstable encountered during retrieval are added to a
pkg/storage/replica_proposal.go (outdated):

```go
keys.RaftLogKey(r.RangeID, newTruncState.Index).PrefixEnd(),
)
iter := r.store.engine.NewIterator(false /* !prefix */)
if err := r.store.engine.ClearIterRange(iter, start, end); err != nil {
```
This generates per-key tombstones. Why not use ClearRange instead?
nm...i just read the discussion.
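For context on the trade-off in the exchange above: ClearIterRange deletes each raft log key individually (one RocksDB point tombstone per entry), while ClearRange writes a single range deletion tombstone covering the whole span. A schematic sketch, with the engine reduced to just the two calls under discussion (signatures are simplified, not the real engine API):

```go
package storage

// engine captures only the two deletion flavors under discussion; the real
// engine interface is larger and its signatures differ (this is a sketch).
type engine interface {
	// ClearIterRange deletes each key in the span individually, producing one
	// RocksDB point tombstone per raft log entry.
	ClearIterRange(start, end []byte) error
	// ClearRange writes a single RocksDB range deletion tombstone covering
	// the whole span.
	ClearRange(start, end []byte) error
}

// clearRaftLogSpan deletes raft log entries in [start, end). Range tombstones
// keep the tombstone count constant per truncation, but carry the per-query
// overhead discussed above, which is why per-key deletion is used here.
func clearRaftLogSpan(eng engine, start, end []byte, useRangeTombstone bool) error {
	if useRangeTombstone {
		return eng.ClearRange(start, end)
	}
	return eng.ClearIterRange(start, end)
}
```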
pkg/storage/replica_proposal.go, line 764 at r3 (raw file): Previously, petermattis (Peter Mattis) wrote…
As an additional datapoint for future archaeologists, SCATTER at one point created 1000s of these tombstones, and it really brought RocksDB to its knees: #16249 (comment)
Ok, I take that back. I don't actually want to play with performance tests just yet. Instead, I first want to discuss the merits of this PR under the assumption that performance stays pretty much the same for now. The relevant points here:
For performance, I think it'll be more obvious that it's a pro when the last bullet comes in. Right now, as the change is written, we save only a few on-the-wire bytes. That's nice, but not really something to write home about, and it's hard to demonstrate it, too. My suggestion is the following:
@bdarnell @petermattis Thoughts? @irfansharif please confirm that the above plan would really simplify your PR as I imagine.
SGTM. Reviewed 2 of 2 files at r4.

pkg/storage/replica_proposal.go, line 762 at r1 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Looks like you haven't pushed that TODO.
As an aside re: delaying actual log entry deletions, this is what I was alluding to here (comment).
Yup (
Yup, this batching and lazy actual deletion will be crucial. I'll be posting to #16624 shortly with some recent numbers, but log truncations (and by extension, base engine syncs) happen far too often right now to bring about the desired speedups.
Yup.
Cool. PR for #16749 is about to be posted, too.
SGTM too.
Since the move to proposer-evaluated KV, we were potentially clobbering the HardState on splits as we accidentally moved HardState synthesis upstream of Raft as well. This change moves it downstream again.

Though not strictly necessary, writing lastIndex was moved as well. This is cosmetic, though it aids @irfansharif's PR cockroachdb#16809, which moves lastIndex to the Raft engine.

After this PR, neither HardState nor last index keys are added to the WriteBatch, so that pre-cockroachdb#16993 `TruncateLog` is the only remaining command that does so (and it, too, won't keep doing that for long). Note that there is no migration concern.

Fixes cockroachdb#16749.
PTAL: I rebased on top of #17068.

pkg/storage/replica_proposal.go, line 762 at r1 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Added a comment.
pkg/storage/replica_command.go, line 1893 at r5 (raw file):
Was there a reason you changed

pkg/storage/replica_command.go, line 1896 at r5 (raw file):
@spencerkimball Per my comment in #16977, I think the ergonomics here are somewhat important. I'd like this to look something like a cluster setting:
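(Peter's concrete snippet is truncated above. Purely to illustrate the general shape of a cluster-setting-style gate, with entirely made-up names and a stand-in for the real settings machinery, it might look roughly like this; this is not his suggestion:)

```go
package storage

// boolSetting is a stand-in for a cluster setting handle; the real settings
// machinery lives elsewhere and looks different.
type boolSetting struct{ v bool }

// Get returns the current value of the setting.
func (s *boolSetting) Get() bool { return s.v }

// truncateLogBelowRaft would gate whether TruncateLog clears log entries as a
// downstream-of-Raft side effect (new behavior) or proposes the deletions
// through Raft (old behavior). Hypothetical name.
var truncateLogBelowRaft = &boolSetting{v: true}

// proposeLogDeletionsThroughRaft reports whether the old-style truncation
// should be emitted.
func proposeLogDeletionsThroughRaft() bool {
	return !truncateLogBelowRaft.Get()
}
```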
pkg/storage/replica_proposal.go, line 773 at r5 (raw file):
Closing
TFTR! Are you OK'ing the first commit (

pkg/storage/replica_command.go, line 1893 at r5 (raw file): Previously, petermattis (Peter Mattis) wrote…
No good reason, changing it back. pkg/storage/replica_proposal.go, line 773 at r5 (raw file): Previously, petermattis (Peter Mattis) wrote…
Done.
Yes,
Sending log truncations through Raft is inefficient: the Raft log is not itself part of the replicated state. Instead, we only replicate the TruncatedState and, as a side effect, ClearRange() the affected key range.

This is an individual performance optimization whose impact we should measure; anecdotally it always looked like we were doing a lot of work for truncations during a write-heavy workload, and this should alleviate that somewhat. As explained above, though, the change isn't made for performance at this point. It also removes one migration concern for cockroachdb#16809, see cockroachdb#16809 (comment).

- We'll need to migrate this. It's straightforward with the in-flight PR cockroachdb#16977.
- We're moving logic downstream of Raft. However, we can easily migrate it upstream again, without a real migration, though I don't think that's going to happen.
- The big upshot is hopefully a large reduction in complexity for @irfansharif's PR: log truncation is one of the odd cases that requires a RaftWriteBatch. cockroachdb#16749 is the only other one, and there the (correct) solution also involves going downstream of Raft for a Raft-related write. So, after solving both of those, I think RaftWriteBatch can go? cc @irfansharif
- As @petermattis pointed out, after @irfansharif's change, we should be able to not sync the base engine on truncation changes but do it only as we actually clear the log entries (which can be delayed as we see fit). So for 1000 log truncations across many ranges, we'll only have to sync once if that's how we set it up.
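To make the shape of the change easier to follow, here is a minimal sketch of the split between what gets replicated and what each replica does locally; the types and helpers are stand-ins, not the actual replica code:

```go
package storage

// truncatedState is the only piece that gets replicated; the raft log bytes
// themselves are not part of the replicated state. Stand-in types throughout.
type truncatedState struct {
	Index uint64 // entries at or below this index are truncated
	Term  uint64
}

// evalResult stands in for the result attached to a proposal above Raft.
type evalResult struct {
	TruncatedState *truncatedState
}

// evalTruncateLog runs above Raft: it records only the new TruncatedState
// rather than putting the log deletions into the replicated WriteBatch.
func evalTruncateLog(index, term uint64) evalResult {
	return evalResult{TruncatedState: &truncatedState{Index: index, Term: term}}
}

// applyTruncateLog runs on each replica below Raft: persist the new
// TruncatedState, then clear the now-obsolete log entries locally as a side
// effect (the ClearRange() mentioned above).
func applyTruncateLog(res evalResult,
	persist func(truncatedState) error,
	clearLogThrough func(index uint64) error) error {
	if res.TruncatedState == nil {
		return nil // not a truncation
	}
	if err := persist(*res.TruncatedState); err != nil {
		return err
	}
	return clearLogThrough(res.TruncatedState.Index)
}
```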
Since the move to proposer-evaluated KV, we were potentially clobbering the HardState on splits as we accidentally moved HardState synthesis upstream of Raft as well. This change moves it downstream again.

Though not strictly necessary, writing lastIndex was moved as well. This is cosmetic, though it aids @irfansharif's PR cockroachdb#16809, which moves lastIndex to the Raft engine.

After this PR, neither HardState nor last index keys are added to the WriteBatch, so that pre-cockroachdb#16993 `TruncateLog` is the only remaining command that does so (and it, too, won't keep doing that for long).

Migration concerns: a lease holder running the new version will propose splits that don't propose the HardState to Raft. A follower running the old version will not write the HardState downstream of Raft. In combination, the HardState would never get written, and would thus be incompatible with the TruncatedState. Thus, while 1.0 might be around, we're still sending the potentially dangerous HardState.

Fixes cockroachdb#16749.
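As a rough illustration of the clobbering hazard described above (purely schematic; raftpb.HardState is reduced to a plain struct and the synthesis step to a merge), writing the right-hand side's HardState below Raft means each replica must avoid regressing state it already recorded:

```go
package storage

// raftHardState is a simplified stand-in for raftpb.HardState.
type raftHardState struct {
	Term, Vote, Commit uint64
}

// writeRHSHardState sketches the downstream-of-Raft step: when a replica
// applies a split, it writes the right-hand side's HardState locally rather
// than receiving it in the replicated WriteBatch. The merge below captures
// the safety property at stake: never clobber a Term, Vote, or Commit the
// RHS replica already recorded. Illustrative only.
func writeRHSHardState(existing *raftHardState, synthesized raftHardState,
	write func(raftHardState) error) error {
	if existing != nil {
		if existing.Term > synthesized.Term {
			synthesized.Term = existing.Term
		}
		if existing.Vote != 0 {
			synthesized.Vote = existing.Vote
		}
		if existing.Commit > synthesized.Commit {
			synthesized.Commit = existing.Commit
		}
	}
	return write(synthesized)
}
```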
Raft log truncations currently perform two steps (there may be others, but for the sake of this discussion, let's consider only these two):

1. Above raft, they compute the stats of all raft log entries up to the truncation entry.
2. Beneath raft, they use ClearIterRange to clear all raft log entries up to the truncation entry.

In both steps, operations are performed on all entries up to the truncation entry, and in both steps these operations start from entry 0. A comment added in cockroachdb#16993 gives some idea as to why:

> // We start at index zero because it's always possible that a previous
> // truncation did not clean up entries made obsolete by the previous
> // truncation.

My current understanding is that this case where a Raft log has been truncated but its entries not cleaned up is only possible if a node crashes between `applyRaftCommand` and `handleEvalResultRaftMuLocked`. This brings up the question: why don't we truncate raft entries downstream of raft in `applyRaftCommand`? That way, the entries could be deleted atomically with the update to the `RaftTruncatedStateKey` and we wouldn't have to worry about them ever diverging or Raft entries being leaked. That seems like a trivial change, and if that was the case, would the approach here be safe? I don't see a reason why not.

For motivation on why we should explore this, I've found that when running `sysbench oltp_insert` on a fresh cluster without pre-splits to measure single-range write throughput, raft log truncation accounts for about 20% of CPU utilization. If we switch the ClearIterRange to a ClearRange downstream of raft, we improve throughput by 13% and reduce the amount of CPU that raft log truncation uses to about 5%. It's obvious why this speeds up the actual truncation itself downstream of raft. The reason why it speeds up the stats computation is less clear, but it may be allowing a RocksDB iterator to more easily skip over the deleted entry keys.

If we make the change proposed here, we improve throughput by 28% and reduce the amount of CPU that raft log truncation uses to a negligible amount (< 1%, hard to tell exactly). The reason this speeds up both the truncation and the stats computation is that it avoids iterating over RocksDB tombstones for all Raft entries that have ever existed on the range. The throughput improvements are of course exaggerated because we are isolating the throughput of a single range, but they're significant enough to warrant exploration about whether we can make this approach work.

Finally, the outsized impact of this small change naturally justifies further exploration. If we could make the change here safe (i.e. if we could depend on replica.FirstIndex() to always be a lower bound on raft log entry keys), could we make similar changes elsewhere? Are there other places where we iterate over an entire raft log keyspace and inadvertently run into all of the deletion tombstones when we could simply skip to the `replica.FirstIndex()`? At a minimum, I believe that `clearRangeData` fits this description, so there may be room to speed up snapshots and replica GC.

Release note (performance improvement): Reduce the cost of Raft log truncations and increase single-range throughput.
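A minimal sketch of the bounded truncation proposed above, with key construction and the clearing primitive passed in as callbacks; the helper itself is illustrative, not the code from the referenced PR:

```go
package storage

// clearTruncatedEntries clears raft log entries starting at the replica's
// FirstIndex instead of at entry zero, so the deletion (and any stats
// computation walking the same span) never wades through tombstones for
// entries already removed by earlier truncations. This relies on FirstIndex
// being a lower bound on the log keys that can still exist, which holds if
// entries are deleted atomically with the RaftTruncatedStateKey update.
func clearTruncatedEntries(firstIndex, truncIndex uint64,
	raftLogKey func(index uint64) []byte,
	clearRange func(start, end []byte) error) error {
	if firstIndex > truncIndex {
		return nil // nothing to clear
	}
	start := raftLogKey(firstIndex)
	// Entries at or below truncIndex are obsolete; this mirrors the
	// PrefixEnd() on the truncation index in the snippet earlier in the thread.
	end := raftLogKey(truncIndex + 1)
	return clearRange(start, end)
}
```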
Please discuss - this looks like an easy win. I just whipped this up on a whim, so I hope I'm not missing something obvious here.
--
Sending log truncations through Raft is inefficient: the Raft log is not itself part of the replicated state. Instead, we only replicate the TruncatedState and, as a side effect, ClearRange() the affected key range.

This is an individual performance optimization whose impact we should measure; anecdotally it always looked like we were doing a lot of work for truncations during a write-heavy workload, and this should alleviate that somewhat.

It also removes one migration concern for #16809, see #16809 (comment).