storage: batch command application and coalesce applied state per batch #38568
Conversation
Force-pushed a57a49c to e32ad08.
Force-pushed e32ad08 to 006f72e.
In the initial version of this revision I had been accidentally stepping the raft machine for each entry (instead of just conf changes). This was remedied early on but I hadn't re-run the benchmarks. They have now been updated to show a solid 48% throughput win on KV0/seq=true. I'm sure there's more commentary to add to this PR and I'm sure I've made some things stale but on the whole it's a pretty exciting win.
Very cool!
Before this change each entry would update this applied state and each of those writes would end up
in the WAL, mem-table, and L0 just to be compacted away in L1
My understanding was that they would be collapsed when compacting from the mem-table to L0, not from L0 to L1. Am I confused?
Reviewed 4 of 7 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)
pkg/storage/application_state_buf.go, line 21 at r1 (raw file):
// applicationStateBufNode.
// TODO(ajwerner): justify this number.
const applicationStateBufSize = 8
s/applicationStateBufSize/applicationStateBufChunkSize/
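For context, the chunked-buffer idea behind the constant might look like the following sketch (illustrative types only; the real buf stores entryApplicationState values rather than ints): entries live in fixed-size chunks, so appends amortize allocation and entries never move, keeping pointers into the buf stable.

```go
package main

import "fmt"

const bufNodeSize = 8 // entries per chunk; matches the TODO'd constant

// stateBufNode is one fixed-size chunk in a linked list of chunks.
type stateBufNode struct {
	entries [bufNodeSize]int // stand-in for entryApplicationState
	len     int
	next    *stateBufNode
}

// stateBuf is a growable FIFO allocated in chunks: appending N entries
// costs ~N/bufNodeSize allocations and existing entries never move.
type stateBuf struct {
	head, tail *stateBufNode
	len        int
}

// allocateBack returns a pointer to a new zeroed slot at the back.
func (b *stateBuf) allocateBack() *int {
	if b.tail == nil || b.tail.len == bufNodeSize {
		n := &stateBufNode{}
		if b.tail == nil {
			b.head = n
		} else {
			b.tail.next = n
		}
		b.tail = n
	}
	s := &b.tail.entries[b.tail.len]
	b.tail.len++
	b.len++
	return s
}

func main() {
	var b stateBuf
	for i := 0; i < 20; i++ {
		*b.allocateBack() = i
	}
	chunks := 0
	for n := b.head; n != nil; n = n.next {
		chunks++
	}
	fmt.Println(b.len, chunks) // 20 entries across ceil(20/8) = 3 chunks
}
```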
pkg/storage/replica_application.go, line 37 at r1 (raw file):
)
// applicationState stores the state required to apply an entry to a
s/an entry/a single entry/
pkg/storage/replica_application.go, line 39 at r1 (raw file):
// applicationState stores the state required to apply an entry to a
// replica. The state is accumulated in stages which occur in
// Replica.handleCommittedEntriesRaftMuLocked. From a high level, a command is
Enumerating the steps here would be helpful. Or maybe add a bit of pseudo code like we have in #37426.
pkg/storage/replica_application.go, line 46 at r1 (raw file):
// Then the batch is committed, the side-effects are applied and the local
// result is processed.
type applicationState struct {
nit: something about the name applicationState
seems particularly overloaded to me in the world of computer science. I know you're using "application" to mean Raft entry application, but it still jumps out at me after reviewing this entire PR. It also doesn't do a good job conveying that this is per-entry state and nothing more. Did you consider any other names? Maybe:
entryApplication
entryApplicationState
entAppliedState
entAppState
pkg/storage/replica_application.go, line 50 at r1 (raw file):
// proposal's context if this entry was proposed locally.
ctx context.Context
// e is th Entry being applied.
the
pkg/storage/replica_application.go, line 54 at r1 (raw file):
// The below fields are decoded from e.
Remove this line so that it's clear that you're just referring to this group of four fields.
I do like that you added this comment and grouped the fields like this though. It would be helpful to a reader of this code if all of the groups of fields were commented as well. For instance, the next group would be // The below fields are populated on the proposing Replica only and come from the Replica's proposal map.
pkg/storage/replica_application.go, line 60 at r1 (raw file):
ccCtx ConfChangeContext // only populated for conf changes
proposedLocally bool
Consider making this a method to reduce the footprint of the applicationStateBuf
.
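The suggestion above might look like this (hypothetical field names, not the actual code): derive proposedLocally from the proposal pointer instead of storing a bool, shaving a field off a struct the applicationStateBuf holds many copies of.

```go
package main

import "fmt"

type proposal struct{ id int64 }

// entryApplicationState is a stand-in for the per-entry state struct;
// the proposal pointer is non-nil only if proposed on this replica.
type entryApplicationState struct {
	proposal *proposal
}

// proposedLocally is computed rather than stored, so the buffered
// struct carries one fewer field.
func (s *entryApplicationState) proposedLocally() bool {
	return s.proposal != nil
}

func main() {
	local := entryApplicationState{proposal: &proposal{id: 1}}
	var remote entryApplicationState
	fmt.Println(local.proposedLocally(), remote.proposedLocally()) // true false
}
```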
pkg/storage/replica_application.go, line 67 at r1 (raw file):
leaseIndex uint64
lResult *result.LocalResult
nit: s/lResult/localResult/
pkg/storage/replica_application.go, line 81 at r1 (raw file):
// At the time of writing it is possible that the current conditions are too
// strict but they are certainly sufficient.
func isTrivial(usingStateAppliedKey bool, r *storagepb.ReplicatedEvalResult) bool {
nit: move the *storagepb.ReplicatedEvalResult
up because that's the more important argument.
pkg/storage/replica_application.go, line 94 at r1 (raw file):
r.State.Lease == nil && r.State.Desc == nil && (usingStateAppliedKey || !r.State.UsingAppliedStateKey)))
I would have expected this to be (!usingStateAppliedKey || r.State.UsingAppliedStateKey)
. That's the condition where we upgrade to using the applied state key, right?
Also, it would almost certainly be ok to consider all commands during the applied state key upgrade non-trivial if that simplifies the code. I think that would allow us to get rid of the usingStateAppliedKey
arg.
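As a reference point for this thread, a simplified stand-in for the check (only a subset of the real conditions, with illustrative field names): a command that performs the applied-state-key upgrade is non-trivial, while commands carrying the flag after the upgrade has happened are trivial again.

```go
package main

import "fmt"

// replicatedEvalResult is a simplified stand-in for
// storagepb.ReplicatedEvalResult with only the fields relevant here.
type replicatedEvalResult struct {
	Split                *struct{}
	Merge                *struct{}
	ChangeReplicas       *struct{}
	UsingAppliedStateKey bool
}

// isTrivial reports whether a command's side effects can be coalesced
// into a shared batch. The final clause encodes the discussion above:
// performing the upgrade is non-trivial; once usingStateAppliedKey is
// already true, the flag on later commands is trivial.
func isTrivial(usingStateAppliedKey bool, r *replicatedEvalResult) bool {
	return r.Split == nil && r.Merge == nil && r.ChangeReplicas == nil &&
		(usingStateAppliedKey || !r.UsingAppliedStateKey)
}

func main() {
	upgrade := &replicatedEvalResult{UsingAppliedStateKey: true}
	fmt.Println(isTrivial(false, upgrade)) // false: performs the upgrade
	fmt.Println(isTrivial(true, upgrade))  // true: already upgraded
}
```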
pkg/storage/replica_application.go, line 102 at r1 (raw file):
}
// decode decodes s.e into the other fields of s.
It seems like this could be merged with initialize
to simplify the usage of applicationState
.
pkg/storage/replica_application.go, line 149 at r1 (raw file):
// commands. Committed raft commands are applied to the batch in a multi-stage
// process whereby individual commands are decoded, prepared for application
// relative to the current view of replicaState.
This sentence seemed to end abruptly.
pkg/storage/replica_application.go, line 154 at r1 (raw file):
stats enginepb.MVCCStats
replicaState storagepb.ReplicaState
updatedTruncatedState bool
Why are we allowing multiple Raft log truncations in a single batch instead of making them non-trivial commands?
pkg/storage/replica_application.go, line 182 at r1 (raw file):
applying it its batch
pkg/storage/replica_application.go, line 185 at r1 (raw file):
// This function is called after a batch has been written to the storage engine.
// For trivial commands this function should result is a zero value rResult.
func clearTrivialReplicatedEvalResultFields(
This function seems to belong right above handleComplexReplicatedEvalResult
.
pkg/storage/replica_application.go, line 190 at r1 (raw file):
// Fields for which no action is taken in this method are zeroed so that
// they don't trigger an assertion at the end of the method (which checks
// that all fields were handled).
It doesn't trigger an assertion anymore.
pkg/storage/replica_application.go, line 237 at r1 (raw file):
log.Infof(ctx, "flushing batch %v %v", b.replicaState, b.appStates.len)
}
// TODO(ajwerner): clean up this sanity check
Reminder to address this TODO.
pkg/storage/replica_application.go, line 242 at r1 (raw file):
}
if err := b.batch.Commit(false); err != nil {
Might as well leave a note about why we don't need to sync here.
pkg/storage/replica_application.go, line 296 at r1 (raw file):
r.mu.state.RaftAppliedIndex = b.replicaState.RaftAppliedIndex
r.mu.state.LeaseAppliedIndex = b.replicaState.LeaseAppliedIndex
deltaStats.Subtract(*r.mu.state.Stats)
nit: you can assign *r.mu.state.Stats
to a variable and perform the subtraction outside of the lock.
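The suggested pattern, copy under the lock and subtract outside it, might look like this sketch (simplified stand-ins for the Replica and MVCCStats types):

```go
package main

import (
	"fmt"
	"sync"
)

type stats struct{ LiveBytes, KeyCount int64 }

func (s *stats) Subtract(o stats) {
	s.LiveBytes -= o.LiveBytes
	s.KeyCount -= o.KeyCount
}

type replica struct {
	mu struct {
		sync.Mutex
		stats stats
	}
}

// applyStats installs the new stats under lock but performs the delta
// computation outside the critical section: copy the old value while
// holding mu, then subtract after unlocking.
func (r *replica) applyStats(newStats stats) stats {
	r.mu.Lock()
	prev := r.mu.stats // cheap struct copy under the lock
	r.mu.stats = newStats
	r.mu.Unlock()

	delta := newStats
	delta.Subtract(prev) // no lock held here
	return delta
}

func main() {
	var r replica
	r.mu.stats = stats{LiveBytes: 100, KeyCount: 10}
	d := r.applyStats(stats{LiveBytes: 150, KeyCount: 12})
	fmt.Println(d.LiveBytes, d.KeyCount) // 50 2
}
```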
pkg/storage/replica_application.go, line 356 at r1 (raw file):
// before notifying a potentially waiting client.
clearTrivialReplicatedEvalResultFields(b.replicaState.UsingAppliedStateKey, rResult)
if isNonTrivial && !r.handleComplexReplicatedEvalResult(s.ctx, *rResult) {
Did you consider passing rResult
as a pointer here, letting handleComplexReplicatedEvalResult
modify it, and then performing the rResult.Equal(storagepb.ReplicatedEvalResult{})
for both trivial and non-trivial commands in the same place?
pkg/storage/replica_application.go, line 395 at r1 (raw file):
}
}
stray line
pkg/storage/replica_application.go, line 403 at r1 (raw file):
}
// handleCommittedEntriesRaftMuLocked deals with the complexities involved in
Consider adding the term "replicated state machine" in here somewhere. This is fundamentally the entry point for which all updates to the state machine are made.
pkg/storage/replica_application.go, line 403 at r1 (raw file):
}
// handleCommittedEntriesRaftMuLocked deals with the complexities involved in
nit: this function drives the rest of these functions, so I'd move it up above applyBatch
. In general, I think this file would be easiest to read if everything was in the order that it is called during entry application, mod type definitions and methods (putEntryApplicationBatch
and up), which make sense where they are.
pkg/storage/replica_application.go, line 418 at r1 (raw file):
// their ReplicatedEvalResult applied to the batch's view of the ReplicaState.
// Finally the batch is written to the storage engine and its side effects on
// the Replicas state are applied.
Replica's
pkg/storage/replica_application.go, line 419 at r1 (raw file):
// Finally the batch is written to the storage engine and its side effects on
// the Replicas state are applied.
func (r *Replica) handleCommittedEntriesRaftMuLocked(
Add to the comment about the stats
and refreshReason
arguments.
pkg/storage/replica_application.go, line 438 at r1 (raw file):
// state.
b := getEntryApplicationBatch()
defer putEntryApplicationBatch(b)
s/putEntryApplicationBatch/releaseEntryApplicationBatch/
pkg/storage/replica_application.go, line 441 at r1 (raw file):
// Decode commands into the batch until we find a non-trivial one.
// When we find a non-trivial one, unpush it, set committed entries to be that one next.
This comment is confusing. What do you mean by "unpush" it? What does "committed entries to be that one next" mean?
pkg/storage/replica_application.go, line 451 at r1 (raw file):
// entries in the current batch then we'll unpush it, apply the batch and
// then push it back.
s := b.appStates.pushBack()
"pushBack" doesn't seem like the right name for this method because it's not pushing anything. Something like allocateBack
pkg/storage/replica_application.go, line 485 at r1 (raw file):
// batch.
if !isTrivial(b.replicaState.UsingAppliedStateKey, &s.raftCmd.ReplicatedEvalResult) {
// If there are no other entries already in the batch then we pop this
I think part of the reason why this code gets tricky is because we're defining a non-trivial command as one that both a) relies on in-memory state matching durable state and b) performs side-effects that all later commands need to observe during their application. In other words, we're defining a non-trivial command as one that needs to be applied in its own batch without any other entries before or after it.
I'm wondering if this is overly restrictive. My intuition would be that we could probably get away with having non-trivial commands be part of a batch as long as they are the last command in the batch. This construction would dramatically simplify the code structure here because it would turn the pushing and unpushing process into a simple process of: 1. batch as much as possible, 2: flush after each non-trivial command is added to the batch, 3. flush again at the end
. It would also allow us to think of each batch as having n "trivial side effects" and one "complex side effect", which is easy to understand and to structure in the code.
This would require reworking "part a" of the definition of a non-trivial command, but I suspect that it's not quite right anyway. The non-trivial commands don't require in-memory state to match durable state before they apply, just after when performing their side effects, right?
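The flush-on-non-trivial scheme proposed above can be sketched as follows (toy types, not the actual Replica code): batch as much as possible, flush whenever a non-trivial command becomes the last command in the batch, and flush once more at the end.

```go
package main

import "fmt"

type cmd struct {
	id      int
	trivial bool
}

// applyCommitted batches trivial commands and flushes after each
// non-trivial command, which is allowed to be the *last* command in
// its batch. Each batch then has n trivial side effects and at most
// one complex side effect.
func applyCommitted(cmds []cmd, flush func(batch []cmd)) {
	var batch []cmd
	for _, c := range cmds {
		batch = append(batch, c)
		if !c.trivial {
			flush(batch) // non-trivial command closes out the batch
			batch = batch[:0]
		}
	}
	if len(batch) > 0 {
		flush(batch) // trailing trivial commands
	}
}

func main() {
	cmds := []cmd{{1, true}, {2, true}, {3, false}, {4, true}, {5, true}}
	applyCommitted(cmds, func(b []cmd) {
		fmt.Println("flush", len(b), "cmds")
	})
	// flush 3 cmds
	// flush 2 cmds
}
```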
pkg/storage/replica_application.go, line 518 at r1 (raw file):
}
r.mu.Unlock()
b.replicaState.Stats = &b.stats
nit: This isn't performing an allocation, so there's no need to move it outside of the critical section if doing so hurts readability. I'd suggest doing something simpler like:
retreiveLocalProposals := func() {
var it applicationStateBufIterator
r.mu.Lock()
defer r.mu.Unlock()
b.replicaState = r.mu.state
b.replicaState.Stats = &b.stats
*b.replicaState.Stats = *r.mu.state.Stats
pkg/storage/replica_application.go, line 537 at r1 (raw file):
retreiveLocalProposals()
b.batch = r.store.engine.NewBatch()
processBatch()
processBatch
and applyBatch
sound basically the same. Consider s/processBatch/stageBatch/
.
pkg/storage/replica_application.go, line 552 at r1 (raw file):
and it is checked whether the request's key span is contained in the Replica's
This part isn't true because we pass roachpb.RSpan{}
to requestCanProceed
.
pkg/storage/replica_application.go, line 558 at r1 (raw file):
violations of the LeaseAppliedIndex (in which the caller tries again)
The caller tries again? Don't we do that below Raft with tryReproposeWithNewLeaseIndex
?
pkg/storage/replica_application.go, line 560 at r1 (raw file):
should be applied
That's done by this function. This gives the impression that it's the caller's responsibility.
pkg/storage/replica_application.go, line 944 at r1 (raw file):
return truncatedStateUpdated
stray line
pkg/storage/replica_application.go, line 1047 at r1 (raw file):
provides is performed
The wording is off here.
pkg/storage/replica_application.go, line 1173 at r1 (raw file):
}
return s.raftCmd.ReplicatedEvalResult.ChangeReplicas != nil
This is a strange return value. Why doesn't the caller use the applicationState
that it already has in scope?
pkg/storage/replica_application.go, line 1181 at r1 (raw file):
// handled by the caller.
// The returned ReplicatedEvalResult replaces the caller's.
func (r *Replica) applyRaftCommand(
s/applyRaftCommand/applyRaftCommandToBatch/
Also, the ordering of functions/methods comment applies here as well.
pkg/storage/replica_application.go, line 1235 at r1 (raw file):
// The only remaining use of the batch is for range-local keys which we know
// have not been previously written within this batch.
writer := batch.Distinct()
What did we determine about this? Should we do it at all or is it a pessimization?
pkg/storage/replica_raft.go, line 683 at r1 (raw file):
if len(rd.CommittedEntries) > 0 {
var expl string
expl, err = r.handleCommittedEntriesRaftMuLocked(ctx, rd.CommittedEntries, &stats, &refreshReason)
Can you double check that this isn't causing stats
or refreshReason
to escape to the heap?
pkg/storage/track_raft_protos.go, line 37 at r1 (raw file):
func TrackRaftProtos() func() []reflect.Type {
// Grab the name of the function that roots all raft operations.
processRaftFunc := funcName((*Replica).stageRaftCommand)
s/processRaftFunc/stageRaftCommand/
Force-pushed 006f72e to ddc446d.
TFTR! Sorry for the mess
My understanding was that they would be collapsed when compacting from the mem-table to L0, not from L0 to L1. Am I confused?
You are correct, I was incorrect. Wishful thinking.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/storage/application_state_buf.go, line 21 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
s/applicationStateBufSize/applicationStateBufChunkSize/
Went with entryApplicationStateBufNodeSize
pkg/storage/replica_application.go, line 37 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
s/an entry/a single entry/
Done.
pkg/storage/replica_application.go, line 39 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Enumerating the steps here would be helpful. Or maybe add a bit of pseudo code like we have in #37426.
I added the pseudocode to handleCommittedEntriesRaftMuLocked,
pkg/storage/replica_application.go, line 46 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: something about the name
applicationState
seems particularly overloaded to me in the world of computer science. I know you're using "application" to mean Raft entry application, but it still jumps out at me after reviewing this entire PR. It also doesn't do a good job conveying that this is per-entry state and nothing more. Did you consider any other names? Maybe:
entryApplication
entryApplicationState
entAppliedState
entAppState
It was all entryApplication...
and then I did a big s/entryApplication/application/g
though I clearly even typed some layer and missed (put|get)EntryApplicationBatch
. I wasn't happy with either of them but I'm happy to go back.
pkg/storage/replica_application.go, line 54 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Remove this line so that it's clear that you're just referring to this group of four fields.
I do like that you added this comment and grouped the fields like this though. It would be helpful to a reader of this code if all of the groups of fields were commented as well. For instance, the next group would be
// The below fields are populated on the proposing Replica only and come from the Replica's proposal map.
Added more structure and comments here generally.
pkg/storage/replica_application.go, line 60 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Consider making this a method to reduce the footprint of the
applicationStateBuf
.
Done.
pkg/storage/replica_application.go, line 67 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: s/lResult/localResult/
Done.
pkg/storage/replica_application.go, line 81 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: move the
*storagepb.ReplicatedEvalResult
up because that's the more important argument.
Done.
pkg/storage/replica_application.go, line 94 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I would have expected this to be
(!usingStateAppliedKey || r.State.UsingAppliedStateKey)
. That's the condition where we upgrade to using the applied state key, right?Also, it would almost certainly be ok to consider all commands during the applied state key upgrade non-trivial if that simplifies the code. I think that would allow us to get rid of the
usingStateAppliedKey
arg.
I added a comment. This logic does consider the upgrade to be non-trivial. It just considers not upgrading to be trivial.
pkg/storage/replica_application.go, line 102 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
It seems like this could be merged with
initialize
to simplify the usage ofapplicationState
.
Done.
pkg/storage/replica_application.go, line 149 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
This sentence seemed to end abruptly.
It does indeed. Fixed.
pkg/storage/replica_application.go, line 154 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Why are we allowing multiple Raft log truncations in a single batch instead of making them non-trivial commands?
Why not? Their side-effects should coalesce nicely, and the handling just takes the latest one. I was mostly motivated by this comment:
// Raft log truncation is too frequent to justify a replica state
// assertion.
We did discuss this offline and I'm open to making this change but will at least defer it to the next revision.
pkg/storage/replica_application.go, line 185 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
This function seems to belong right above
handleComplexReplicatedEvalResult
.
Done.
pkg/storage/replica_application.go, line 190 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
It doesn't trigger an assertion anymore.
It does trigger an assert, just not at the end of the method. Updated.
pkg/storage/replica_application.go, line 237 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Reminder to address this TODO.
Done.
pkg/storage/replica_application.go, line 242 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Might as well leave a note about why we don't need to sync here.
Done.
pkg/storage/replica_application.go, line 296 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: you can assign
*r.mu.state.Stats
to a variable and perform the subtraction outside of the lock.
Done.
pkg/storage/replica_application.go, line 356 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Did you consider passing
rResult
as a pointer here, lettinghandleComplexReplicatedEvalResult
modify it, and then performing therResult.Equal(storagepb.ReplicatedEvalResult{})
for both trivial and non-trivial commands in the same place?
I did, there is code downstream of this which assumes that rResult hasn't been modified. I could re-work it but for now this seems easier. If after the next pass we think it's worth the change I can do it. Namely in finishRaftCommand there are some assertions and then also when looking to decide whether we aborted the config change.
pkg/storage/replica_application.go, line 403 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: this function drives the rest of these functions, so I'd move it up above
applyBatch
. In general, I think this file would be easiest to read if everything was in the order that it is called during entry application, mod type definitions and methods (putEntryApplicationBatch
and up), which make sense where they are.
I've reordered the function to make this file flow better, see how you feel about it now.
pkg/storage/replica_application.go, line 403 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Consider adding the term "replicated state machine" in here somewhere. This is fundamentally the entry point for which all updates to the state machine are made.
Done.
pkg/storage/replica_application.go, line 418 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Replica's
Done.
pkg/storage/replica_application.go, line 419 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Add to the comment about the
stats
andrefreshReason
arguments.
Done.
pkg/storage/replica_application.go, line 438 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
s/putEntryApplicationBatch/releaseEntryApplicationBatch/
Done.
pkg/storage/replica_application.go, line 441 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
This comment is confusing. What do you mean by "unpush" it? What does "committed entries to be that one next" mean?
No longer applies.
pkg/storage/replica_application.go, line 451 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
"pushBack" doesn't seem like the right name for this method because it's not pushing anything. Something like
allocateBack
Done.
pkg/storage/replica_application.go, line 485 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I think part of the reason why this code gets tricky is because we're defining a non-trivial command as one that both a) relies on in-memory state matching durable state and b) performs side-effects that all later commands need to observe during their application. In other words, we're defining a non-trivial command as one that needs to be applied in its own batch without any other entries before or after it.
I'm wondering if this is overly restrictive. My intuition would be that we could probably get away with having non-trivial commands be part of a batch as long as they are the last command in the batch. This construction would dramatically simplify the code structure here because it would turn the pushing and unpushing process into a simple process of:
1. batch as much as possible, 2: flush after each non-trivial command is added to the batch, 3. flush again at the end
. It would also allow us to think of each batch as having n "trivial side effects" and one "complex side effect", which is easy to understand and to structure in the code.This would require reworking "part a" of the definition of a non-trivial command, but I suspect that it's not quite right anyway. The non-trivial commands don't require in-memory state to match durable state before they apply, just after when performing their side effects, right?
Done.
pkg/storage/replica_application.go, line 518 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: This isn't performing an allocation, so there's no need to move it outside of the critical section if doing so hurts readability. I'd suggest doing something simpler like:
retreiveLocalProposals := func() {
var it applicationStateBufIterator
r.mu.Lock()
defer r.mu.Unlock()
b.replicaState = r.mu.state
b.replicaState.Stats = &b.stats
*b.replicaState.Stats = *r.mu.state.Stats
Done.
pkg/storage/replica_application.go, line 537 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
processBatch
andapplyBatch
sound basically the same. Considers/processBatch/stageBatch/
.
I just inlined the closure as it wasn't providing a lot of additional clarity.
pkg/storage/replica_application.go, line 552 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
and it is checked whether the request's key span is contained in the Replica's
This part isn't true because we pass
roachpb.RSpan{}
torequestCanProceed
.
It's a stale comment, but I updated it nevertheless.
pkg/storage/replica_application.go, line 558 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
violations of the LeaseAppliedIndex (in which the caller tries again)
The caller tries again? Don't we do that below Raft with
tryReproposeWithNewLeaseIndex
?
Stale comment, updated.
pkg/storage/replica_application.go, line 560 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
should be applied
That's done by this function. This gives the impression that it's the caller's responsibility.
Updated.
pkg/storage/replica_application.go, line 944 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
stray line
Done.
pkg/storage/replica_application.go, line 1047 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
provides is performed
The wording is off here.
Done.
pkg/storage/replica_application.go, line 1173 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
This is a strange return value. Why doesn't the caller use the
applicationState
that it already has in scope?
This is old logic where this returned straight up to the loop in handleRaftReady. I've now hauled the check up.
pkg/storage/replica_application.go, line 1181 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
s/applyRaftCommand/applyRaftCommandToBatch/
Also, the ordering of functions/methods comment applies here as well.
Done.
pkg/storage/replica_application.go, line 1235 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
What did we determine about this? Should we do it at all or is it a pessimization?
Seems like it's a draw.
name old ops/s new ops/s delta
KV0-throughput 32.7k ± 3% 32.7k ± 2% ~ (p=1.000 n=4+4)
name old ms/s new ms/s delta
KV0-P50 5.95 ± 3% 5.95 ± 3% ~ (p=1.000 n=4+4)
KV0-Avg 3.77 ± 3% 3.82 ± 2% ~ (p=0.629 n=4+4)
I did try just sharing a single batch.Distinct()
after implementing the ApplyBatchRepr
method on the distinctBatch
because it seemed easier to try it than to read the code to figure out if anybody does any reads below here. It showed a win but failed a server test, I figured I could go explore that later. Note that the one with a shared Distinct()
is old in the benchmark.
name old ops/s new ops/s delta
KV0-throughput 33.5k ± 1% 32.7k ± 1% -2.41% (p=0.029 n=4+4)
name old ms/s new ms/s delta
KV0-P50 5.80 ± 0% 5.95 ± 3% ~ (p=0.143 n=4+4)
KV0-Avg 3.70 ± 0% 3.80 ± 0% +2.70% (p=0.029 n=4+4)
pkg/storage/replica_raft.go, line 683 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Can you double check that this isn't causing
stats
orrefreshReason
to escape to the heap?
./replica_raft.go:687:78: (*Replica).handleRaftReadyRaftMuLocked &stats does not escape
but weirdly enough
./replica_raft.go:687:86: &refreshReason escapes to heap
./replica_raft.go:446:2: moved to heap: refreshReason
Fixed.
pkg/storage/track_raft_protos.go, line 37 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
s/processRaftFunc/stageRaftCommand/
Done.
Force-pushed 880122d to b65d72c.
Reviewed 4 of 7 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)
pkg/storage/entry_application_state_buf.go, line 90 at r2 (raw file):
}
// entryApplicationStateRingBuf is a ring-buffer of entryApplicationState.
The ring-buffer structure here isn't necessary anymore. Should we keep it at the cost of added complexity? I think we can get rid of the entire entryApplicationStateRingBuf
type if we want.
pkg/storage/replica_application.go, line 154 at r1 (raw file):
Previously, ajwerner wrote…
Why not? Their side-effect should coalesce nicely and is handled to just take latest one. I was mostly motivated by this comment:
// Raft log truncation is too frequent to justify a replica state
// assertion.
We did discuss this offline and I'm open to making this change but will at least defer it to the next revision.
We discussed this offline. The conclusion was that unlike all of the other side-effects, this one modifies the write batch directly so it doesn't fit into the trivial/non-trivial split. I wonder whether that's worth changing to simplify this code structure. If not, we should add a comment explaining why this is different than all of the other non-trivial side effects.
pkg/storage/replica_application.go, line 356 at r1 (raw file):
Previously, ajwerner wrote…
I did, there is code downstream of this which assumes that rResult hasn't been modified. I could re-work it but for now this seems easier. If after the next pass we think it's worth the change I can do it. Namely in finishRaftCommand there are some assertions and then also when looking to decide whether we aborted the config change.
Yeah I'd like to come back to this, but it can wait until everything else settles down.
pkg/storage/replica_application.go, line 485 at r1 (raw file):
Previously, ajwerner wrote…
Done.
Much better, thanks.
pkg/storage/replica_application.go, line 1047 at r1 (raw file):
Previously, ajwerner wrote…
Done.
prepareLocalResultis
missing a space
pkg/storage/replica_application.go, line 1173 at r1 (raw file):
Previously, ajwerner wrote…
This is old logic where this returned straight up to the loop in handleRaftReady. I've now hauled the check up.
Looks like we're still returning this though.
pkg/storage/replica_application.go, line 1235 at r1 (raw file):
I did try just sharing a single batch.Distinct() after implementing the ApplyBatchRepr method on the distinctBatch because it seemed easier to try it than to read the code to figure out if anybody does any reads below here. It showed a win but failed a server test, I figured I could go explore that later.
That might be because some non-trivial entries need to read from the current batch. It seems worthwhile to explore this at some point.
pkg/storage/replica_application.go, line 53 at r2 (raw file):
// The pseudocode looks like:
//
// while entries:
Thanks for this, it's great!
pkg/storage/replica_application.go, line 80 at r2 (raw file):
ctx context.Context, committedEntries []raftpb.Entry, stats *handleRaftReadyStats,
I just noticed that this stats object only contains information about entry application at the moment. How do you feel about lifting it into here entirely? That prevents us from adding non-application related stats, so maybe that's not ideal. If not, how do you feel about creating a handleCommittedEntriesStats
object and returning it from here?
pkg/storage/replica_application.go, line 310 at r2 (raw file):
// values) here. If the key range we are ingesting into isn't empty,
// we're not using AddSSTable but a plain WriteBatch.
if s.rResult().AddSSTable != nil {
Do you mind verifying that each of these pre-apply effects still works correctly now that we allow trivial entries to precede a non-trivial entry? I think the only change we'll need to make is to adjust preDestroyRaftMuLocked
to use the current batch instead of tmpBatch
, but it would be nice to verify all of them.
pkg/storage/replica_application.go, line 718 at r2 (raw file):
// not modify rResult in order to give the TestingPostApplyFilter testing knob
// an opportunity to inspect the command's ReplicatedEvalResult.
func applyTrivialReplicatedEvalResult(
Should this be stageTrivialReplicatedEvalResult
because it's not modifying the real replicated state?
pkg/storage/replica_application.go, line 741 at r2 (raw file):
Upon success
or error
pkg/storage/replica_application.go, line 780 at r2 (raw file):
// Now that the batch is committed we can go about applying the side effects
// of the update to the truncated state. Note that this is safe only if the
// new truncated state is durably on disk (i.e.) synced.
Consider linking to the new issue you just filed.
pkg/storage/replica_application.go, line 826 at r2 (raw file):
s.rResult().RaftLogDelta = 0
}
applyRaftLogDelta(-1 * truncationDelta)
This is subtle. Are you favoring this over marking each entry's RaftLogDelta as 0 like we currently do? Could you add a comment explaining what's going on here?
pkg/storage/replica_application.go, line 977 at r2 (raw file):
// in-memory and on-disk ReplicaStates. If any of these actions are present,
// we want to assert that these two states do not diverge.
shouldAssert = !rResult.Equal(storagepb.ReplicatedEvalResult{})
Is this ever false now?
pkg/storage/replica_application.go, line 1238 at r2 (raw file):
}

func (s *entryApplicationState) rResult() *storagepb.ReplicatedEvalResult {
nit: we swapped from lResult to localResult but then added this rResult method. We should keep them symmetrical. lResult and rResult are ok, although I find that I never remember what they mean when skimming this code. I think changing this method to replicatedResult would be the best solution.
pkg/storage/replica_application.go, line 1246 at r2 (raw file):
}

// decode initialized the entryApplicationState with ctx and e and then
initializes
pkg/storage/replica_application.go, line 1250 at r2 (raw file):
func (s *entryApplicationState) decode(
	ctx context.Context, e *raftpb.Entry,
) (expl string, err error) {
nit: consider s/expl/errExpl/
here and below.
pkg/storage/replica_application.go, line 1342 at r2 (raw file):
	ctx context.Context, toProcess []raftpb.Entry,
) (sawEmptyEntry bool, remaining []raftpb.Entry, expl string, err error) {
	for i := range toProcess {
nit: mutating the toProcess
slice directly would make this a little cleaner. Something like:
for len(toProcess) > 0 {
e := &toProcess[0]
toProcess = toProcess[1:]
...
if !isTrivial(...) {
break
}
}
return toProcess
pkg/storage/replica_application.go, line 1363 at r2 (raw file):
which implies that it can be applied in a batch with other trivial commands
This is a little off now. So is the "A result is fundamentally considered "trivial" if it does ..." sentence.
pkg/storage/replica_application.go, line 1392 at r2 (raw file):
b.batch = nil
b.updatedTruncatedState = false
// TODO(ajwerner): could clear the replicaState and stats but it's not needed
The replicaState contains a number of nested pointers, so it's probably worth clearing that out.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)
pkg/storage/replica_application.go, line 537 at r1 (raw file):
Previously, ajwerner wrote…
I just inlined the closure as it wasn't providing a lot of additional clarity.
s/retreive/retrieve/
b65d72c
to
715fd6d
Compare
Thanks for the review. This is getting there. The biggest outstanding thing is the preDestroyRaftMuLocked
comment.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/storage/entry_application_state_buf.go, line 90 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
The ring-buffer structure here isn't necessary anymore. Should we keep it at the cost of added complexity? I think we can get rid of the entire
entryApplicationStateRingBuf
type if we want.
Done.
pkg/storage/replica_application.go, line 154 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
We discussed this offline. The conclusion was that unlike all of the other side-effects, this one modifies the write batch directly so it doesn't fit into the trivial/non-trivial split. I wonder whether that's worth changing to simplify this code structure. If not, we should add a comment explaining why this is different than all of the other non-trivial side effects.
Another point is that truncated state side-effects don't have a trigger that needs to view the replica's in-memory state. The fact that the command touches ReplicaState doesn't make it non-trivial per se. All commands update the MVCCStats and lease applied index.
Anyway, added more commentary and a TODO
pkg/storage/replica_application.go, line 356 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Yeah I'd like to come back to this, but it can wait until everything else settles down.
Okay, I've cleaned this up a tad more.
pkg/storage/replica_application.go, line 1047 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
prepareLocalResultis
missing a space
Done.
pkg/storage/replica_application.go, line 1173 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Looks like we're still returning this though.
Done.
pkg/storage/replica_application.go, line 1235 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I did try just sharing a single batch.Distinct() after implementing the ApplyBatchRepr method on the distinctBatch because it seemed easier to try it than to read the code to figure out if anybody does any reads below here. It showed a win but failed a server test, I figured I could go explore that later.
That might be because some non-trivial entries need to read from the current batch. It seems worthwhile to explore this at some point.
Ack.
pkg/storage/replica_application.go, line 80 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I just noticed that this stats object only contains information about entry application at the moment. How do you feel about lifting it into here entirely? That prevents us from adding non-application related stats, so maybe that's not ideal. If not, how do you feel about creating a
handleCommittedEntriesStats
object and returning it from here?
Done.
pkg/storage/replica_application.go, line 288 at r2 (raw file):
}

// Update the node clock with the serviced request. This maintains
This can get coalesced per batch right? I've added a TODO
pkg/storage/replica_application.go, line 310 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Do you mind verifying that each of these pre-apply effects still works correctly now that we allow trivial entries to precede a non-trivial entry? I think the only change we'll need to make is to adjust preDestroyRaftMuLocked to use the current batch instead of tmpBatch
, but it would be nice to verify all of them.
This is actually tricky for the preDestroyRaftMuLocked case. The old logic applies the current entry's writeBatch to the tmpBatch. When this is called we haven't yet written the current entry's writeBatch to the batch. In order to give an equivalent view we'd need to create another tmpBatch and write the current batch's repr into it and then write this writeBatch's repr into it. That sounds rather expensive. We may want to rework this entirely.
As for split's use of tmpBatch, I think it's safe and added a comment.
The other checks I believe are safe because they only access fields on the replica updated by non-trivial commands.
pkg/storage/replica_application.go, line 718 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Should this be
stageTrivialReplicatedEvalResult
because it's not modifying the real replicated state?
Done. Also fixed some grammar.
pkg/storage/replica_application.go, line 741 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Upon success
or error
Done.
pkg/storage/replica_application.go, line 780 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Consider linking to the new issue you just filed.
Done.
pkg/storage/replica_application.go, line 826 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
This is subtle. Are you favoring this over marking each entry's RaftLogDelta as 0 like we currently do? Could you add a comment explaining what's going on here?
I'm not sure I totally follow what you mean w/ regards to marking each entry's RaftLogDelta as 0. I believe that logic remains the same. This last call here is about making sure that the change in the raft log size due to sideloaded commands is properly reflected. One source of confusion was the name truncationDelta
which is actually the bytes removed by truncating the sideloaded entries. I've renamed the variable for clarity.
Do you have more concerns?
pkg/storage/replica_application.go, line 977 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Is this ever false now?
No, I've made this an assertion.
pkg/storage/replica_application.go, line 1238 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: we swapped from lResult to localResult but then added this rResult method. We should keep them symmetrical. lResult and rResult are ok, although I find that I never remember what they mean when skimming this code. I think changing this method to replicatedResult would be the best solution.
Done.
pkg/storage/replica_application.go, line 1246 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
initializes
Done.
pkg/storage/replica_application.go, line 1250 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: consider
s/expl/errExpl/
here and below.
Done.
pkg/storage/replica_application.go, line 1342 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: mutating the toProcess slice directly would make this a little cleaner. Something like:
for len(toProcess) > 0 {
	e := &toProcess[0]
	toProcess = toProcess[1:]
	...
	if !isTrivial(...) {
		break
	}
}
return toProcess
Done.
pkg/storage/replica_application.go, line 1363 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
which implies that it can be applied in a batch with other trivial commands
This is a little off now. So is the "A result is fundamentally considered "trivial" if it does ..." sentence.
Updated somewhat, though I'm not sure I see how the "A result is fundamentally considered "trivial" if it does ..." sentence is off?
pkg/storage/replica_application.go, line 1392 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
The replicaState contains a number of nested pointers, so it's probably worth clearing that out.
Done.
715fd6d
to
27cf2b4
Compare
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/storage/replica_application.go, line 537 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
s/retreive/retrieve/
Done.
Reviewed 6 of 6 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)
pkg/storage/entry_application_state_buf.go, line 34 at r3 (raw file):
}

// entryApplicationStateRingBuf is a ring-buffer of entryApplicationState.
This comment is wrong.
pkg/storage/replica_application.go, line 288 at r2 (raw file):
Previously, ajwerner wrote…
This can get coalesced per batch right? I've added a TODO
Yes, although I think it needs to happen before we ack requests, so the early acknowledgment change will need to be careful about this.
pkg/storage/replica_application.go, line 310 at r2 (raw file):
We may want to rework this entirely.
What did you have in mind? I wonder if we could clear the batch completely.
pkg/storage/replica_application.go, line 826 at r2 (raw file):
Previously, ajwerner wrote…
I'm not sure I totally follow what you mean w/ regards to marking each entry's RaftLogDelta as 0. I believe that logic remains the same. This last call here is about making sure that the change in the raft log size due to sideloaded commands is properly reflected. One source of confusion was the name truncationDelta, which is actually the bytes removed by truncating the sideloaded entries. I've renamed the variable for clarity. Do you have more concerns?
Oh, I see. Ok that makes sense.
pkg/storage/replica_application.go, line 1250 at r2 (raw file):
Previously, ajwerner wrote…
Done.
Not for this change, but we could clean some of this (errExp, expl)
stuff up with an error wrapper type and the new errors.As
style.
https://github.com/cockroachdb/errors#providing-pii-free-details discusses an errors.WithSafeDetails
function, which might be exactly what we want.
I'd leave a TODO.
pkg/storage/replica_application.go, line 1363 at r2 (raw file):
Previously, ajwerner wrote…
Updated somewhat though I'm not sure I see how the
A result is fundamentally considered "trivial" if it does ...
sentence is off?
It looks good now.
27cf2b4
to
3b6194e
Compare
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/storage/entry_application_state_buf.go, line 34 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
This comment is wrong.
Fixed. Also just inlined all the Node methods into the buf itself. I don't think they added much now that the data structure is so much simpler.
pkg/storage/replica_application.go, line 288 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Yes, although I think it needs to happen before we ack requests, so the early acknowledgment change will need to be careful about this.
Sure, it can be done immediately after decoding a batch.
pkg/storage/replica_application.go, line 310 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
We may want to rework this entirely.
What did you have in mind? I wonder if we could clear the batch completely.
This code seems to only read from the batch to determine if it's going to delete enough keys to warrant a range deletion tombstone (https://sourcegraph.com/github.com/cockroachdb/cockroach/-/blob/pkg/storage/replica_raftstorage.go#L665). I was going to suggest we make this heuristic more explicit at this level and see if we can find some other indication, perhaps combining the existing heuristic with an MVCCStats delta.
On second thought it seems like it's probably okay to base the heuristic on the state of the range prior to this write batch (your initial suggestion) and add more comments. It's not clear to me how important it is to also observe the writes of this batch to the range. Thoughts?
pkg/storage/replica_application.go, line 1250 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Not for this change, but we could clean some of this (errExp, expl) stuff up with an error wrapper type and the new errors.As style. https://github.com/cockroachdb/errors#providing-pii-free-details discusses an errors.WithSafeDetails function, which might be exactly what we want. I'd leave a TODO.
Done.
3b6194e
to
3db9579
Compare
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/storage/replica_application.go, line 310 at r2 (raw file):
Previously, ajwerner wrote…
This code seems to only read from the batch to determine if it's going to delete enough keys to warrant a range deletion tombstone (https://sourcegraph.com/github.com/cockroachdb/cockroach/-/blob/pkg/storage/replica_raftstorage.go#L665). I was going to suggest we make this heuristic more explicit at this level and see if we can find some other indication, perhaps combining the existing heuristic with an MVCCStats delta.
On second thought it seems like it's probably okay to base the heuristic on the state of the range prior to this write batch (your initial suggestion) and add more comments. It's not clear to me how important it is to also observe the writes of this batch to the range. Thoughts?
Discussed offline. These state updates need to be atomic with the batch but they don't need to necessarily be part of the writeBatch that comes with the command. I moved the application of the entry's writeBatch up to precede these preapply calls. It all seems to work and avoids the inefficiency of the tmpBatch.
Reviewed 1 of 2 files at r4.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)
Just chatted with @nvanbenschoten and I'll also give this a full review. Won't get to it until tomorrow though.
No rush and thanks!
3db9579
to
d4c6d91
Compare
Phew, that was a chunk to get through but here I am.
The way you're structuring the batching is very well thought out and there's lots of good commentary that I'm sure wasn't there before. That is to say, even if this hadn't moved the needle on perf at all, it would have been worth it.
There's some code movement in this PR and perhaps reviewing would've been easier had you factored that out into a separate commit, but I know that this sort of thing constitutes a herculean feat in itself after a certain point, and not having the movement split out forced me to review more of the existing code in detail (again), which is good.
Still, obviously the change is quite complex and it's hard for me to come away from the PR feeling comfortable that everything is chipper. My most salient suggestion among all of the comments is to be more conservative w.r.t. what to batch: right now it's "stop the batch when you see something nontrivial"; I'd prefer moving to a "nontrivial entries get their own batch" model. I think I also found a place (AddSSTable) in which the former is incorrect, but even if that weren't the case, we just shouldn't be batching anything that doesn't move the perf needle, so that a) we get easier semantics and b) the mistakes we will certainly make in the future will be less likely to actually manifest in incorrect behavior. I'd also like to see some testing for b) that strikes when we add side effects without updating the black/whitelists/the test.
A fair share of the comments I left in this PR don't need to be addressed in the PR itself, so whatever you feel like leaving out or punting (including to someone else), that's fine by me.
Reviewed 1 of 7 files at r2, 1 of 6 files at r3, 1 of 2 files at r4, 4 of 4 files at r5.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner)
pkg/storage/entry_application_state_buf.go, line 43 at r5 (raw file):
first
pkg/storage/entry_application_state_buf.go, line 71 at r5 (raw file):
entryApplicationStateBufNodeSyncPool.Put(oldHead)
}
*buf.head = entryApplicationStateBufNode{}
I'm surprised that you keep one node around and manually remove it in destroy(); that feels icky. If there's a reason, add a comment; otherwise simplify.
pkg/storage/entry_application_state_buf.go, line 85 at r5 (raw file):
*buf = entryApplicationStateBuf{}
}
Add brief comments on all the things from here on down, and in particular on the return value of init.
pkg/storage/entry_application_state_buf.go, line 91 at r5 (raw file):
type entryApplicationStateBufIterator struct {
	idx int32
Comment on idx and offset. Isn't offset just idx % entryApplicationStateBufNodeSize? If so, consider simplifying; it'll be easier to understand with just one counter.
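To make the single-counter suggestion concrete, here is a rough, self-contained sketch of a pooled, chunked buffer in which len is the only counter and a position's offset within its chunk falls out of idx % chunkSize; all names are illustrative stand-ins, not the PR's types:

```go
package main

import (
	"fmt"
	"sync"
)

const chunkSize = 8 // stands in for entryApplicationStateBufNodeSize

type state struct{ index uint64 }

type chunk struct {
	entries [chunkSize]state
	next    *chunk
}

// chunkPool amortizes chunk allocations across batches, in the spirit
// of the sync.Pool-backed buffer nodes discussed above.
var chunkPool = sync.Pool{New: func() interface{} { return new(chunk) }}

// buf is a linked list of fixed-size chunks; len is the only counter.
type buf struct {
	len        int
	head, tail *chunk
}

// allocate returns a pointer to the next free slot, growing the list
// by one pooled chunk whenever the tail chunk is full.
func (b *buf) allocate() *state {
	if b.len%chunkSize == 0 { // tail is full, or the buffer is empty
		c := chunkPool.Get().(*chunk)
		*c = chunk{}
		if b.tail == nil {
			b.head = c
		} else {
			b.tail.next = c
		}
		b.tail = c
	}
	s := &b.tail.entries[b.len%chunkSize]
	b.len++
	return s
}

// at locates index idx: idx/chunkSize hops, then idx%chunkSize offset.
func (b *buf) at(idx int) *state {
	c := b.head
	for i := 0; i < idx/chunkSize; i++ {
		c = c.next
	}
	return &c.entries[idx%chunkSize]
}

// destroy returns all chunks to the pool and zeroes the buffer.
func (b *buf) destroy() {
	for c := b.head; c != nil; {
		next := c.next
		chunkPool.Put(c)
		c = next
	}
	*b = buf{}
}

func main() {
	var b buf
	for i := 0; i < 20; i++ {
		b.allocate().index = uint64(i)
	}
	fmt.Println(b.len, b.at(13).index)
	b.destroy()
}
```

An iterator over this layout then needs nothing but a single idx, since both the chunk hop count and the in-chunk offset are derived from it.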
pkg/storage/replica_application.go, line 38 at r5 (raw file):
type handleCommittedEntriesStats struct {
	processed int
Is this commands or batches? We should be able to tell from these stats whether there's a frequent command that is breaking up these batches (i.e. isn't isTrivial()). The caller already knows how many commands there were (len(committedEntries), so we don't necessarily need that), but the number of batches is probably interesting, along with the number of nontrivial commands (which is mostly known from the number of batches, but not completely). At a high level it's worth keeping both because we want to be able to see batches/sec as well as whether there's a significant source of "batch breaker" commands.
pkg/storage/replica_application.go, line 71 at r5 (raw file):
stages:
pkg/storage/replica_application.go, line 77 at r5 (raw file):
// reading the current replica state from underneath the Replica.mu and
// determining whether any of the commands were proposed locally. Next each
// command is written to the engine.Batch and have the "trivial" component of
s/have/has/
pkg/storage/replica_application.go, line 78 at r5 (raw file):
// determining whether any of the commands were proposed locally. Next each
// command is written to the engine.Batch and have the "trivial" component of
// their ReplicatedEvalResult applied to the batch's view of the ReplicaState.
s/their/its/
pkg/storage/replica_application.go, line 79 at r5 (raw file):
, its side effects on the Replica's state are applied, and the clients acked with the respective results.
pkg/storage/replica_application.go, line 80 at r5 (raw file):
// their ReplicatedEvalResult applied to the batch's view of the ReplicaState.
// Finally the batch is written to the storage engine and its side effects on
// the Replica's state are applied.
Good job on the comment!
pkg/storage/replica_application.go, line 80 at r5 (raw file):
// their ReplicatedEvalResult applied to the batch's view of the ReplicaState.
// Finally the batch is written to the storage engine and its side effects on
// the Replica's state are applied.
Mention that errors from this method are fatal. (They still are, right?)
pkg/storage/replica_application.go, line 85 at r5 (raw file):
) (
	stats handleCommittedEntriesStats,
	refreshReason refreshRaftReason,
At this point it's not clear to me why this receives a refreshReason, and even less why it returns a new one. Can you say something about this in a comment? There's probably a canonical situation in which this is necessary and it'll be helpful to prime readers for it.
I'm seeing below that this is only relevant when we see an empty entry, so how do you feel about just returning a bool, or even just adding a numEmptyEntries field in the stats, so that the refresh reason mangling can live at the caller? I feel that this would remove an unnecessary distraction from this method.
pkg/storage/replica_application.go, line 90 at r5 (raw file):
) {
	// TODO(ajwerner): replace errExpl with proper error composition using the new
	// errors library.
😍
pkg/storage/replica_application.go, line 125 at r5 (raw file):
	}
}
if err != nil {
Handle this right after err is populated; pretty sure that's only unintentionally not the case right now.
pkg/storage/replica_application.go, line 133 at r5 (raw file):
// engine.Batch and update b's view of the replicaState.
var it entryApplicationStateBufIterator
for ok := it.init(&b.appStates); ok; ok = it.next() {
When is ok
initially false
?
pkg/storage/replica_application.go, line 134 at r5 (raw file):
var it entryApplicationStateBufIterator
for ok := it.init(&b.appStates); ok; ok = it.next() {
	s := it.state()
state is pretty opaque, do you think we can come up with something better? This is basically a "context" for applying an entry. Maybe entryApplicationState could become applicationContext (or even just application) and the local variable here could be app. I don't know how I feel about the type renames I'm suggesting; maybe renaming s to app is enough here. Or even just cmd, since you're passing it to stageRaftCommand. The reason not to go with that would be if cmd is too ambiguous, but I quite like it. Or appCmd to make it clear that we're applying?
pkg/storage/replica_application.go, line 135 at r5 (raw file):
for ok := it.init(&b.appStates); ok; ok = it.next() {
	s := it.state()
	r.stageRaftCommand(s.ctx, s, b.batch, &b.replicaState, it.isLast())
it.isLast() /* writeAppliedState */
pkg/storage/replica_application.go, line 136 at r5 (raw file):
s := it.state()
r.stageRaftCommand(s.ctx, s, b.batch, &b.replicaState, it.isLast())
updatedTruncatedState := stageTrivialReplicatedEvalResult(s.ctx,
Put some words about why this is handled here, looks sort of one-offy and there's a good reason I'm sure.
pkg/storage/replica_application.go, line 155 at r5 (raw file):
*b.replicaState.Stats = *r.mu.state.Stats
var it entryApplicationStateBufIterator
for ok := it.init(&b.appStates); ok; ok = it.next() {
I was initially wondering why you did this iteration twice, but you're doing it because this is under the replica mutex.
pkg/storage/replica_application.go, line 165 at r5 (raw file):
// the same is true for quotaReleaseQueues. Only queue the proposal's
// quota for release if the proposalQuota is initialized.
if r.mu.proposalQuota != nil {
Probably not worth it, but you can hoist this nil check out of the loop.
pkg/storage/replica_application.go, line 181 at r5 (raw file):
// state machine, and lastly the GCThreshold is validated. If any of the checks
// fail, the proposal's content is wiped and we apply an empty log entry
// instead, returning an error to the caller to handle. The two typical cases
No error is returned to the caller, it will go to whoever is waiting on the proposal if proposed locally, and discarded otherwise.
pkg/storage/replica_application.go, line 190 at r5 (raw file):
The standard way in which this is triggered
This doesn't add much, consider something like
in this method via a side effect (in the proposal's ReplicatedEvalResult or, for local proposals, LocalEvalResult).
pkg/storage/replica_application.go, line 268 at r5 (raw file):
s.forcedErr = roachpb.NewError(err)
} else if splitMergeUnlock != nil {
	// Close over raftCmd to capture its value at execution time; we clear
This comment needs a touch up.
pkg/storage/replica_application.go, line 300 at r5 (raw file):
// a timestamp specified are guaranteed one higher than any op already
// executed for overlapping keys.
// TODO(ajwerner): coalesce the clock update per batch.
👍
pkg/storage/replica_application.go, line 337 at r5 (raw file):
// the SSTable. Not doing so could result in order reversal (and missing
// values) here. If the key range we are ingesting into isn't empty,
// we're not using AddSSTable but a plain WriteBatch.
This isn't true any more, see
cockroach/pkg/storage/batcheval/cmd_add_sstable.go
Lines 56 to 62 in c097a16
// *If* the keys in the SST do not conflict with keys currently in this range,
// then adding the stats for this SST to the range stats should yield the
// correct overall stats.
//
// *However*, if the keys in this range *do* overlap with keys already in this
// range, then adding the SST semantically *replaces*, rather than adds, those
// keys, and the net effect on the stats is not so simple.
pkg/storage/replica_application.go, line 382 at r5 (raw file):
}

func checkForcedErr(
Am I reviewing this or is this an exact move with adjustments only?
pkg/storage/replica_application.go, line 529 at r5 (raw file):
(which must be considered fatal)
pkg/storage/replica_application.go, line 537 at r5 (raw file):
	replicatedResult storagepb.ReplicatedEvalResult,
	raftAppliedIndex, leaseAppliedIndex uint64,
	writeBatch *storagepb.WriteBatch,
Maybe not for this PR, but passing a pointer here doesn't make much sense. WriteBatch contains only a []byte, so we can just pass it by value and then these writeBatch != nil checks just turn into len(writeBatch.Data) > 0. Or just pass a []byte in, that makes even more sense probably.
pkg/storage/replica_application.go, line 549 at r5 (raw file):
log.Errorf(ctx, "unable to read header of committed WriteBatch: %+v", err)
} else {
	r.writeStats.recordCount(float64(mutationCount), 0 /* nodeID */)
All of these contend on a mutex (only in the write path), so it might be worth hoisting this out too.
pkg/storage/replica_application.go, line 578 at r5 (raw file):
haveTruncatedState := replicatedResult.State != nil &&
	replicatedResult.State.TruncatedState != nil
if writeBatch != nil {
In particular, if we ever got a WriteBatch{} here, we'd fail because (I think) ApplyBatchRepr expects at least the magic header in the WriteBatch, so this check is better written as len(writeBatch.Data) > 0 once writeBatch itself is a value.
pkg/storage/replica_application.go, line 665 at r5 (raw file):
}

// TODO(peter): We did not close the writer in an earlier version of
This TODO is firmly in the "never gonna happen" department. Remove it.
pkg/storage/replica_application.go, line 698 at r5 (raw file):
elapsed := timeutil.Since(start)
r.store.metrics.RaftCommandCommitLatency.RecordValue(elapsed.Nanoseconds())
Rename to RaftCommandApplyLatency. I've been wondering what this number is more than once. (Hmm, this will rename the time series, but honestly it's still a good idea to do the rename.)
pkg/storage/replica_application.go, line 708 at r5 (raw file):
// not modify replicatedResult in order to give the TestingPostApplyFilter testing knob
// an opportunity to inspect the command's ReplicatedEvalResult.
func stageTrivialReplicatedEvalResult(
To me this method gives more ammunition to the argument that isTrivial should be a whitelist. Let's put this and isTrivial next to each other so that they're more obviously in sync, too.
Between those there's enough to pull out yet another file, I think. replica_application_result.go? This file has ~1.4k LOC and the side effects pull out sort of cleanly.
pkg/storage/replica_application.go, line 722 at r5 (raw file):
	replicaState.LeaseAppliedIndex = leaseAppliedIndex
}
haveState := replicatedResult.State != nil
Is there any real benefit to including this in the "trivial" results? What's in here that happens frequently enough to matter? (I generally think we ought to batch only the stuff that makes us fast, not any bit more than that).
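To illustrate what a whitelist-style check could look like, here is a minimal sketch: zero the always-allowed fields on a copy and require that nothing else is set, so a field added later defaults to non-trivial. The struct and its fields are toy stand-ins, not the real storagepb.ReplicatedEvalResult:

```go
package main

import "fmt"

// replicatedResult is a toy stand-in for storagepb.ReplicatedEvalResult.
type replicatedResult struct {
	Delta          int64 // MVCC stats delta: every command may set this
	RaftLogDelta   int64 // raft log size delta: also always allowed
	Split          *struct{}
	ChangeReplicas *struct{}
}

// isTrivial whitelists the always-allowed fields: zero them on a copy
// and require that nothing else is set. A field added later without
// updating the whitelist then defaults to non-trivial, which is the
// safe direction to fail in.
func isTrivial(r replicatedResult) bool {
	allowed := r
	allowed.Delta = 0
	allowed.RaftLogDelta = 0
	return allowed == (replicatedResult{})
}

func main() {
	fmt.Println(isTrivial(replicatedResult{Delta: 5, RaftLogDelta: -1}))
	fmt.Println(isTrivial(replicatedResult{Split: new(struct{})}))
}
```

The same zero-out-and-compare shape also keeps isTrivial and a stageTrivial... style function easy to audit side by side, since both enumerate exactly the whitelisted fields.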
pkg/storage/replica_application.go, line 775 at r5 (raw file):
truncState := b.replicaState.TruncatedState
r.store.raftEntryCache.Clear(r.RangeID, truncState.Index+1)
// Truncate the sideloaded storage. This is true at the time of writing but unfortunately
That second sentence is ... unexpected. Also, doesn't seem like the extra level of indentation below is needed.
pkg/storage/replica_application.go, line 810 at r5 (raw file):
// Finally apply the truncation delta.
// Store the queuing conditions
var it entryApplicationStateBufIterator
I'd rewrite this section as
combinedDelta := -sideloadTruncationDelta
for /* ... */ {
combinedDelta += s.replicatedResult().RaftLogDelta
s.replicatedResult().RaftLogDelta = 0
}
if combinedDelta != 0 {
/* inline the closure */
}
You were careful here because you didn't want mixing between the two contributions? That should be OK. If we trust our log size (and only then is it used for truncation decisions) the clamping oughtn't change the outcome.
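For illustration, the suggested rewrite can be sketched as a standalone function (`cmd` and `coalesceLogDeltas` are hypothetical names for this sketch; the real code works on the iterator's `replicatedResult()` as above):

```go
package main

import "fmt"

// cmd is a hypothetical stand-in for a command's replicated result,
// carrying only the raft log size delta relevant to this sketch.
type cmd struct{ raftLogDelta int64 }

// coalesceLogDeltas folds every command's delta (plus the sideloaded
// truncation delta) into one combined delta, zeroing each command's
// contribution as it is consumed.
func coalesceLogDeltas(cmds []cmd, sideloadTruncationDelta int64) int64 {
	combined := -sideloadTruncationDelta
	for i := range cmds {
		combined += cmds[i].raftLogDelta
		cmds[i].raftLogDelta = 0
	}
	return combined
}

func main() {
	cmds := []cmd{{raftLogDelta: -100}, {raftLogDelta: -50}}
	fmt.Println(coalesceLogDeltas(cmds, 25)) // -175
}
```

The point of the single pass is that the two contributions mix freely: only the combined delta is applied, once, after the loop.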
pkg/storage/replica_application.go, line 830 at r5 (raw file):
r.store.metrics.addMVCCStats(deltaStats)
// NB: the bootstrap store has a nil split queue.
// TODO(tbg): the above is probably a lie now.
I really should've just removed this and not made a lame TODO out in my name here and below. Alas.
pkg/storage/replica_application.go, line 850 at r5 (raw file):
s := it.state()
replicatedResult := s.replicatedResult()
for _, sc := range replicatedResult.SuggestedCompactions {
You could collect the suggestions in the RaftLogDelta iteration below. Not sure that's any better. It definitely isn't any clearer.
pkg/storage/replica_application.go, line 866 at r5 (raw file):
// Note that this must happen after committing (the engine.Batch), but
// before notifying a potentially waiting client.
clearTrivialReplicatedEvalResultFields(s.replicatedResult(), b.replicaState.UsingAppliedStateKey)
This method should also sit next to isTrivial and the other one I commented about (stageTrivialReplicatedEvalResult)?
pkg/storage/replica_application.go, line 877 at r5 (raw file):
// Some tests (TestRangeStatsInit) assumes that once the store has started
// and the first range has a lease that there will not be a later hard-state.
if isNonTrivialEntry {
If we move to a more conservative batching model, we may want to have another whitelist on what to assert. Though maybe this is OK. We should keep track of how many assertStateLocked calls we have (on stats).
pkg/storage/replica_application.go, line 916 at r5 (raw file):
result in
pkg/storage/replica_application.go, line 917 at r5 (raw file):
// ReplicaState. This function is called after a batch has been written to the
// storage engine. For trivial commands this function should result is a zero
// value replicatedResult.
nit: *replicaResult. Check this in race builds at the caller somewhere.
pkg/storage/replica_application.go, line 1045 at r5 (raw file):
// processing the removal from the queue looks up the Range at the
// lease holder, being too early here turns this into a no-op).
// Lock ordering dictates that we don't hold any mutexes when adding,
Comment is stale, the task is now built into the queues.
pkg/storage/replica_application.go, line 1187 at r5 (raw file):
}
// entryApplicationState stores the state required to apply a single raft
Move this and its receivers to replica_application_state.go?
pkg/storage/replica_application.go, line 1288 at r5 (raw file):
}
// entryApplicationBatch accumulates state due to the application of raft
Ditto replica_application_batch.go
pkg/storage/replica_application.go, line 1370 at r5 (raw file):
// side effects which rely on the written state of the replica exactly matching
// the in-memory state of the replica at the corresponding log position.
// Non-trivial commands must be the last entry in a batch so that after
From what I'm seeing below steady state traffic is always trivial, and I assume that'll stay this way no matter what we add, so it won't matter for performance and we should lower the cognitive burden as much as possible. Can we change it so that !isTrivial() implies "alone in batch", not just "last in batch"? It'll be nice to have the guarantee that no mixing whatsoever will occur.

For example, look at AddSSTable. We add the SST to the LSM before applying the command (we have to, the other way around doesn't work at all). But it means that the previous cmds in the batch shouldn't touch anything that the SST also touches, or the SST will lose precedence: the SST will be shadowed by writes in earlier commands. That's incorrect. On the other hand, we know inside of the AddSSTable command itself that we won't overlap with the WriteBatch of the command itself, because we know not to throw random stuff into the WriteBatch in cmd_add_sstable.go.

Also, what's the reason for using a blacklist? I'd much rather deal with commands accidentally not getting batched as opposed to adding something new, forgetting to add it here, and batching it in with unknown results. But either way, there needs to be a test (maybe you wrote it already) that fails whenever a new field is added, and forces an update to the test as a sign-off on awareness of isTrivial.
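A whitelist-style check can be made future-proof with a reflection-based guard, so that any field added later is non-trivial until it is explicitly allowlisted. A minimal sketch of that pattern (the `result` type and its fields are hypothetical stand-ins for ReplicatedEvalResult, not the real definition):

```go
package main

import (
	"fmt"
	"reflect"
)

// result is a hypothetical stand-in for ReplicatedEvalResult.
type result struct {
	WriteTimestamp int64
	Delta          int64
	AddSSTable     *struct{} // stand-in for a non-trivial side effect
}

// isTrivial is whitelist-style: zero out the fields known to be safe to
// batch, then require everything that remains to be zero. A newly added
// field is therefore non-trivial by default until someone consciously
// adds it to the whitelist here.
func isTrivial(r result) bool {
	allowlisted := r
	allowlisted.WriteTimestamp = 0
	allowlisted.Delta = 0
	return reflect.DeepEqual(allowlisted, result{})
}

func main() {
	fmt.Println(isTrivial(result{WriteTimestamp: 1, Delta: 2})) // true
	fmt.Println(isTrivial(result{AddSSTable: &struct{}{}}))     // false
}
```

A unit test over this function naturally doubles as the "sign-off" test: adding a field to the struct changes the zero-value comparison and forces an update.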
pkg/storage/replica_application.go, line 1395 at r5 (raw file):
}
func (b *entryApplicationBatch) reset() {
Can you write this as
appStates := b.appStates
appStates.clear()
*b = entryApplicationBatch{appStates: appStates}
so I don't have to be nervous that a field isn't cleared now or after any future change?
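As a runnable illustration of why this reset pattern is reassuring (`batch`, `buf`, and the field names are hypothetical stand-ins): re-assigning a zero struct literal clears every field, including any added later, while explicitly carrying over the pooled buffer.

```go
package main

import "fmt"

// buf is a hypothetical stand-in for the pooled appStates buffer.
type buf struct{ n int }

func (b *buf) clear() { b.n = 0 }

// batch mimics the reset pattern: the zero struct literal resets all
// fields, and only the reusable buffer is deliberately kept.
type batch struct {
	appStates *buf
	stale     int // any field added later is reset for free
}

func (b *batch) reset() {
	appStates := b.appStates
	appStates.clear()
	*b = batch{appStates: appStates}
}

func main() {
	b := &batch{appStates: &buf{n: 7}, stale: 42}
	b.reset()
	fmt.Println(b.appStates.n, b.stale) // 0 0
}
```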
pkg/storage/entry_application_state_buf_test.go, line 20 at r5 (raw file):
)
// TestApplicationStateBuf is an overly simplistic test of the
👀 does that imply that you're replacing this with a less overly simplistic one?
Oh, and I meant to say: great work! Delighted with how this is all shaping up.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner)
:) should have stopped and let you have a say before I went on polishing this. The initial version of this PR did exactly this, the buffer had a mechanism to store an extra entry as a sort of "lookahead". From Nathan:
I'm happy to go back to the previous approach but let's reach consensus before I make the change.
Heh, oops, sorry about that. @nvanbenschoten what do you think? I never saw the version of the code I suggested, but you have, so see what you think about the safety argument.
I think I'm ok with moving back to the "non-trivial entries must live alone" style, but I hope we don't regress to the lookahead mechanism we had before, along with the highly-specialized […]. I think we can do better by turning decoding into a two-stage process. We could first decode into a temporary […]. Getting the current style to work also allowed us to get rid of the […].
I think that this change is orthogonal. It raises a question about why the tmpBatch is there in the first place if we could avoid it by just applying the write batch first.
Reasonable, let me see what I can do.
d4c6d91 to 7716922 (Compare)
Thanks for the review! I suspect there's at least one more round left to deal with my renaming choices and the triviality of log truncation. Maybe also some cleanup on decodeBuf and decodedRaftEntry.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)
pkg/storage/entry_application_state_buf.go, line 43 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
first
Done.
pkg/storage/entry_application_state_buf.go, line 71 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I'm surprised that you keep one node around and manually remove it in destroy(), that feels icky. If there's a reason add a comment, otherwise simplify.
Done. This was a vestige from when the buf had a lookahead feature.
pkg/storage/entry_application_state_buf.go, line 85 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Add brief comments on all the things from here on down, and in particular on the return value of init.
Done.
pkg/storage/entry_application_state_buf.go, line 91 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Comment on idx and offset. Isn't offset just idx % entryApplicationStateBufNodeSize? If so consider simplifying, it'll be easier to understand with just one counter.
Done. A lot of this code got simpler over time, the nodes used to be ring buffers.
pkg/storage/replica_application.go, line 38 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Is this commands or batches? We should be able to tell from these stats whether there's a frequent command that is breaking up these batches (i.e. isn't isTrivial()). The caller already knows how many commands there were (len(committedEntries), so we don't necessarily need that), but the number of batches is probably interesting, along with the number of nontrivial commands (which is mostly known from the number of batches, but not completely). At a high level it's worth keeping both because we want to be able to see batches/sec as well as whether there's a significant source of "batch breaker" commands.
Added more fields.
pkg/storage/replica_application.go, line 71 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
stages:
Done.
pkg/storage/replica_application.go, line 77 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
s/have/has/
Done.
pkg/storage/replica_application.go, line 78 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
s/their/its/
Done.
pkg/storage/replica_application.go, line 79 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
, its side effects on the Replica's state is applied, and the clients acked with the respective results.
Done.
pkg/storage/replica_application.go, line 80 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Good job on the comment!
Thanks!
pkg/storage/replica_application.go, line 80 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Mention that errors from this method are fatal. (They still are, right)
Done.
pkg/storage/replica_application.go, line 85 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
At this point it's not clear to me why this receives a refreshReason, and even less why it returns a new one. Can you say something about this in a comment? There's probably a canonical situation in which this is necessary and it'll be helpful to prime readers for it.
I'm seeing below that this is only relevant when we see an empty entry, so how do you feel about just returning a bool or even just adding a numEmptyEntries field in the stats, so that the refresh reason mangling can live at the caller? I feel that this would remove an unnecessary distraction from this method.
Done.
pkg/storage/replica_application.go, line 125 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Handle this right after err is populated, pretty sure that's only unintentionally not the case right now.
Done.
pkg/storage/replica_application.go, line 133 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
When is ok initially false?
If the batch were empty. I commented the method. It probably is a distraction as the batch shouldn't be empty.
pkg/storage/replica_application.go, line 134 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
state
is pretty opaque, do you think we can come up with something better? This is basically a "context for applying an entry. Maybe
entryApplicationStatecould become
applicationContext(or even just
application) and the local variable here could be
app. I don't know how I feel about the type renames I'm suggesting, maybe renaming to
sto
appis enough here. Or even just
cmd, since you're passing it to
stageRaftCommand. The reason not to go with that would be if
cmdis too ambiguous, but I quite like it. Or
appCmd` to make it clear that we're applying?
I did a big rename, entryApplicationState.*
-> cmdAppCtx.*
and state()
->cur()
pkg/storage/replica_application.go, line 135 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
it.isLast() /* writeAppliedState */
Done.
pkg/storage/replica_application.go, line 136 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Put some words about why this is handled here, looks sort of one-offy and there's a good reason I'm sure.
Added some, could add more.
pkg/storage/replica_application.go, line 155 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I was initially wondering why you did this iteration twice, but you're doing it because this is under the replica mutex.
Ack.
pkg/storage/replica_application.go, line 165 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Probably not worth it, but you can hoist this nil check out of the loop.
Done.
pkg/storage/replica_application.go, line 181 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
No error is returned to the caller, it will go to whoever is waiting on the proposal if proposed locally, and discarded otherwise.
Done.
pkg/storage/replica_application.go, line 190 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
The standard way in which this is triggered
This doesn't add much, consider something like
in this method via a side effect (in the proposal's ReplicatedEvalResult or, for local proposals, LocalEvalResult).
Done.
pkg/storage/replica_application.go, line 268 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This comment needs a touch up.
Done.
pkg/storage/replica_application.go, line 337 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This isn't true any more, see
cockroach/pkg/storage/batcheval/cmd_add_sstable.go
Lines 56 to 62 in c097a16
// *If* the keys in the SST do not conflict with keys currently in this range,
// then adding the stats for this SST to the range stats should yield the
// correct overall stats.
//
// *However*, if the keys in this range *do* overlap with keys already in this
// range, then adding the SST semantically *replaces*, rather than adds, those
// keys, and the net effect on the stats is not so simple.
Updated and added an NB to try to state your concern with other entries in the batch preceding an AddSSTable.
pkg/storage/replica_application.go, line 382 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Am I reviewing this or is this an exact move with adjustments only?
It used to refer to Replica.mu.state and now it takes a ReplicaState argument, otherwise it's unchanged.
pkg/storage/replica_application.go, line 529 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
(which must be considered fatal)
s/is likely/is/
pkg/storage/replica_application.go, line 537 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Maybe not for this PR, but passing a pointer here doesn't make much sense. WriteBatch contains only a []byte, so we can just pass it by value and then these writeBatch != nil checks just turn into len(writeBatch.Data) > 0. Or just pass a []byte in, that makes even more sense probably.
I reworked this method to just take a cmdAppCtx, which simplifies some things.
pkg/storage/replica_application.go, line 549 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
All of these contend on a mutex (only in the write path), so it might be worth hoisting this out too.
Done.
pkg/storage/replica_application.go, line 578 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
In particular, if we ever got a WriteBatch{} here, we'd fail because (I think) ApplyBatchRepr expects at least the magic header in the WriteBatch, so this check is better written as len(writeBatch.Data) > 0 once writeBatch itself is a value.
I hear you but generally didn't find pulling writeBatchData out to be clearer now that cmd is here.
pkg/storage/replica_application.go, line 665 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This TODO is firmly in the "never gonna happen" department. Remove it.
Done.
pkg/storage/replica_application.go, line 698 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Rename to RaftCommandApplyLatency. I've been wondering what this number is more than once. (Hmm, this will rename the time series, but honestly it's still a good idea to do the rename)
That's tricky given that this chart is in the admin UI and we need to support mixed-version clusters. This is a weird metric that isn't going to be as useful now, so maybe we should change the admin UI to not have this chart, and then in 20.1 we can rename the metric? Added a TODO.
pkg/storage/replica_application.go, line 708 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
To me this method gives more ammunition to the argument that isTrivial should be a whitelist. Let's put this and isTrivial next to each other so that they're more obviously in sync, too. Between those there's enough to pull out yet another file, I think. replica_application_result.go? This file has ~1.4k LOC and the side effects pull out sort of cleanly.
Done. I moved those functions and also carved out the portion of applyBatch (now applyCmdAppBatch) that related to dealing with the final result of a command. See how it makes you feel.
pkg/storage/replica_application.go, line 722 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Is there any real benefit to including this in the "trivial" results? What's in here that happens frequently enough to matter? (I generally think we ought to batch only the stuff that makes us fast, not any bit more than that).
Only truncated state. Nathan and I also discussed making truncated state commands non-trivial and I'm still happy to do it. If we want to do it I'll do it on the next pass.
pkg/storage/replica_application.go, line 775 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
That second sentence is ... unexpected. Also, doesn't seem like the extra level of indentation below is needed.
Ugh, I hoisted most of that comment above but didn't clean up this trailing sentence.
pkg/storage/replica_application.go, line 810 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I'd rewrite this section as
combinedDelta := -sideloadTruncationDelta
for /* ... */ {
combinedDelta += s.replicatedResult().RaftLogDelta
s.replicatedResult().RaftLogDelta = 0
}
if combinedDelta != 0 {
/* inline the closure */
}
You were careful here because you didn't want mixing between the two contributions? That should be OK. If we trust our log size (and only then is it used for truncation decisions) the clamping oughtn't change the outcome.
Done.
pkg/storage/replica_application.go, line 850 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
You could collect the suggestions in the RaftLogDelta iteration below. Not sure that's any better. It definitely isn't any clearer.
I moved it to the loop below instead.
pkg/storage/replica_application.go, line 866 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This method should also sit next to isTrivial and the other one I commented about (stageTrivialReplicatedEvalResult)?
Done.
pkg/storage/replica_application.go, line 877 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
If we move to a more conservative batching model, we may want to have another whitelist on what to assert. Though maybe this is OK. We should keep track of how many assertStateLocked calls we have (on stats).
Ack.
pkg/storage/replica_application.go, line 916 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
result in
Done.
pkg/storage/replica_application.go, line 917 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
nit: *replicaResult. Check this in race builds at the caller somewhere.
huh?
pkg/storage/replica_application.go, line 1045 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Comment is stale, the task is now built into the queues.
Done.
pkg/storage/replica_application.go, line 1187 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Move this and its receivers to replica_application_state.go?
It's now in cmd_app_ctx.go; however, I feel less amazing about that filename.
pkg/storage/replica_application.go, line 1288 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Ditto replica_application_batch.go
Done.
pkg/storage/replica_application.go, line 1370 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
From what I'm seeing below steady state traffic is always trivial, and I assume that'll stay this way no matter what we add, so it won't matter for performance and we should lower the cognitive burden as much as possible. Can we change it so that !isTrivial() implies "alone in batch", not just "last in batch"? It'll be nice to have the guarantee that no mixing whatsoever will occur.

For example, look at AddSSTable. We add the SST to the LSM before applying the command (we have to, the other way around doesn't work at all). But it means that the previous cmds in the batch shouldn't touch anything that the SST also touches, or the SST will lose precedence: the SST will be shadowed by writes in earlier commands. That's incorrect. On the other hand, we know inside of the AddSSTable command itself that we won't overlap with the WriteBatch of the command itself, because we know not to throw random stuff into the WriteBatch in cmd_add_sstable.go.

Also, what's the reason for using a blacklist? I'd much rather deal with commands accidentally not getting batched as opposed to adding something new, forgetting to add it here, and batching it in with unknown results. But either way, there needs to be a test (maybe you wrote it already) that fails whenever a new field is added, and forces an update to the test as a sign-off on awareness of isTrivial.
Done.
pkg/storage/replica_application.go, line 1395 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Can you write this as
appStates := b.appStates
appStates.clear()
*b = entryApplicationBatch{appStates: appStates}
so I don't have to be nervous that a field isn't cleared now or after any future change?
Simplified but I need to hold on to the new decodeBuf.
pkg/storage/entry_application_state_buf_test.go, line 20 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
👀 does that imply that you're replacing this with a less overly simplistic one?
When I started typing it the data structure was more complicated than it ultimately needed to be. This test, small as it is, covers 100% of entry_application_state_buf.go. I added some more comments to make it seem less silly.
380c745 to a335ffb (Compare)
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)
pkg/storage/replica_application.go, line 38 at r5 (raw file):
Previously, ajwerner wrote…
Added more fields.
Though I didn't do anything with them. Probably I should add more metrics. Added a TODO.
pkg/storage/replica_application.go, line 633 at r6 (raw file):
cmd.replicatedResult().State.TruncatedState = nil
cmd.replicatedResult().RaftLogDelta = 0
// TODO(ajwerner): consider moving this code.
@tbg I'm curious to hear how you feel about this here. It's the only place we really reach down and muck with the Replica. If we hoisted this and then passed the raftMu.stateLoader, this wouldn't even need to be a method on Replica.
a335ffb to da52af6 (Compare)
Reviewed 5 of 10 files at r6, 4 of 4 files at r7.
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @ajwerner and @tbg)
pkg/storage/cmd_app_batch.go, line 83 at r7 (raw file):
toProcess
pkg/storage/cmd_app_batch.go, line 89 at r7 (raw file):
In that case, the returned decodedCmdRequiresFlush is true to indicate ...
pkg/storage/cmd_app_batch.go, line 105 at r7 (raw file):
for len(toProcess) > 0 {
e := &toProcess[0]
// If we haven't already decoded this entry then we decode it into the
Why "If we haven't already"? We're always decoding into the scratch space here.
pkg/storage/cmd_app_ctx.go, line 70 at r7 (raw file):
idKey storagebase.CmdIDKey
raftCmd storagepb.RaftCommand
cc raftpb.ConfChange // only populated for conf changes
nit: cc and ccCtx aren't trivial structs, might be worth having them be pointers since they're barely ever populated. It probably isn't worth it.
pkg/storage/cmd_app_ctx_buf.go, line 38 at r7 (raw file):
}
var entryApplicationStateBufNodeSyncPool = sync.Pool{
Rename
pkg/storage/replica_application.go, line 722 at r5 (raw file):
Previously, ajwerner wrote…
Only truncated state. Nathan and I also discussed making truncated state commands non-trivial and I'm still happy to do it. If we want to do it I'll do it on the next pass.
Raft truncation is pretty frequent, I'd worry that that'll backfire somehow. Mild preference to keep it trivial then. Maybe we'll manage to restructure the log truncation so that it's less invasive, too (i.e. not synced through a Raft command in the first place).
pkg/storage/replica_application.go, line 633 at r6 (raw file):
Previously, ajwerner wrote…
@tbg I'm curious to hear how you feel about this here. It's the only place we really reach down and muck with the
Replica
. If we hoisted this and then passed theraftMu.stateLoader
this wouldn't even need to be a method onReplica
.
I like that, but more as a follow-up than eagerly - this PR should land and it's becoming quite difficult to trace the correctness of these smaller changes through the diffs.
pkg/storage/replica_application.go, line 98 at r7 (raw file):
) (stats handleCommittedEntriesStats, errExpl string, err error) {
var (
decodeBuf decodedRaftEntry
This escapes, right?
pkg/storage/replica_application.go, line 104 at r7 (raw file):
// The batch is always empty at the beginning of this loop. In each iteration, we either process a chunk of trivial entries or an individual nontrivial entry. The haveDecodedEntry branch always corresponds to a non-trivial entry, while the else branch processes either a single nontrivial entry or any number of trivial entries.
This might get a little more streamlined as far as the explanation goes if instead of handling a nontrivial command that is encountered first in the else branch eagerly, you defer it via haveDecodedEntry. Then you could say
The haveDecodedEntry branch corresponds to a nontrivial entry, while the else branch populates b with any number (possibly zero) of trivial entries.
I think that's worth doing, but I'll defer to your judgment.
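The loop shape being discussed can be modeled in isolation: trivial entries coalesce into a shared batch, and every non-trivial entry forms a batch of its own (`entry` and `batches` are hypothetical names; this is a sketch of the grouping rule, not the real handleCommittedEntries loop):

```go
package main

import "fmt"

type entry struct {
	id      int
	trivial bool
}

// batches groups a committed-entry stream so that runs of trivial entries
// are coalesced into one batch and every non-trivial entry lives alone.
func batches(entries []entry) [][]entry {
	var out [][]entry
	var cur []entry
	flush := func() {
		if len(cur) > 0 {
			out = append(out, cur)
			cur = nil
		}
	}
	for _, e := range entries {
		if e.trivial {
			cur = append(cur, e)
		} else {
			flush()                       // any pending trivial prefix becomes its own batch
			out = append(out, []entry{e}) // the non-trivial entry is applied alone
		}
	}
	flush()
	return out
}

func main() {
	bs := batches([]entry{{1, true}, {2, true}, {3, false}, {4, true}})
	fmt.Println(len(bs)) // 3
}
```

With this invariant, "last in batch" and "alone in batch" coincide for non-trivial entries, which is the guarantee argued for above.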
pkg/storage/replica_application.go, line 141 at r7 (raw file):
retrieve
pkg/storage/replica_application.go, line 181 at r7 (raw file):
// proposal's content is wiped and we apply an empty log entry instead. If an
// error occurs and the command was proposed locally, it will be communicated to
// a waiting proposer. The two typical cases errors occur are lease mismatch (in
Grammar off
pkg/storage/replica_application.go, line 575 at r7 (raw file):
//
// TODO(ajwerner): explore the costs and benefits of this use of Distinct().
// This call requires flushing the batch's writes but
the suspense!
pkg/storage/replica_application.go, line 637 at r7 (raw file):
cmd.replicatedResult().State.TruncatedState != nil
if haveTruncatedState {
apply, err := handleTruncatedStateBelowRaft(ctx, replicaState.TruncatedState,
Hmm, now I'm not sure this batches correctly. handleTruncatedStateBelowRaft loads from the engine, which now fails to reflect any prior truncations in the same batch. This is all probably fine since we're throwing away all but the last truncation (and don't do anything with the information in the ones we throw away prior to tossing them). Add some commentary.
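The read-your-writes distinction at issue here can be modeled with a toy engine (a deliberate simplification with hypothetical names; the real engine.Batch and Distinct() semantics live in the storage engine): reads through the batch see its own buffered writes, while a Distinct-style reader only sees the underlying engine state.

```go
package main

import "fmt"

// engine models committed state; writeBatch models uncommitted buffered writes.
type engine map[string]string

type writeBatch struct {
	base    engine
	pending map[string]string
}

// get reads through the batch, so it sees the batch's buffered writes.
func (b *writeBatch) get(k string) string {
	if v, ok := b.pending[k]; ok {
		return v
	}
	return b.base[k]
}

// getDistinct models a Distinct()-style reader that bypasses the buffered
// writes and only observes the underlying engine.
func (b *writeBatch) getDistinct(k string) string { return b.base[k] }

func (b *writeBatch) put(k, v string) { b.pending[k] = v }

func main() {
	b := &writeBatch{base: engine{"trunc": "old"}, pending: map[string]string{}}
	b.put("trunc", "new")
	fmt.Println(b.get("trunc"), b.getDistinct("trunc")) // new old
}
```

This is why a loader that reads via the batch observes an earlier truncation written in the same batch, while a Distinct-style reader would not.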
pkg/storage/replica_application_result.go, line 92 at r7 (raw file):
truncatedStateUpdated = haveState && replicatedResult.State.TruncatedState != nil
if truncatedStateUpdated {
replicaState.TruncatedState = replicatedResult.State.TruncatedState
This technically means that when multiple truncations are batched together, we'll just use the last one, right? I think this is OK since the actual log deletion is a side effect of whichever truncated state survives. Just making sure I understand this correctly.
pkg/storage/replica_application_result.go, line 131 at r7 (raw file):
replicatedResult.State.UsingAppliedStateKey = false
}
// ReplicaState.Stats was previously non-nullable which caused nodes to
Is this still current? Sounds like something we can just remove at this point (fine as a follow up).
pkg/storage/replica_application_result.go, line 150 at r7 (raw file):
// applied to the Replica's in-memory state. This method deals with applying
// non-trivial side effects, notifying waiting clients, releasing latches,
// stepping config changes, and asserting hard state as required.
nit: technically we're not stepping config changes. You could say "informing Raft about applied config changes"
pkg/storage/store.go, line 3613 at r7 (raw file):
// remote replicas will likely start campaigning.
if elapsed >= defaultReplicaRaftMuWarnThreshold {
log.Warningf(ctx, "handle raft ready: %.1fs [processed=%d]",
print some other numbers here, it'll be interesting.
da52af6 to 6cbe344 (Compare)
TFTR! @tbg can you give the change for the truncated state a glance before I bors this.
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @tbg)
pkg/storage/cmd_app_batch.go, line 83 at r7 (raw file):
Previously, tbg (Tobias Grieger) wrote…
toProcess
Done.
pkg/storage/cmd_app_batch.go, line 89 at r7 (raw file):
Previously, tbg (Tobias Grieger) wrote…
In that case, the returned decodedCmdRequiresFlush is true to indicate ...
Cleaned up the logic as suggested.
pkg/storage/cmd_app_batch.go, line 105 at r7 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Why "If we haven't already"? We're always decoding into the scratch space here.
Stale.
pkg/storage/cmd_app_ctx.go, line 70 at r7 (raw file):
Previously, tbg (Tobias Grieger) wrote…
nit: cc and ccCtx aren't trivial structs, might be worth having them be pointers since they're barely ever populated. It probably isn't worth it.
Done.
pkg/storage/cmd_app_ctx_buf.go, line 38 at r7 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Rename
Done.
pkg/storage/replica_application.go, line 722 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Raft truncation is pretty frequent, I'd worry that that'll backfire somehow. Mild preference to keep it trivial then. Maybe we'll manage to restructure the log truncation so that it's less invasive, too (i.e. not synced through a Raft command in the first place).
Ack.
pkg/storage/replica_application.go, line 98 at r7 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This escapes, right?
Indeed it did. Moved to the appCtxBatch struct.
pkg/storage/replica_application.go, line 104 at r7 (raw file):
Previously, tbg (Tobias Grieger) wrote…
// The batch is always empty at the beginning of this loop. In each iteration, we either process a chunk of trivial entries or an individual nontrivial entry. The haveDecodedEntry branch always corresponds to a non-trivial entry, while the else branch processes either a single nontrivial entry or any number of trivial entries.
This might get a little more streamlined as far as the explanation goes if instead of handling a nontrivial command that is encountered first in the else branch eagerly, you defer it via haveDecodedEntry. Then you could say
The haveDecodedEntry branch corresponds to a nontrivial entry, while the else branch populates b with any number (possibly zero) of trivial entries.
I think that's worth doing, but I'll defer to your judgment.
Done.
pkg/storage/replica_application.go, line 141 at r7 (raw file):
Previously, tbg (Tobias Grieger) wrote…
retrieve
I have a serious problem with these diphthong vowel orderings.
pkg/storage/replica_application.go, line 181 at r7 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Grammar off
I'm not totally sure about the grammar qualms. I tried to tighten up the grammar in this area, but my changes were mostly just speculation.
pkg/storage/replica_application.go, line 575 at r7 (raw file):
Previously, tbg (Tobias Grieger) wrote…
the suspense!
Done.
pkg/storage/replica_application.go, line 637 at r7 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Hmm, now I'm not sure this batches correctly. handleTruncatedStateBelowRaft loads from the engine, which now fails to reflect any prior truncations in the same batch. This is all probably fine since we're throwing away all but the last truncation (and don't do anything with the information in the ones we throw away prior to tossing them). Add some commentary.
If we don't use the `Distinct()` batch it will be safe, right? I changed it to use the batch rather than the `Distinct()` writer. I'd like your approval before I merge this.
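The distinction here is read-your-writes: reading through the batch sees writes staged earlier in the same batch, while a writer that bypasses batch reads would not. A toy illustration (the `engine`/`writeBatch` types are stand-ins, not CockroachDB's engine API):

```go
package main

import "fmt"

// engine stands in for the underlying storage engine contents.
type engine map[string]string

// writeBatch buffers writes; reads through it see the staged writes.
type writeBatch struct {
	eng     engine
	pending map[string]string
}

func (b *writeBatch) Put(k, v string) { b.pending[k] = v }

// Get reads through the batch: staged writes shadow the engine contents.
func (b *writeBatch) Get(k string) string {
	if v, ok := b.pending[k]; ok {
		return v
	}
	return b.eng[k]
}

func main() {
	eng := engine{"truncated-state": "index=10"}
	b := &writeBatch{eng: eng, pending: map[string]string{}}
	b.Put("truncated-state", "index=20") // staged but not yet committed

	fmt.Println(eng["truncated-state"])   // the engine still holds the old value
	fmt.Println(b.Get("truncated-state")) // a batch read sees the staged write
}
```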
pkg/storage/replica_application_result.go, line 92 at r7 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This technically means that when multiple truncations are batched together, we'll just use the last one, right? I think this is OK since the actual log deletion is a side effect of whichever truncated state survives. Just making sure I understand this correctly.
Correct, comment added.
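The "last truncation wins" behavior can be sketched in a few lines. This is illustrative only (the real staged state lives in the replica's command application machinery, and the surviving truncation drives the actual log deletion):

```go
package main

import "fmt"

// truncatedState is a stand-in for the raft log truncated state.
type truncatedState struct{ Index uint64 }

// stagedBatch coalesces truncations staged within one application batch:
// a later truncation simply overwrites an earlier one.
type stagedBatch struct {
	truncatedState *truncatedState // last writer wins within the batch
}

func (b *stagedBatch) stageTruncation(ts truncatedState) {
	b.truncatedState = &ts
}

func main() {
	var b stagedBatch
	for _, ts := range []truncatedState{{Index: 10}, {Index: 20}, {Index: 30}} {
		b.stageTruncation(ts)
	}
	// Only the final truncation is written out and triggers log deletion.
	fmt.Println(b.truncatedState.Index)
}
```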
pkg/storage/replica_application_result.go, line 131 at r7 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Is this still current? Sounds like something we can just remove at this point (fine as a follow up).
Added a TODO.
pkg/storage/store.go, line 3613 at r7 (raw file):
Previously, tbg (Tobias Grieger) wrote…
print some other numbers here, it'll be interesting.
Done.
6cbe344
to
b7eff72
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 7 of 7 files at r8.
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @ajwerner)
pkg/storage/cmd_app_batch.go, line 90 at r8 (raw file):
```
// commands must always be in their own batch. If a non-trivial command is
// encountered the batch is returned immediately without adding the newly
// decoded command to the batch of removing it from remaining.
```
s/of/or/
pkg/storage/cmd_app_ctx_buf.go, line 38 at r7 (raw file):
Previously, ajwerner wrote…
Done.
Ping
pkg/storage/replica_application.go, line 917 at r5 (raw file):
Previously, ajwerner wrote…
huh?
Wish I could tell you.
pkg/storage/replica_application.go, line 181 at r7 (raw file):
Previously, ajwerner wrote…
I'm not totally sure about the grammar qualms. I tried to tighten up the grammar in this area but my changes were mostly just speculation.
Yeah the missing "in which" threw me off, my brain doesn't think that that's valid (regardless of whether it is), thanks for fixing.
pkg/storage/replica_application.go, line 637 at r7 (raw file):
Previously, ajwerner wrote…
If we don't use the `Distinct()` batch it will be safe, right? I changed it to use the batch rather than the `Distinct()` writer. I'd like your approval before I merge this.
Yes, that checks out.
Can you rename `distinctEng` in `handleTruncatedStateBelowRaft` to just `batch`?
I don't know if there's a penalty for using the batch after having used `Distinct()`, but we probably do that already anyway. And for truncations, the right thing in the long run is to avoid having them triggered by Raft commands, so this particular use we're hoping to remove at some point.
Also, lol at the semantics of `Distinct()`. I remember now that I wasted time on this in the past.
pkg/storage/replica_application.go, line 111 at r8 (raw file):
into
This commit batches raft command application where possible. The basic approach is to notice that many commands only "trivially" update the replica state machine. Trivial commands can be processed in a single batch by acting on a copy of the replica state. Non-trivial commands share the same logic but always commit alone as they for one reason or another rely on having a view of the replica or storage engine as of a specific log index.

This commit also sneaks in another optimization which batching enables. Each command mutates a portion of replica state called the applied state which tracks a combination of the log index which has been applied and the MVCC stats of the range as of that application. Before this change each entry would update this applied state and each of those writes will end up in the WAL and mem-table just to be compacted away in L1. Now that commands are being applied to the storage engine in a single batch it is easy to only update the applied state for the last entry in the batch.

For sequential writes this patch shows a considerable performance win. The below benchmark was run on a 3-node c5d.4xlarge (16 vCPU) cluster with concurrency 128.

```
name            old ops/s   new ops/s   delta
KV0-throughput  22.1k ± 1%  32.8k ± 1%  +48.59%  (p=0.029 n=4+4)

name     old ms/s   new ms/s   delta
KV0-P50  7.15 ± 2%  6.00 ± 0%  -16.08%  (p=0.029 n=4+4)
KV0-Avg  5.80 ± 0%  3.80 ± 0%  -34.48%  (p=0.029 n=4+4)
```

Due to the re-organization of logic in the change, the Replica.mu does not need to be acquired as many times during the application of a batch. In the common case it is now acquired exactly twice in the process of applying a batch whereas before it was acquired more than twice per entry. This should hopefully improve performance on large machines which experience mutex contention for a single range. This effect is visible on large machines. Below are results from running a normal KV0 workload on c5d.18xlarge machines (72 vCPU machines) with concurrency 1024 and 16 initial splits.

```
name            old ops/s   new ops/s    delta
KV0-throughput  78.1k ± 1%  116.8k ± 5%  +49.42%  (p=0.029 n=4+4)

name     old ms/s   new ms/s   delta
KV0-P50  24.4 ± 3%  19.7 ± 7%  -19.28%  (p=0.029 n=4+4)
KV0-Avg  12.6 ± 0%   7.5 ± 9%  -40.87%  (p=0.029 n=4+4)
```

Fixes cockroachdb#37426.

Release note (performance improvement): Batch raft entry application and coalesce writes to applied state for the batch.
b7eff72
to
e4ce717
Compare
TFTR! bors r=nvanbenschoten,tbg
38568: storage: batch command application and coalesce applied state per batch r=nvanbenschoten,tbg a=ajwerner

This commit batches raft command application where possible. The basic approach is to notice that many commands only "trivially" update the replica state machine. Trivial commands can be processed in a single batch by acting on a copy of the replica state. Non-trivial commands share the same logic but always commit alone as they for one reason or another rely on having a view of the replica or storage engine as of a specific log index.

This commit also sneaks in another optimization which batching enables. Each command mutates a portion of replica state called the applied state which tracks a combination of the log index which has been applied and the MVCC stats of the range as of that application. Before this change each entry would update this applied state and each of those writes will end up in the WAL and mem-table just to be compacted away in L0. Now that commands are being applied to the storage engine in a single batch it is easy to only update the applied state for the last entry in the batch.

For sequential writes this patch shows a considerable performance win. The below benchmark was run on a 3-node c5d.4xlarge (16 vCPU) cluster with concurrency 128.

```
name            old ops/s   new ops/s   delta
KV0-throughput  22.1k ± 1%  32.8k ± 1%  +48.59%  (p=0.029 n=4+4)

name     old ms/s   new ms/s   delta
KV0-P50  7.15 ± 2%  6.00 ± 0%  -16.08%  (p=0.029 n=4+4)
KV0-Avg  5.80 ± 0%  3.80 ± 0%  -34.48%  (p=0.029 n=4+4)
```

Due to the re-organization of logic in the change, the Replica.mu does not need to be acquired as many times during the application of a batch. In the common case it is now acquired exactly twice in the process of applying a batch whereas before it was acquired more than twice per entry. This should hopefully improve performance on large machines which experience mutex contention for a single range. These optimizations combine for a big win on large machines. Below are results from running a normal KV0 workload on c5d.18xlarge machines (72 vCPU machines) with concurrency 1024 and 16 initial splits.

```
name            old ops/s   new ops/s    delta
KV0-throughput  78.1k ± 1%  116.8k ± 5%  +49.42%  (p=0.029 n=4+4)

name     old ms/s   new ms/s   delta
KV0-P50  24.4 ± 3%  19.7 ± 7%  -19.28%  (p=0.029 n=4+4)
KV0-Avg  12.6 ± 0%   7.5 ± 9%  -40.87%  (p=0.029 n=4+4)
```

Fixes #37426.

Release note (performance improvement): Batch raft entry application and coalesce writes to applied state for the batch.

Co-authored-by: Andrew Werner <[email protected]>
Build succeeded
This commit batches raft command application where possible. The basic
approach is to notice that many commands only "trivially" update the
replica state machine. Trivial commands can be processed in a single
batch by acting on a copy of the replica state. Non-trivial commands
share the same logic but always commit alone as they for one reason or
another rely on having a view of the replica or storage engine as of a
specific log index.
This commit also sneaks in another optimization which batching enables.
Each command mutates a portion of replica state called the applied state
which tracks a combination of the log index which has been applied and the
MVCC stats of the range as of that application. Before this change each
entry would update this applied state and each of those writes will end up
in the WAL and mem-table just to be compacted away in L0. Now that
commands are being applied to the storage engine in a single batch it is easy
to only update the applied state for the last entry in the batch.
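A minimal sketch of this coalescing, with illustrative stand-in types rather than CockroachDB's actual `enginepb.RangeAppliedState` machinery: the applied index and stats delta accumulate in memory across the batch, and the applied-state key is written once at the end instead of once per entry.

```go
package main

import "fmt"

// appliedState is a stand-in for the per-range applied state.
type appliedState struct {
	index uint64
	bytes int64 // stand-in for the MVCC stats delta
}

// batch counts engine writes of the applied-state key.
type batch struct {
	writes int
	state  appliedState
}

func (b *batch) putAppliedState(s appliedState) {
	b.writes++
	b.state = s
}

// applyEntries folds each entry's stats delta into an in-memory copy and
// writes the applied state exactly once for the whole batch.
func applyEntries(b *batch, deltas []int64) {
	var s appliedState
	for i, d := range deltas {
		s.index = uint64(i + 1)
		s.bytes += d
	}
	b.putAppliedState(s) // one write instead of len(deltas) writes
}

func main() {
	var b batch
	applyEntries(&b, []int64{10, 20, 30})
	fmt.Println(b.writes, b.state.index, b.state.bytes)
}
```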
For sequential writes this patch shows a considerable performance win. The below
benchmark was run on a 3-node c5d.4xlarge (16 vCPU) cluster with concurrency 128.
```
name            old ops/s   new ops/s   delta
KV0-throughput  22.1k ± 1%  32.8k ± 1%  +48.59%  (p=0.029 n=4+4)

name     old ms/s   new ms/s   delta
KV0-P50  7.15 ± 2%  6.00 ± 0%  -16.08%  (p=0.029 n=4+4)
KV0-Avg  5.80 ± 0%  3.80 ± 0%  -34.48%  (p=0.029 n=4+4)
```
Due to the re-organization of logic in the change, the Replica.mu does not need
to be acquired as many times during the application of a batch. In the common
case it is now acquired exactly twice in the process of applying a batch whereas
before it was acquired more than twice per entry. This should hopefully improve
performance on large machines which experience mutex contention for a single
range.
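The lock-amortization pattern can be sketched as follows (illustrative only; the real `Replica.mu` protects far more state than this stand-in): take the mutex once to snapshot in-memory state, apply every entry in the batch against the local copy, then take it once more to publish the result.

```go
package main

import (
	"fmt"
	"sync"
)

// replica is a stand-in that counts mutex acquisitions per batch.
type replica struct {
	mu      sync.Mutex
	applied uint64
	locks   int
}

func (r *replica) lock() { r.mu.Lock(); r.locks++ }

func (r *replica) applyBatch(entries []uint64) {
	r.lock() // 1st acquisition: snapshot in-memory state
	snap := r.applied
	r.mu.Unlock()

	for _, e := range entries {
		snap = e // apply each entry to the local copy, no lock held
	}

	r.lock() // 2nd acquisition: publish the batch's result
	r.applied = snap
	r.mu.Unlock()
}

func main() {
	r := &replica{}
	r.applyBatch([]uint64{1, 2, 3, 4})
	fmt.Println(r.locks, r.applied)
}
```

Per-entry locking would have acquired the mutex at least once per entry; here it is exactly two acquisitions regardless of batch size.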
These optimizations combine for a big win on large machines. Below are results
from running a normal KV0 workload on c5d.18xlarge machines (72 vCPU machines) with
concurrency 1024 and 16 initial splits.
```
name            old ops/s   new ops/s    delta
KV0-throughput  78.1k ± 1%  116.8k ± 5%  +49.42%  (p=0.029 n=4+4)

name     old ms/s   new ms/s   delta
KV0-P50  24.4 ± 3%  19.7 ± 7%  -19.28%  (p=0.029 n=4+4)
KV0-Avg  12.6 ± 0%   7.5 ± 9%  -40.87%  (p=0.029 n=4+4)
```
Fixes #37426.
Release note (performance improvement): Batch raft entry application and
coalesce writes to applied state for the batch.