
storage: batch command application and coalesce applied state per batch #38568

Merged

Conversation


@ajwerner ajwerner commented Jun 28, 2019

This commit batches raft command application where possible. The basic
approach is to observe that many commands only "trivially" update the
replica state machine. Trivial commands can be processed in a single
batch by acting on a copy of the replica state. Non-trivial commands
share the same logic but always commit alone because, for one reason or
another, they rely on having a view of the replica or storage engine as
of a specific log index.
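The trivial/non-trivial split can be sketched roughly as follows. This is a minimal illustration of the batching policy only, not the actual CockroachDB code; the `cmd` type and `applyCommitted` function are hypothetical stand-ins for a decoded raft entry and the application loop.

```go
package main

import "fmt"

// cmd is a hypothetical stand-in for a decoded raft entry. Trivial
// mirrors the role of the PR's isTrivial check.
type cmd struct {
	Index   uint64
	Trivial bool
}

// applyCommitted sketches the policy described above: trivial commands
// accumulate into a single batch, while a non-trivial command first
// flushes any pending trivial commands and then commits alone.
func applyCommitted(cmds []cmd, flush func(batch []cmd)) {
	var batch []cmd
	for _, c := range cmds {
		if !c.Trivial {
			if len(batch) > 0 {
				flush(batch) // commit pending trivial commands first
				batch = nil
			}
			flush([]cmd{c}) // non-trivial command commits alone
			continue
		}
		batch = append(batch, c)
	}
	if len(batch) > 0 {
		flush(batch) // commit any trailing trivial commands
	}
}

func main() {
	var sizes []int
	applyCommitted([]cmd{{1, true}, {2, true}, {3, false}, {4, true}},
		func(b []cmd) { sizes = append(sizes, len(b)) })
	fmt.Println(sizes) // [2 1 1]
}
```

With two trivial entries, one non-trivial entry, and one trailing trivial entry, the loop produces three batches of sizes 2, 1, and 1.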

This commit also sneaks in another optimization which batching enables.
Each command mutates a portion of replica state called the applied state,
which tracks the log index that has been applied and the MVCC stats of
the range as of that application. Before this change each entry would
update this applied state, and each of those writes would end up in the
WAL and mem-table just to be compacted away in L0. Now that commands are
applied to the storage engine in a single batch, it is easy to update the
applied state only for the last entry in the batch.
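The coalescing can be sketched as follows. The `appliedState` type and `write` callback are hypothetical simplifications (real MVCC stats are a struct, not an integer); the point is that the per-entry mutations happen on an in-memory copy and only the final value is written durably.

```go
package main

import "fmt"

// appliedState is a hypothetical stand-in for the replica's applied
// state: the highest applied log index plus accumulated MVCC stats.
type appliedState struct {
	Index uint64
	Stats int64 // simplified stand-in for enginepb.MVCCStats
}

// applyBatchCoalesced applies a run of per-entry stats deltas to an
// in-memory copy of the applied state and performs a single durable
// write for the whole batch, as described above. write stands in for
// the storage-engine write that previously happened once per entry.
func applyBatchCoalesced(start appliedState, deltas []int64, write func(appliedState)) appliedState {
	s := start
	for i, d := range deltas {
		s.Index = start.Index + uint64(i) + 1 // advance per entry, in memory
		s.Stats += d
	}
	write(s) // one write, reflecting the last entry only
	return s
}

func main() {
	writes := 0
	final := applyBatchCoalesced(appliedState{Index: 10}, []int64{5, -2, 7},
		func(appliedState) { writes++ })
	fmt.Println(writes, final.Index, final.Stats) // 1 13 10
}
```

Three entries are applied but the applied state is written once, at index 13 with the combined stats delta.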

For sequential writes this patch shows a considerable performance win. The below
benchmark was run on a 3-node c5d.4xlarge (16 vCPU) cluster with concurrency 128.

name            old ops/s   new ops/s   delta
KV0-throughput  22.1k ± 1%  32.8k ± 1%  +48.59%  (p=0.029 n=4+4)

name            old ms/s    new ms/s    delta
KV0-P50          7.15 ± 2%   6.00 ± 0%  -16.08%  (p=0.029 n=4+4)
KV0-Avg          5.80 ± 0%   3.80 ± 0%  -34.48%  (p=0.029 n=4+4)

Due to the re-organization of logic in this change, Replica.mu does not
need to be acquired as many times during the application of a batch. In
the common case it is now acquired exactly twice per batch, whereas
before it was acquired more than twice per entry. This should improve
performance on large machines that experience mutex contention on a
single range.
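The two-acquisitions-per-batch pattern can be sketched like this. The `replica` type and integer `state` are toy stand-ins for `storage.Replica` and the fields guarded by `Replica.mu`; this illustrates the locking shape only.

```go
package main

import (
	"fmt"
	"sync"
)

// replica is a toy stand-in for storage.Replica; state stands in for
// the applied-state fields protected by Replica.mu.
type replica struct {
	mu    sync.Mutex
	state int64
}

// applyBatch sketches the locking pattern described above: the mutex is
// taken once to snapshot the state the batch starts from, the whole
// batch is applied to a local copy with no lock held, and the mutex is
// taken once more to publish the combined result.
func (r *replica) applyBatch(deltas []int64) {
	r.mu.Lock()
	local := r.state // first acquisition: snapshot
	r.mu.Unlock()

	for _, d := range deltas {
		local += d // per-entry work, lock-free
	}

	r.mu.Lock()
	r.state = local // second acquisition: publish
	r.mu.Unlock()
}

func main() {
	r := &replica{state: 100}
	r.applyBatch([]int64{1, 2, 3})
	fmt.Println(r.state) // 106
}
```

However many entries the batch contains, the lock is taken exactly twice, rather than multiple times per entry.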

These optimizations combine for a big win on large machines. Below are results
from running a normal KV0 workload on c5d.18xlarge machines (72 vCPU machines) with
concurrency 1024 and 16 initial splits.

name            old ops/s   new ops/s    delta
KV0-throughput  78.1k ± 1%  116.8k ± 5%  +49.42%  (p=0.029 n=4+4)

name            old ms/s    new ms/s     delta
KV0-P50          24.4 ± 3%    19.7 ± 7%  -19.28%  (p=0.029 n=4+4)
KV0-Avg          12.6 ± 0%     7.5 ± 9%  -40.87%  (p=0.029 n=4+4)

Fixes #37426.

Release note (performance improvement): Batch raft entry application and
coalesce writes to applied state for the batch.

@ajwerner ajwerner requested review from nvanbenschoten and a team June 28, 2019 22:54
@cockroach-teamcity
This change is Reviewable

@ajwerner ajwerner force-pushed the ajwerner/batch-command-application branch 4 times, most recently from a57a49c to e32ad08 Compare June 29, 2019 15:35
@ajwerner ajwerner changed the title storage: batch command application storage: batch command application and coalesce applied state per batch Jun 29, 2019
@ajwerner ajwerner force-pushed the ajwerner/batch-command-application branch from e32ad08 to 006f72e Compare July 1, 2019 13:11

ajwerner commented Jul 1, 2019

In the initial version of this revision I had been accidentally stepping the raft state machine for each entry (instead of just for conf changes). This was remedied early on, but I hadn't re-run the benchmarks. They have now been updated to show a solid 48% throughput win on KV0/seq=true.

I'm sure there's more commentary to add to this PR and I'm sure I've made some things stale but on the whole it's a pretty exciting win.


@nvanbenschoten nvanbenschoten left a comment


Very cool!

Before this change each entry would update this applied state and each of those writes will end up
in the WAL, mem-table, and L0 just the be compacted away in L1

My understanding was that they would be collapsed when compacting from the mem-table to L0, not from L0 to L1. Am I confused?

Reviewed 4 of 7 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)


pkg/storage/application_state_buf.go, line 21 at r1 (raw file):

// applicationStateBufNode.
// TODO(ajwerner): justify this number.
const applicationStateBufSize = 8

s/applicationStateBufSize/applicationStateBufChunkSize/


pkg/storage/replica_application.go, line 37 at r1 (raw file):

)

// applicationState stores the state required to apply an entry to a

s/an entry/a single entry/


pkg/storage/replica_application.go, line 39 at r1 (raw file):

// applicationState stores the state required to apply an entry to a
// replica. The state is accumulated in stages which occur in
// Replica.handleCommittedEntriesRaftMuLocked. From a high level, a command is

Enumerating the steps here would be helpful. Or maybe add a bit of pseudo code like we have in #37426.


pkg/storage/replica_application.go, line 46 at r1 (raw file):

// Then the batch is committed, the side-effects are applied and the local
// result is processed.
type applicationState struct {

nit: something about the name applicationState seems particularly overloaded to me in the world of computer science. I know you're using "application" to mean Raft entry application, but it still jumps out at me after reviewing this entire PR. It also doesn't do a good job conveying that this is per-entry state and nothing more. Did you consider any other names? Maybe:

  • entryApplication
  • entryApplicationState
  • entAppliedState
  • entAppState

pkg/storage/replica_application.go, line 50 at r1 (raw file):

	// proposal's context if this entry was proposed locally.
	ctx context.Context
	// e is th Entry being applied.

the


pkg/storage/replica_application.go, line 54 at r1 (raw file):

	// The below fields are decoded from e.

Remove this line so that it's clear that you're just referring to this group of four fields.

I do like that you added this comment and grouped the fields like this though. It would be helpful to a reader of this code if all of the groups of fields were commented as well. For instance, the next group would be // The below fields are populated on the proposing Replica only and come from the Replica's proposal map.


pkg/storage/replica_application.go, line 60 at r1 (raw file):

	ccCtx   ConfChangeContext // only populated for conf changes

	proposedLocally bool

Consider making this a method to reduce the footprint of the applicationStateBuf.


pkg/storage/replica_application.go, line 67 at r1 (raw file):

	leaseIndex uint64
	lResult    *result.LocalResult

nit: s/lResult/localResult/


pkg/storage/replica_application.go, line 81 at r1 (raw file):

// At the time of writing it is possible that the current conditions are too
// strict but they are certainly sufficient.
func isTrivial(usingStateAppliedKey bool, r *storagepb.ReplicatedEvalResult) bool {

nit: move the *storagepb.ReplicatedEvalResult up because that's the more important argument.


pkg/storage/replica_application.go, line 94 at r1 (raw file):

				r.State.Lease == nil &&
				r.State.Desc == nil &&
				(usingStateAppliedKey || !r.State.UsingAppliedStateKey)))

I would have expected this to be (!usingStateAppliedKey || r.State.UsingAppliedStateKey). That's the condition where we upgrade to using the applied state key, right?

Also, it would almost certainly be ok to consider all commands during the applied state key upgrade non-trivial if that simplifies the code. I think that would allow us to get rid of the usingStateAppliedKey arg.


pkg/storage/replica_application.go, line 102 at r1 (raw file):

}

// decode decodes s.e into the other fields of s.

It seems like this could be merged with initialize to simplify the usage of applicationState.


pkg/storage/replica_application.go, line 149 at r1 (raw file):

// commands. Committed raft commands are applied to the batch in a multi-stage
// process whereby individual commands are decoded, prepared for application
// relative to the current view of replicaState.

This sentence seemed to end abruptly.


pkg/storage/replica_application.go, line 154 at r1 (raw file):

	stats                 enginepb.MVCCStats
	replicaState          storagepb.ReplicaState
	updatedTruncatedState bool

Why are we allowing multiple Raft log truncations in a single batch instead of making them non-trivial commands?


pkg/storage/replica_application.go, line 182 at r1 (raw file):

applying it its batch


pkg/storage/replica_application.go, line 185 at r1 (raw file):

// This function is called after a batch has been written to the storage engine.
// For trivial commands this function should result is a zero value rResult.
func clearTrivialReplicatedEvalResultFields(

This function seems to belong right above handleComplexReplicatedEvalResult.


pkg/storage/replica_application.go, line 190 at r1 (raw file):

	// Fields for which no action is taken in this method are zeroed so that
	// they don't trigger an assertion at the end of the method (which checks
	// that all fields were handled).

It doesn't trigger an assertion anymore.


pkg/storage/replica_application.go, line 237 at r1 (raw file):

		log.Infof(ctx, "flushing batch %v %v", b.replicaState, b.appStates.len)
	}
	// TODO(ajwerner): clean up this sanity check

Reminder to address this TODO.


pkg/storage/replica_application.go, line 242 at r1 (raw file):

	}

	if err := b.batch.Commit(false); err != nil {

Might as well leave a note about why we don't need to sync here.


pkg/storage/replica_application.go, line 296 at r1 (raw file):

	r.mu.state.RaftAppliedIndex = b.replicaState.RaftAppliedIndex
	r.mu.state.LeaseAppliedIndex = b.replicaState.LeaseAppliedIndex
	deltaStats.Subtract(*r.mu.state.Stats)

nit: you can assign *r.mu.state.Stats to a variable and perform the subtraction outside of the lock.


pkg/storage/replica_application.go, line 356 at r1 (raw file):

		// before notifying a potentially waiting client.
		clearTrivialReplicatedEvalResultFields(b.replicaState.UsingAppliedStateKey, rResult)
		if isNonTrivial && !r.handleComplexReplicatedEvalResult(s.ctx, *rResult) {

Did you consider passing rResult as a pointer here, letting handleComplexReplicatedEvalResult modify it, and then performing the rResult.Equal(storagepb.ReplicatedEvalResult{}) for both trivial and non-trivial commands in the same place?


pkg/storage/replica_application.go, line 395 at r1 (raw file):

			}
		}

stray line


pkg/storage/replica_application.go, line 403 at r1 (raw file):

}

// handleCommittedEntriesRaftMuLocked deals with the complexities involved in

Consider adding the term "replicated state machine" in here somewhere. This is fundamentally the entry point for which all updates to the state machine are made.


pkg/storage/replica_application.go, line 403 at r1 (raw file):

}

// handleCommittedEntriesRaftMuLocked deals with the complexities involved in

nit: this function drives the rest of these functions, so I'd move it up above applyBatch. In general, I think this file would be easiest to read if everything was in the order that it is called during entry application, mod type definitions and methods (putEntryApplicationBatch and up), which make sense where they are.


pkg/storage/replica_application.go, line 418 at r1 (raw file):

// their ReplicatedEvalResult applied to the batch's view of the ReplicaState.
// Finally the batch is written to the storage engine and its side effects on
// the Replicas state are applied.

Replica's


pkg/storage/replica_application.go, line 419 at r1 (raw file):

// Finally the batch is written to the storage engine and its side effects on
// the Replicas state are applied.
func (r *Replica) handleCommittedEntriesRaftMuLocked(

Add to the comment about the stats and refreshReason arguments.


pkg/storage/replica_application.go, line 438 at r1 (raw file):

	// state.
	b := getEntryApplicationBatch()
	defer putEntryApplicationBatch(b)

s/putEntryApplicationBatch/releaseEntryApplicationBatch/


pkg/storage/replica_application.go, line 441 at r1 (raw file):

	// Decode commands into the batch until we find a non-trivial one.
	// When we find a non-trivial one, unpush it, set committed entries to be that one next.

This comment is confusing. What do you mean by "unpush" it? What does "committed entries to be that one next" mean?


pkg/storage/replica_application.go, line 451 at r1 (raw file):

			// entries in the current batch then we'll unpush it, apply the batch and
			// then push it back.
			s := b.appStates.pushBack()

"pushBack" doesn't seem like the right name for this method because it's not pushing anything. Something like allocateBack


pkg/storage/replica_application.go, line 485 at r1 (raw file):

			// batch.
			if !isTrivial(b.replicaState.UsingAppliedStateKey, &s.raftCmd.ReplicatedEvalResult) {
				// If there are no other entries already in the batch then we pop this

I think part of the reason why this code gets tricky is because we're defining a non-trivial command as one that both a) relies on in-memory state matching durable state and b) performs side-effects that all later commands need to observe during their application. In other words, we're defining a non-trivial command as one that needs to be applied in its own batch without any other entries before or after it.

I'm wondering if this is overly restrictive. My intuition would be that we could probably get away with having non-trivial commands be part of a batch as long as they are the last command in the batch. This construction would dramatically simplify the code structure here because it would turn the pushing and unpushing process into a simple process of: 1. batch as much as possible, 2: flush after each non-trivial command is added to the batch, 3. flush again at the end. It would also allow us to think of each batch as having n "trivial side effects" and one "complex side effect", which is easy to understand and to structure in the code.

This would require reworking "part a" of the definition of a non-trivial command, but I suspect that it's not quite right anyway. The non-trivial commands don't require in-memory state to match durable state before they apply, just after when performing their side effects, right?
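The restructuring suggested in this comment could look roughly like the sketch below, with a hypothetical `cmd` type standing in for a decoded raft entry; this is an illustration of the proposed policy, not actual code from the PR.

```go
package main

import "fmt"

// cmd is a hypothetical stand-in for a decoded raft entry. Trivial
// mirrors the role of the PR's isTrivial check.
type cmd struct {
	Index   uint64
	Trivial bool
}

// applyAll sketches the simplified loop proposed above: batch as much
// as possible, flush immediately after a non-trivial command is
// appended (so it is always the last command in its batch), and flush
// once more at the end. Each batch then has n trivial side effects and
// at most one complex side effect, carried by its final command.
func applyAll(cmds []cmd, flush func(batch []cmd)) {
	var batch []cmd
	for _, c := range cmds {
		batch = append(batch, c)
		if !c.Trivial {
			flush(batch) // non-trivial command closes its batch
			batch = nil
		}
	}
	if len(batch) > 0 {
		flush(batch) // trailing trivial commands
	}
}

func main() {
	var sizes []int
	applyAll([]cmd{{1, true}, {2, true}, {3, false}, {4, true}},
		func(b []cmd) { sizes = append(sizes, len(b)) })
	fmt.Println(sizes) // [3 1]
}
```

Compared with committing each non-trivial command entirely alone, the same four entries now produce two batches instead of three, and there is no need to "unpush" anything.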


pkg/storage/replica_application.go, line 518 at r1 (raw file):

		}
		r.mu.Unlock()
		b.replicaState.Stats = &b.stats

nit: This isn't performing an allocation, so there's no need to move it outside of the critical section if doing so hurts readability. I'd suggest doing something simpler like:

	retreiveLocalProposals := func() {
		var it applicationStateBufIterator
		r.mu.Lock()
		defer r.mu.Unlock()
		b.replicaState = r.mu.state
		b.replicaState.Stats = &b.stats
		*b.replicaState.Stats = *r.mu.state.Stats

pkg/storage/replica_application.go, line 537 at r1 (raw file):

		retreiveLocalProposals()
		b.batch = r.store.engine.NewBatch()
		processBatch()

processBatch and applyBatch sound basically the same. Consider s/processBatch/stageBatch/.


pkg/storage/replica_application.go, line 552 at r1 (raw file):

and it is checked whether the request's key span is contained in the Replica's

This part isn't true because we pass roachpb.RSpan{} to requestCanProceed.


pkg/storage/replica_application.go, line 558 at r1 (raw file):

violations of the LeaseAppliedIndex (in which the caller tries again)

The caller tries again? Don't we do that below Raft with tryReproposeWithNewLeaseIndex?


pkg/storage/replica_application.go, line 560 at r1 (raw file):

should be applied

That's done by this function. This gives the impression that it's the caller's responsibility.


pkg/storage/replica_application.go, line 944 at r1 (raw file):

	return truncatedStateUpdated

stray line


pkg/storage/replica_application.go, line 1047 at r1 (raw file):

provides is performed

The wording is off here.


pkg/storage/replica_application.go, line 1173 at r1 (raw file):

	}

	return s.raftCmd.ReplicatedEvalResult.ChangeReplicas != nil

This is a strange return value. Why doesn't the caller use the applicationState that it already has in scope?


pkg/storage/replica_application.go, line 1181 at r1 (raw file):

// handled by the caller.
// The returned ReplicatedEvalResult replaces the caller's.
func (r *Replica) applyRaftCommand(

s/applyRaftCommand/applyRaftCommandToBatch/

Also, the ordering of functions/methods comment applies here as well.


pkg/storage/replica_application.go, line 1235 at r1 (raw file):

	// The only remaining use of the batch is for range-local keys which we know
	// have not been previously written within this batch.
	writer := batch.Distinct()

What did we determine about this? Should we do it at all or is it a pessimization?


pkg/storage/replica_raft.go, line 683 at r1 (raw file):

	if len(rd.CommittedEntries) > 0 {
		var expl string
		expl, err = r.handleCommittedEntriesRaftMuLocked(ctx, rd.CommittedEntries, &stats, &refreshReason)

Can you double check that this isn't causing stats or refreshReason to escape to the heap?


pkg/storage/track_raft_protos.go, line 37 at r1 (raw file):

func TrackRaftProtos() func() []reflect.Type {
	// Grab the name of the function that roots all raft operations.
	processRaftFunc := funcName((*Replica).stageRaftCommand)

s/processRaftFunc/stageRaftCommand/

@ajwerner ajwerner force-pushed the ajwerner/batch-command-application branch from 006f72e to ddc446d Compare July 4, 2019 16:27

@ajwerner ajwerner left a comment


TFTR! Sorry for the mess

My understanding was that they would be collapsed when compacting from the mem-table to L0, not from L0 to L1. Am I confused?

You are correct, I was incorrect. Wishful thinking.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/storage/application_state_buf.go, line 21 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

s/applicationStateBufSize/applicationStateBufChunkSize/

Went with entryApplicationStateBufNodeSize


pkg/storage/replica_application.go, line 37 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

s/an entry/a single entry/

Done.


pkg/storage/replica_application.go, line 39 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Enumerating the steps here would be helpful. Or maybe add a bit of pseudo code like we have in #37426.

I added the pseudocode to handleCommittedEntriesRaftMuLocked,


pkg/storage/replica_application.go, line 46 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: something about the name applicationState seems particularly overloaded to me in the world of computer science. I know you're using "application" to mean Raft entry application, but it still jumps out at me after reviewing this entire PR. It also doesn't do a good job conveying that this is per-entry state and nothing more. Did you consider any other names? Maybe:

  • entryApplication
  • entryApplicationState
  • entAppliedState
  • entAppState

It was all entryApplication... and then I did a big s/entryApplication/application/g, though I clearly missed some spots, including (put|get)EntryApplicationBatch. I wasn't happy with either name, but I'm happy to go back.


pkg/storage/replica_application.go, line 54 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Remove this line so that it's clear that you're just referring to this group of four fields.

I do like that you added this comment and grouped the fields like this though. It would be helpful to a reader of this code if all of the groups of fields were commented as well. For instance, the next group would be // The below fields are populated on the proposing Replica only and come from the Replica's proposal map.

Added more structure and comments here generally.


pkg/storage/replica_application.go, line 60 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Consider making this a method to reduce the footprint of the applicationStateBuf.

Done.


pkg/storage/replica_application.go, line 67 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: s/lResult/localResult/

Done.


pkg/storage/replica_application.go, line 81 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: move the *storagepb.ReplicatedEvalResult up because that's the more important argument.

Done.


pkg/storage/replica_application.go, line 94 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I would have expected this to be (!usingStateAppliedKey || r.State.UsingAppliedStateKey). That's the condition where we upgrade to using the applied state key, right?

Also, it would almost certainly be ok to consider all commands during the applied state key upgrade non-trivial if that simplifies the code. I think that would allow us to get rid of the usingStateAppliedKey arg.

I added a comment. This logic does consider the upgrade to be non-trivial. It just considers not upgrading to be trivial.


pkg/storage/replica_application.go, line 102 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

It seems like this could be merged with initialize to simplify the usage of applicationState.

Done.


pkg/storage/replica_application.go, line 149 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This sentence seemed to end abruptly.

It does indeed. Fixed.


pkg/storage/replica_application.go, line 154 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Why are we allowing multiple Raft log truncations in a single batch instead of making them non-trivial commands?

Why not? Their side-effects should coalesce nicely; the handling just takes the latest one. I was mostly motivated by this comment:

		// Raft log truncation is too frequent to justify a replica state
		// assertion.

https://github.com/cockroachdb/cockroach/blob/master/pkg/storage/replica_proposal.go#L593-L594

We did discuss this offline and I'm open to making this change but will at least defer it to the next revision.


pkg/storage/replica_application.go, line 185 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This function seems to belong right above handleComplexReplicatedEvalResult.

Done.


pkg/storage/replica_application.go, line 190 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

It doesn't trigger an assertion anymore.

It does trigger an assert, just not at the end of the method. Updated.


pkg/storage/replica_application.go, line 237 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Reminder to address this TODO.

Done.


pkg/storage/replica_application.go, line 242 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Might as well leave a note about why we don't need to sync here.

Done.


pkg/storage/replica_application.go, line 296 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: you can assign *r.mu.state.Stats to a variable and perform the subtraction outside of the lock.

Done.


pkg/storage/replica_application.go, line 356 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Did you consider passing rResult as a pointer here, letting handleComplexReplicatedEvalResult modify it, and then performing the rResult.Equal(storagepb.ReplicatedEvalResult{}) for both trivial and non-trivial commands in the same place?

I did, there is code downstream of this which assumes that rResult hasn't been modified. I could re-work it but for now this seems easier. If after the next pass we think it's worth the change I can do it. Namely in finishRaftCommand there are some assertions and then also when looking to decide whether we aborted the config change.


pkg/storage/replica_application.go, line 403 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: this function drives the rest of these functions, so I'd move it up above applyBatch. In general, I think this file would be easiest to read if everything was in the order that it is called during entry application, mod type definitions and methods (putEntryApplicationBatch and up), which make sense where they are.

I've reordered the function to make this file flow better, see how you feel about it now.


pkg/storage/replica_application.go, line 403 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Consider adding the term "replicated state machine" in here somewhere. This is fundamentally the entry point for which all updates to the state machine are made.

Done.


pkg/storage/replica_application.go, line 418 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Replica's

Done.


pkg/storage/replica_application.go, line 419 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Add to the comment about the stats and refreshReason arguments.

Done.


pkg/storage/replica_application.go, line 438 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

s/putEntryApplicationBatch/releaseEntryApplicationBatch/

Done.


pkg/storage/replica_application.go, line 441 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This comment is confusing. What do you mean by "unpush" it? What does "committed entries to be that one next" mean?

No longer applies.


pkg/storage/replica_application.go, line 451 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

"pushBack" doesn't seem like the right name for this method because it's not pushing anything. Something like allocateBack

Done.


pkg/storage/replica_application.go, line 485 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I think part of the reason why this code gets tricky is because we're defining a non-trivial command as one that both a) relies on in-memory state matching durable state and b) performs side-effects that all later commands need to observe during their application. In other words, we're defining a non-trivial command as one that needs to be applied in its own batch without any other entries before or after it.

I'm wondering if this is overly restrictive. My intuition would be that we could probably get away with having non-trivial commands be part of a batch as long as they are the last command in the batch. This construction would dramatically simplify the code structure here because it would turn the pushing and unpushing process into a simple process of: 1. batch as much as possible, 2: flush after each non-trivial command is added to the batch, 3. flush again at the end. It would also allow us to think of each batch as having n "trivial side effects" and one "complex side effect", which is easy to understand and to structure in the code.

This would require reworking "part a" of the definition of a non-trivial command, but I suspect that it's not quite right anyway. The non-trivial commands don't require in-memory state to match durable state before they apply, just after when performing their side effects, right?

Done.


pkg/storage/replica_application.go, line 518 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: This isn't performing an allocation, so there's no need to move it outside of the critical section if doing so hurts readability. I'd suggest doing something simpler like:

	retreiveLocalProposals := func() {
		var it applicationStateBufIterator
		r.mu.Lock()
		defer r.mu.Unlock()
		b.replicaState = r.mu.state
		b.replicaState.Stats = &b.stats
		*b.replicaState.Stats = *r.mu.state.Stats

Done.


pkg/storage/replica_application.go, line 537 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

processBatch and applyBatch sound basically the same. Consider s/processBatch/stageBatch/.

I just inlined the closure as it wasn't providing a lot of additional clarity.


pkg/storage/replica_application.go, line 552 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
and it is checked whether the request's key span is contained in the Replica's

This part isn't true because we pass roachpb.RSpan{} to requestCanProceed.

It's a stale comment, but updated nevertheless.


pkg/storage/replica_application.go, line 558 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
violations of the LeaseAppliedIndex (in which the caller tries again)

The caller tries again? Don't we do that below Raft with tryReproposeWithNewLeaseIndex?

Stale comment, updated.


pkg/storage/replica_application.go, line 560 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
should be applied

That's done by this function. This gives the impression that it's the caller's responsibility.

Updated.


pkg/storage/replica_application.go, line 944 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

stray line

Done.


pkg/storage/replica_application.go, line 1047 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
provides is performed

The wording is off here.

Done.


pkg/storage/replica_application.go, line 1173 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This is a strange return value. Why doesn't the caller use the applicationState that it already has in scope?

This is old logic where this returned straight up to the loop in handleRaftReady. I've now hauled the check up.


pkg/storage/replica_application.go, line 1181 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

s/applyRaftCommand/applyRaftCommandToBatch/

Also, the ordering of functions/methods comment applies here as well.

Done.


pkg/storage/replica_application.go, line 1235 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

What did we determine about this? Should we do it at all or is it a pessimization?

Seems like it's a draw.

name            old ops/s   new ops/s   delta
KV0-throughput  32.7k ± 3%  32.7k ± 2%   ~     (p=1.000 n=4+4)

name            old ms/s    new ms/s    delta
KV0-P50          5.95 ± 3%   5.95 ± 3%   ~     (p=1.000 n=4+4)
KV0-Avg          3.77 ± 3%   3.82 ± 2%   ~     (p=0.629 n=4+4)

I did try just sharing a single batch.Distinct() after implementing the ApplyBatchRepr method on the distinctBatch because it seemed easier to try it than to read the code to figure out if anybody does any reads below here. It showed a win but failed a server test, I figured I could go explore that later. Note that the one with a shared Distinct() is old in the benchmark.

name            old ops/s   new ops/s   delta
KV0-throughput  33.5k ± 1%  32.7k ± 1%  -2.41%  (p=0.029 n=4+4)

name            old ms/s    new ms/s    delta
KV0-P50          5.80 ± 0%   5.95 ± 3%    ~     (p=0.143 n=4+4)
KV0-Avg          3.70 ± 0%   3.80 ± 0%  +2.70%  (p=0.029 n=4+4)

pkg/storage/replica_raft.go, line 683 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Can you double check that this isn't causing stats or refreshReason to escape to the heap?

./replica_raft.go:687:78: (*Replica).handleRaftReadyRaftMuLocked &stats does not escape

but weirdly enough

./replica_raft.go:687:86: &refreshReason escapes to heap
./replica_raft.go:446:2: moved to heap: refreshReason

Fixed.


pkg/storage/track_raft_protos.go, line 37 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

s/processRaftFunc/stageRaftCommand/

Done.

@ajwerner ajwerner force-pushed the ajwerner/batch-command-application branch 2 times, most recently from 880122d to b65d72c Compare July 6, 2019 21:15
Member

@nvanbenschoten nvanbenschoten left a comment

Reviewed 4 of 7 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)


pkg/storage/entry_application_state_buf.go, line 90 at r2 (raw file):

}

// entryApplicationStateRingBuf is a ring-buffer of entryApplicationState.

The ring-buffer structure here isn't necessary anymore. Should we keep it at the cost of added complexity? I think we can get rid of the entire entryApplicationStateRingBuf type if we want.


pkg/storage/replica_application.go, line 154 at r1 (raw file):

Previously, ajwerner wrote…

Why not? Their side-effect should coalesce nicely and is handled to just take latest one. I was mostly motivated by this comment:

		// Raft log truncation is too frequent to justify a replica state
		// assertion.

https://github.com/cockroachdb/cockroach/blob/master/pkg/storage/replica_proposal.go#L593-L594

We did discuss this offline and I'm open to making this change but will at least defer it to the next revision.

We discussed this offline. The conclusion was that unlike all of the other side-effects, this one modifies the write batch directly so it doesn't fit into the trivial/non-trivial split. I wonder whether that's worth changing to simplify this code structure. If not, we should add a comment explaining why this is different than all of the other non-trivial side effects.


pkg/storage/replica_application.go, line 356 at r1 (raw file):

Previously, ajwerner wrote…

I did, there is code downstream of this which assumes that rResult hasn't been modified. I could re-work it but for now this seems easier. If after the next pass we think it's worth the change I can do it. Namely in finishRaftCommand there are some assertions and then also when looking to decide whether we aborted the config change.

Yeah I'd like to come back to this, but it can wait until everything else settles down.


pkg/storage/replica_application.go, line 485 at r1 (raw file):

Previously, ajwerner wrote…

Done.

Much better, thanks.


pkg/storage/replica_application.go, line 1047 at r1 (raw file):

Previously, ajwerner wrote…

Done.

`prepareLocalResult` is missing a space


pkg/storage/replica_application.go, line 1173 at r1 (raw file):

Previously, ajwerner wrote…

This is old logic where this returned straight up to the loop in handleRaftReady. I've now hauled the check up.

Looks like we're still returning this though.


pkg/storage/replica_application.go, line 1235 at r1 (raw file):

I did try just sharing a single batch.Distinct() after implementing the ApplyBatchRepr method on the distinctBatch because it seemed easier to try it than to read the code to figure out if anybody does any reads below here. It showed a win but failed a server test, I figured I could go explore that later.

That might be because some non-trivial entries need to read from the current batch. It seems worthwhile to explore this at some point.


pkg/storage/replica_application.go, line 53 at r2 (raw file):

// The pseudocode looks like:
//
//   while entries:

Thanks for this, it's great!


pkg/storage/replica_application.go, line 80 at r2 (raw file):

	ctx context.Context,
	committedEntries []raftpb.Entry,
	stats *handleRaftReadyStats,

I just noticed that this stats object only contains information about entry application at the moment. How do you feel about lifting it into here entirely? That prevents us from adding non-application related stats, so maybe that's not ideal. If not, how do you feel about creating a handleCommittedEntriesStats object and returning it from here?


pkg/storage/replica_application.go, line 310 at r2 (raw file):

		// values) here. If the key range we are ingesting into isn't empty,
		// we're not using AddSSTable but a plain WriteBatch.
		if s.rResult().AddSSTable != nil {

Do you mind verifying that each of these pre-apply effects still works correctly now that we allow trivial entries to precede a non-trivial entry? I think the only change we'll need to make is to adjust preDestroyRaftMuLocked to use the current batch instead of tmpBatch, but it would be nice to verify all of them.


pkg/storage/replica_application.go, line 718 at r2 (raw file):

// not modify rResult in order to give the TestingPostApplyFilter testing knob
// an opportunity to inspect the command's ReplicatedEvalResult.
func applyTrivialReplicatedEvalResult(

Should this be stageTrivialReplicatedEvalResult because it's not modifying the real replicated state?


pkg/storage/replica_application.go, line 741 at r2 (raw file):

Upon success

or error


pkg/storage/replica_application.go, line 780 at r2 (raw file):

	// Now that the batch is committed we can go about applying the side effects
	// of the update to the truncated state. Note that this is safe only if the
	// new truncated state is durably on disk (i.e.) synced.

Consider linking to the new issue you just filed.


pkg/storage/replica_application.go, line 826 at r2 (raw file):

		s.rResult().RaftLogDelta = 0
	}
	applyRaftLogDelta(-1 * truncationDelta)

This is subtle. Are you favoring this over marking each entry's RaftLogDelta as 0 like we currently do? Could you add a comment explaining what's going on here?


pkg/storage/replica_application.go, line 977 at r2 (raw file):

	// in-memory and on-disk ReplicaStates. If any of these actions are present,
	// we want to assert that these two states do not diverge.
	shouldAssert = !rResult.Equal(storagepb.ReplicatedEvalResult{})

Is this ever false now?


pkg/storage/replica_application.go, line 1238 at r2 (raw file):

}

func (s *entryApplicationState) rResult() *storagepb.ReplicatedEvalResult {

nit: we swapped from lResult to localResult but then added this rResult method. We should keep them symmetrical. lResult and rResult are ok, although I find that I never remember what they mean when skimming this code. I think changing this method to replicatedResult would be the best solution.


pkg/storage/replica_application.go, line 1246 at r2 (raw file):

}

// decode initialized the entryApplicationState with ctx and e and then

initializes


pkg/storage/replica_application.go, line 1250 at r2 (raw file):

func (s *entryApplicationState) decode(
	ctx context.Context, e *raftpb.Entry,
) (expl string, err error) {

nit: consider s/expl/errExpl/ here and below.


pkg/storage/replica_application.go, line 1342 at r2 (raw file):

	ctx context.Context, toProcess []raftpb.Entry,
) (sawEmptyEntry bool, remaining []raftpb.Entry, expl string, err error) {
	for i := range toProcess {

nit: mutating the toProcess slice directly would make this a little cleaner. Something like:

for len(toProcess) > 0 {
    e := &toProcess[0]
    toProcess = toProcess[1:]
    ...
    if !isTrivial(...) {
        break
    }
}
return toProcess

pkg/storage/replica_application.go, line 1363 at r2 (raw file):

which implies that it can be applied in a batch with other trivial commands

This is a little off now. So is the "A result is fundamentally considered "trivial" if it does ..." sentence,


pkg/storage/replica_application.go, line 1392 at r2 (raw file):

	b.batch = nil
	b.updatedTruncatedState = false
	// TODO(ajwerner): could clear the replicaState and stats but it's not needed

The replicaState contains a number of nested pointers, so it's probably worth clearing that out.

Member

@tbg tbg left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)


pkg/storage/replica_application.go, line 537 at r1 (raw file):

Previously, ajwerner wrote…

I just inlined the closure as it wasn't providing a lot of additional clarity.

s/retreive/retrieve/

@ajwerner ajwerner force-pushed the ajwerner/batch-command-application branch from b65d72c to 715fd6d Compare July 8, 2019 19:44
Contributor Author

@ajwerner ajwerner left a comment

Thanks for the review. This is getting there. The biggest outstanding thing is the preDestroyRaftMuLocked comment.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/storage/entry_application_state_buf.go, line 90 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

The ring-buffer structure here isn't necessary anymore. Should we keep it at the cost of added complexity? I think we can get rid of the entire entryApplicationStateRingBuf type if we want.

Done.


pkg/storage/replica_application.go, line 154 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

We discussed this offline. The conclusion was that unlike all of the other side-effects, this one modifies the write batch directly so it doesn't fit into the trivial/non-trivial split. I wonder whether that's worth changing to simplify this code structure. If not, we should add a comment explaining why this is different than all of the other non-trivial side effects.

Another point is that truncated state side-effects don't have a trigger that needs to view the replica's in-memory state. The fact that the command touches ReplicaState doesn't make it non-trivial per se. All commands update the MVCCStats and lease applied index.

Anyway, added more commentary and a TODO


pkg/storage/replica_application.go, line 356 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Yeah I'd like to come back to this, but it can wait until everything else settles down.

Okay, I've cleaned this up a tad more.


pkg/storage/replica_application.go, line 1047 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

`prepareLocalResult` is missing a space

Done.


pkg/storage/replica_application.go, line 1173 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Looks like we're still returning this though.

Done.


pkg/storage/replica_application.go, line 1235 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I did try just sharing a single batch.Distinct() after implementing the ApplyBatchRepr method on the distinctBatch because it seemed easier to try it than to read the code to figure out if anybody does any reads below here. It showed a win but failed a server test, I figured I could go explore that later.

That might be because some non-trivial entries need to read from the current batch. It seems worthwhile to explore this at some point.

Ack.


pkg/storage/replica_application.go, line 80 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I just noticed that this stats object only contains information about entry application at the moment. How do you feel about lifting it into here entirely? That prevents us from adding non-application related stats, so maybe that's not ideal. If not, how do you feel about creating a handleCommittedEntriesStats object and returning it from here?

Done.


pkg/storage/replica_application.go, line 288 at r2 (raw file):

		}

		// Update the node clock with the serviced request. This maintains

This can get coalesced per batch right? I've added a TODO


pkg/storage/replica_application.go, line 310 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do you mind verifying that each of these pre-apply effects still works correctly now that we allow trivial entries to precede a non-trivial entry? I think the only change we'll need to make is to adjust preDestroyRaftMuLocked to use the current batch instead of tmpBatch, but it would be nice to verify all of them.

This is actually tricky for the preDestroyRaftMuLocked case. The old logic applies the current entry's writeBatch to the tmpBatch. When this is called we haven't yet written the current entry's writeBatch to the batch. In order to give an equivalent view we'd need to create another tmpBatch and write the current batch's repr into it and then write this writeBatch's repr into it. That sounds rather expensive. We may want to rework this entirely.

As for split's use of tmpBatch, I think it's safe and added a comment.

The other checks I believe are safe because they only access fields on the replica updated by non-trivial commands.


pkg/storage/replica_application.go, line 718 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should this be stageTrivialReplicatedEvalResult because it's not modifying the real replicated state?

Done. Also fixed some grammar.


pkg/storage/replica_application.go, line 741 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Upon success

or error

Done.


pkg/storage/replica_application.go, line 780 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Consider linking to the new issue you just filed.

Done.


pkg/storage/replica_application.go, line 826 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This is subtle. Are you favoring this over marking each entry's RaftLogDelta as 0 like we currently do? Could you add a comment explaining what's going on here?

I'm not sure I totally follow what you mean w/ regards to marking each entry's RaftLogDelta as 0. I believe that logic remains the same. This last call here is about making sure that the change in the raft log size due to sideloaded commands is properly reflected. One source of confusion was the name truncationDelta which is actually the bytes removed by truncating the sideloaded entries. I've renamed the variable for clarity.

Do you have more concerns?


pkg/storage/replica_application.go, line 977 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is this ever false now?

No, I've made this an assertion.


pkg/storage/replica_application.go, line 1238 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: we swapped from lResult to localResult but then added this rResult method. We should keep them symmetrical. lResult and rResult are ok, although I find that I never remember what they mean when skimming this code. I think changing this method to replicatedResult would be the best solution.

Done.


pkg/storage/replica_application.go, line 1246 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

initializes

Done.


pkg/storage/replica_application.go, line 1250 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: consider s/expl/errExpl/ here and below.

Done.


pkg/storage/replica_application.go, line 1342 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: mutating the toProcess slice directly would make this a little cleaner. Something like:

for len(toProcess) > 0 {
    e := &toProcess[0]
    toProcess = toProcess[1:]
    ...
    if !isTrivial(...) {
        break
    }
}
return toProcess

Done.


pkg/storage/replica_application.go, line 1363 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
which implies that it can be applied in a batch with other trivial commands

This is a little off now. So is the "A result is fundamentally considered "trivial" if it does ..." sentence,

Updated somewhat, though I'm not sure I see how the "A result is fundamentally considered 'trivial' if it does ..." sentence is off?


pkg/storage/replica_application.go, line 1392 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

The replicaState contains a number of nested pointers, so it's probably worth clearing that out.

Done.

@ajwerner ajwerner force-pushed the ajwerner/batch-command-application branch from 715fd6d to 27cf2b4 Compare July 8, 2019 19:49
Contributor Author

@ajwerner ajwerner left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/storage/replica_application.go, line 537 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

s/retreive/retrieve/

Done.

Member

@nvanbenschoten nvanbenschoten left a comment

Reviewed 6 of 6 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)


pkg/storage/entry_application_state_buf.go, line 34 at r3 (raw file):

}

// entryApplicationStateRingBuf is a ring-buffer of entryApplicationState.

This comment is wrong.


pkg/storage/replica_application.go, line 288 at r2 (raw file):

Previously, ajwerner wrote…

This can get coalesced per batch right? I've added a TODO

Yes, although I think it needs to happen before we ack requests, so the early acknowledgment change will need to be careful about this.


pkg/storage/replica_application.go, line 310 at r2 (raw file):

We may want to rework this entirely.

What did you have in mind? I wonder if we could clear the batch completely.


pkg/storage/replica_application.go, line 826 at r2 (raw file):

Previously, ajwerner wrote…

I'm not sure I totally follow what you mean w/ regards to marking each entry's RaftLogDelta as 0. I believe that logic remains the same. This last call here is about making sure that the change in the raft log size due to sideloaded commands is properly reflected. One source of confusion was the name truncationDelta which is actually the bytes removed by truncating the sideloaded entries. I've renamed the variable for clarity.

Do you have more concerns?

Oh, I see. Ok that makes sense.


pkg/storage/replica_application.go, line 1250 at r2 (raw file):

Previously, ajwerner wrote…

Done.

Not for this change, but we could clean some of this (errExpl, expl) stuff up with an error wrapper type and the new errors.As style.

https://github.com/cockroachdb/errors#providing-pii-free-details discusses a errors.WithSafeDetails function, which might be exactly what we want.

I'd leave a TODO.


pkg/storage/replica_application.go, line 1363 at r2 (raw file):

Previously, ajwerner wrote…

Updated somewhat though I'm not sure I see how theA result is fundamentally considered "trivial" if it does ... sentence is off?

It looks good now.

@ajwerner ajwerner force-pushed the ajwerner/batch-command-application branch from 27cf2b4 to 3b6194e Compare July 8, 2019 21:57
Contributor Author

@ajwerner ajwerner left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/storage/entry_application_state_buf.go, line 34 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This comment is wrong.

Fixed. Also just inlined all the Node methods into the buf itself. I don't think they added much now that the data structure is so much simpler.


pkg/storage/replica_application.go, line 288 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Yes, although I think it needs to happen before we ack requests, so the early acknowledgment change will need to be careful about this.

Sure, it can be done immediately after decoding a batch.


pkg/storage/replica_application.go, line 310 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

We may want to rework this entirely.

What did you have in mind? I wonder if we could clear the batch completely.

This code seems to only read from the batch to determine if it's going to delete enough keys to warrant a range deletion tombstone (https://sourcegraph.com/github.com/cockroachdb/cockroach/-/blob/pkg/storage/replica_raftstorage.go#L665). I was going to suggest we make this heuristic more explicit at this level and see if we can find some other indication, perhaps combining the existing heuristic with an MVCCStats delta.

On second thought it seems like it's probably okay to base the heuristic on the state of the range prior to this write batch (your initial suggestion) and add more comments. It's not clear to me how important it is to also observe the writes of this batch to the range. Thoughts?


pkg/storage/replica_application.go, line 1250 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Not for this change, but we could clean some of this (errExp, expl) stuff up with an error wrapper type and the new errors.As style.

https://github.com/cockroachdb/errors#providing-pii-free-details discusses a errors.WithSafeDetails function, which might be exactly what we want.

I'd leave a TODO.

Done.

@ajwerner ajwerner force-pushed the ajwerner/batch-command-application branch from 3b6194e to 3db9579 Compare July 8, 2019 22:32
Contributor Author

@ajwerner ajwerner left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/storage/replica_application.go, line 310 at r2 (raw file):

Previously, ajwerner wrote…

This code seems to only read from the batch to determine if it's going to delete enough keys to warrant a range deletion tombstone (https://sourcegraph.com/github.com/cockroachdb/cockroach/-/blob/pkg/storage/replica_raftstorage.go#L665). I was going to suggest we make this heuristic more explicit at this level and see if we can find some other indication, perhaps combining the existing heuristic with an MVCCStats delta.

On second thought it seems like it's probably okay to base the heuristic on the state of the range prior to this write batch (your initial suggestion) and add more comments. It's not clear to me how important it is to also observe the writes of this batch to the range. Thoughts?

Discussed offline. These state updates need to be atomic with the batch but they don't need to necessarily be part of the writeBatch that comes with the command. I moved the application of the entry's writeBatch up to precede these preapply calls. It all seems to work and avoids the inefficiency of the tmpBatch.

Member

@nvanbenschoten nvanbenschoten left a comment

:lgtm_strong:

Reviewed 1 of 2 files at r4.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)

@tbg tbg self-requested a review July 9, 2019 15:45
@tbg
Member

tbg commented Jul 9, 2019

Just chatted with @nvanbenschoten and I'll also give this a full review. Won't get to it until tomorrow though.

@ajwerner
Contributor Author

ajwerner commented Jul 9, 2019

Just chatted with @nvanbenschoten and I'll also give this a full review. Won't get to it until tomorrow though.

No rush and thanks!

@ajwerner ajwerner force-pushed the ajwerner/batch-command-application branch from 3db9579 to d4c6d91 Compare July 9, 2019 16:38
Member

@tbg tbg left a comment

Phew, that was a chunk to get through but here I am.

The way you're structuring the batching is very well thought out and there's lots of good commentary that I'm sure wasn't there before. That is to say, even if this hadn't moved the needle on perf at all, it would have been worth it.

There's some code movement in this PR and perhaps reviewing would've been easier had you factored that out into a separate commit, but I know that this sort of thing constitutes a herculean feat in itself after a certain point, and not having the movement split out forced me to review more of the existing code in detail (again), which is good.

Still, obviously the change is quite complex and it's hard for me to come away from the PR feeling comfortable that everything is chipper. My most salient suggestion among all of the comments is to be more conservative w.r.t. what to batch: right now it's "stop the batch when you see something nontrivial"; I'd prefer moving to a "nontrivial commands get their own batch" model. I think I also found a place (AddSSTable) in which the former is incorrect, but even if that turns out not to be the case, we just shouldn't be batching anything that doesn't move the perf needle, so that a) we get easier semantics and b) the mistakes we will certainly make in the future will be less likely to actually manifest in incorrect behavior. I'd also like to see some testing for b) that strikes when we add side effects without updating the black/whitelists/the test.

A fair share of the comments I left in this PR don't need to be addressed in the PR itself, so whatever you feel like leaving out or punting (including to someone else), that's fine by me.

Reviewed 1 of 7 files at r2, 1 of 6 files at r3, 1 of 2 files at r4, 4 of 4 files at r5.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner)


pkg/storage/entry_application_state_buf.go, line 43 at r5 (raw file):

first


pkg/storage/entry_application_state_buf.go, line 71 at r5 (raw file):

		entryApplicationStateBufNodeSyncPool.Put(oldHead)
	}
	*buf.head = entryApplicationStateBufNode{}

I'm surprised that you keep one node around and manually remove it in destroy(); that feels icky. If there's a reason, add a comment; otherwise simplify.


pkg/storage/entry_application_state_buf.go, line 85 at r5 (raw file):

	*buf = entryApplicationStateBuf{}
}

Add brief comments on all the things from here on down, and in particular on the return value of init.


pkg/storage/entry_application_state_buf.go, line 91 at r5 (raw file):

type entryApplicationStateBufIterator struct {
	idx    int32

Comment on idx and offset. Isn't offset just idx % entryApplicationStateBufNodeSize? If so consider simplifying, it'll be easier to understand with just one counter.


pkg/storage/replica_application.go, line 38 at r5 (raw file):

type handleCommittedEntriesStats struct {
	processed int

Is this commands or batches? We should be able to tell from these stats whether there's a frequent command that is breaking up these batches (i.e. isn't isTrivial()). The caller already knows how many commands there were (len(committedEntries) so we don't necessarily need that), but the number of batches is probably interesting, along with the number of nontrivial commands (which is mostly known from the number of batches, but not completely). At a high level it's worth keeping both because we want to be able to see batches/sec as well as whether there's a significant source of "batch breaker" commands.


pkg/storage/replica_application.go, line 71 at r5 (raw file):

stages:


pkg/storage/replica_application.go, line 77 at r5 (raw file):

// reading the current replica state from underneath the Replica.mu and
// determining whether any of the commands were proposed locally. Next each
// command is written to the engine.Batch and have the "trivial" component of

s/have/has/


pkg/storage/replica_application.go, line 78 at r5 (raw file):

// determining whether any of the commands were proposed locally. Next each
// command is written to the engine.Batch and have the "trivial" component of
// their ReplicatedEvalResult applied to the batch's view of the ReplicaState.

s/their/its/


pkg/storage/replica_application.go, line 79 at r5 (raw file):

, its side effects on the Replica's state are applied, and the clients acked with the respective results.


pkg/storage/replica_application.go, line 80 at r5 (raw file):

// their ReplicatedEvalResult applied to the batch's view of the ReplicaState.
// Finally the batch is written to the storage engine and its side effects on
// the Replica's state are applied.

Good job on the comment!


pkg/storage/replica_application.go, line 80 at r5 (raw file):

// their ReplicatedEvalResult applied to the batch's view of the ReplicaState.
// Finally the batch is written to the storage engine and its side effects on
// the Replica's state are applied.

Mention that errors from this method are fatal. (They still are, right?)


pkg/storage/replica_application.go, line 85 at r5 (raw file):

) (
	stats handleCommittedEntriesStats,
	refreshReason refreshRaftReason,

At this point it's not clear to me why this receives a refreshReason, and even less clear why it returns a new one. Can you say something about this in a comment? There's probably a canonical situation in which this is necessary and it'll be helpful to prime readers for it.

I'm seeing below that this is only relevant when we see an empty entry, so how do you feel about just returning a bool or even just adding a numEmptyEntries field in the stats, so that the refresh reason mangling can live at the caller. I feel that this would remove an unnecessary distraction from this method.


pkg/storage/replica_application.go, line 90 at r5 (raw file):

) {
	// TODO(ajwerner): replace errExpl with proper error composition using the new
	// errors library.

😍


pkg/storage/replica_application.go, line 125 at r5 (raw file):

			}
		}
		if err != nil {

Handle this right after err is populated, pretty sure that's only unintentionally not the case right now.


pkg/storage/replica_application.go, line 133 at r5 (raw file):

		// engine.Batch and update b's view of the replicaState.
		var it entryApplicationStateBufIterator
		for ok := it.init(&b.appStates); ok; ok = it.next() {

When is ok initially false?


pkg/storage/replica_application.go, line 134 at r5 (raw file):

		var it entryApplicationStateBufIterator
		for ok := it.init(&b.appStates); ok; ok = it.next() {
			s := it.state()

`state` is pretty opaque, do you think we can come up with something better? This is basically a "context" for applying an entry. Maybe `entryApplicationState` could become `applicationContext` (or even just `application`) and the local variable here could be `app`. I don't know how I feel about the type renames I'm suggesting, maybe renaming `s` to `app` is enough here. Or even just `cmd`, since you're passing it to `stageRaftCommand`. The reason not to go with that would be if `cmd` is too ambiguous, but I quite like it. Or `appCmd` to make it clear that we're applying?


pkg/storage/replica_application.go, line 135 at r5 (raw file):

		for ok := it.init(&b.appStates); ok; ok = it.next() {
			s := it.state()
			r.stageRaftCommand(s.ctx, s, b.batch, &b.replicaState, it.isLast())

it.isLast() /* writeAppliedState */


pkg/storage/replica_application.go, line 136 at r5 (raw file):

			s := it.state()
			r.stageRaftCommand(s.ctx, s, b.batch, &b.replicaState, it.isLast())
			updatedTruncatedState := stageTrivialReplicatedEvalResult(s.ctx,

Put some words about why this is handled here, looks sort of one-offy and there's a good reason I'm sure.


pkg/storage/replica_application.go, line 155 at r5 (raw file):

	*b.replicaState.Stats = *r.mu.state.Stats
	var it entryApplicationStateBufIterator
	for ok := it.init(&b.appStates); ok; ok = it.next() {

I was initially wondering why you did this iteration twice, but you're doing it because this is under the replica mutex.


pkg/storage/replica_application.go, line 165 at r5 (raw file):

			// the same is true for quotaReleaseQueues. Only queue the proposal's
			// quota for release if the proposalQuota is initialized.
			if r.mu.proposalQuota != nil {

Probably not worth it, but you can hoist this nil check out of the loop.


pkg/storage/replica_application.go, line 181 at r5 (raw file):

// state machine, and lastly the GCThreshold is validated. If any of the checks
// fail, the proposal's content is wiped and we apply an empty log entry
// instead, returning an error to the caller to handle. The two typical cases

No error is returned to the caller; it will go to whoever is waiting on the proposal if proposed locally, and is discarded otherwise.


pkg/storage/replica_application.go, line 190 at r5 (raw file):

The standard way in which this is triggered

This doesn't add much, consider something like

in this method via a side effect (in the proposal's ReplicatedEvalResult or, for local proposals, LocalEvalResult).


pkg/storage/replica_application.go, line 268 at r5 (raw file):

			s.forcedErr = roachpb.NewError(err)
		} else if splitMergeUnlock != nil {
			// Close over raftCmd to capture its value at execution time; we clear

This comment needs a touch up.


pkg/storage/replica_application.go, line 300 at r5 (raw file):

		// a timestamp specified are guaranteed one higher than any op already
		// executed for overlapping keys.
		// TODO(ajwerner): coalesce the clock update per batch.

👍


pkg/storage/replica_application.go, line 337 at r5 (raw file):

		// the SSTable. Not doing so could result in order reversal (and missing
		// values) here. If the key range we are ingesting into isn't empty,
		// we're not using AddSSTable but a plain WriteBatch.

This isn't true any more, see

// *If* the keys in the SST do not conflict with keys currently in this range,
// then adding the stats for this SST to the range stats should yield the
// correct overall stats.
//
// *However*, if the keys in this range *do* overlap with keys already in this
// range, then adding the SST semantically *replaces*, rather than adds, those
// keys, and the net effect on the stats is not so simple.


pkg/storage/replica_application.go, line 382 at r5 (raw file):

}

func checkForcedErr(

Am I reviewing this or is this an exact move with adjustments only?


pkg/storage/replica_application.go, line 529 at r5 (raw file):

(which must be considered fatal)


pkg/storage/replica_application.go, line 537 at r5 (raw file):

	replicatedResult storagepb.ReplicatedEvalResult,
	raftAppliedIndex, leaseAppliedIndex uint64,
	writeBatch *storagepb.WriteBatch,

Maybe not for this PR, but passing a pointer here doesn't make much sense. WriteBatch contains only a []byte, so we can just pass it by value and then these writeBatch != nil checks just turn into len(writeBatch.Data) > 0. Or just pass a []byte in, that makes even more sense probably.


pkg/storage/replica_application.go, line 549 at r5 (raw file):

			log.Errorf(ctx, "unable to read header of committed WriteBatch: %+v", err)
		} else {
			r.writeStats.recordCount(float64(mutationCount), 0 /* nodeID */)

All of these contend on a mutex (only in the write path), so it might be worth hoisting this out too.


pkg/storage/replica_application.go, line 578 at r5 (raw file):

	haveTruncatedState := replicatedResult.State != nil && replicatedResult.State.TruncatedState != nil

	if writeBatch != nil {

In particular, if we ever got a WriteBatch{} here, we'd fail because (I think) ApplyBatchRepr expects at least the magic header in the WriteBatch, so this check is better written as len(writeBatch.Data) > 0 once writeBatch itself is a value.


pkg/storage/replica_application.go, line 665 at r5 (raw file):

	}

	// TODO(peter): We did not close the writer in an earlier version of

This TODO is firmly in the "never gonna happen" department. Remove it.


pkg/storage/replica_application.go, line 698 at r5 (raw file):

	elapsed := timeutil.Since(start)
	r.store.metrics.RaftCommandCommitLatency.RecordValue(elapsed.Nanoseconds())

Rename to RaftCommandApplyLatency. I've been wondering what this number is more than once. (Hmm, this will rename the time series, but honestly it's still a good idea to do the rename)


pkg/storage/replica_application.go, line 708 at r5 (raw file):

// not modify replicatedResult in order to give the TestingPostApplyFilter testing knob
// an opportunity to inspect the command's ReplicatedEvalResult.
func stageTrivialReplicatedEvalResult(

To me this method gives more ammunition to the argument that isTrivial should be a whitelist. Let's put this and isTrivial next to each other so that they're more obviously in sync, too.

Between those there's enough to pull out yet another file, I think. replica_application_result.go? This file has ~1.4k LOC and the side effects pull out sort of cleanly.


pkg/storage/replica_application.go, line 722 at r5 (raw file):

		replicaState.LeaseAppliedIndex = leaseAppliedIndex
	}
	haveState := replicatedResult.State != nil

Is there any real benefit to including this in the "trivial" results? What's in here that happens frequently enough to matter? (I generally think we ought to batch only the stuff that makes us fast, not any bit more than that).


pkg/storage/replica_application.go, line 775 at r5 (raw file):

		truncState := b.replicaState.TruncatedState
		r.store.raftEntryCache.Clear(r.RangeID, truncState.Index+1)
		// Truncate the sideloaded storage.  This is true at the time of writing but unfortunately

That second sentence is ... unexpected. Also, doesn't seem like the extra level of indentation below is needed.


pkg/storage/replica_application.go, line 810 at r5 (raw file):

	// Finally apply the truncation delta.
	// Store the queuing conditions
	var it entryApplicationStateBufIterator

I'd rewrite this section as

combinedDelta := -sideloadTruncationDelta
for /* ... */ {
  combinedDelta += s.replicatedResult().RaftLogDelta
  s.replicatedResult().RaftLogDelta =  0
}
if combinedDelta != 0 {
  /* inline the closure */
}

You were careful here because you didn't want mixing between the two contributions? That should be OK. If we trust our log size (and only then is it used for truncation decisions) the clamping oughtn't change the outcome.


pkg/storage/replica_application.go, line 830 at r5 (raw file):

	r.store.metrics.addMVCCStats(deltaStats)
	// NB: the bootstrap store has a nil split queue.
	// TODO(tbg): the above is probably a lie now.

I really should've just removed this and not made a lame TODO out in my name here and below. Alas.


pkg/storage/replica_application.go, line 850 at r5 (raw file):

		s := it.state()
		replicatedResult := s.replicatedResult()
		for _, sc := range replicatedResult.SuggestedCompactions {

You could collect the suggestions in the RaftLogDelta iteration below. Not sure that's any better. It definitely isn't any clearer.


pkg/storage/replica_application.go, line 866 at r5 (raw file):

		// Note that this must happen after committing (the engine.Batch), but
		// before notifying a potentially waiting client.
		clearTrivialReplicatedEvalResultFields(s.replicatedResult(), b.replicaState.UsingAppliedStateKey)

This method should also sit next to isTrivial and the other one I commented about (stageTrivialReplicatedEvalResult)?


pkg/storage/replica_application.go, line 877 at r5 (raw file):

		// Some tests (TestRangeStatsInit) assumes that once the store has started
		// and the first range has a lease that there will not be a later hard-state.
		if isNonTrivialEntry {

If we move to a more conservative batching model, we may want to have another whitelist on what to assert. Though maybe this is OK. We should keep track of how many assertStateLocked calls we have (on stats).


pkg/storage/replica_application.go, line 916 at r5 (raw file):

result in


pkg/storage/replica_application.go, line 917 at r5 (raw file):

// ReplicaState. This function is called after a batch has been written to the
// storage engine. For trivial commands this function should result is a zero
// value replicatedResult.

nit: *replicaResult.

Check this in race builds at the caller somewhere.


pkg/storage/replica_application.go, line 1045 at r5 (raw file):

			// processing the removal from the queue looks up the Range at the
			// lease holder, being too early here turns this into a no-op).
			// Lock ordering dictates that we don't hold any mutexes when adding,

Comment is stale, the task is now built into the queues.


pkg/storage/replica_application.go, line 1187 at r5 (raw file):

}

// entryApplicationState stores the state required to apply a single raft

Move this and its receivers to replica_application_state.go?


pkg/storage/replica_application.go, line 1288 at r5 (raw file):

}

// entryApplicationBatch accumulates state due to the application of raft

Ditto replica_application_batch.go


pkg/storage/replica_application.go, line 1370 at r5 (raw file):

// side effects which rely on the written state of the replica exactly matching
// the in-memory state of the replica at the corresponding log position.
// Non-trivial commands must be the last entry in a batch so that after

From what I'm seeing below steady-state traffic is always trivial, and I assume that'll stay this way no matter what we add, so it won't matter for performance and we should lower the cognitive burden as much as possible. Can we change it so that !isTrivial() implies "alone in batch", not just "last in batch"? It'll be nice to have the guarantee that no mixing whatsoever will occur.

For example, look at AddSSTable. We add the SST to the LSM before applying the command (we have to, the other way around doesn't work at all). But it means that the previous cmds in the batch shouldn't touch anything that the SST also touches, or the SST will lose precedence: the SST will be shadowed by writes in earlier commands. That's incorrect. On the other hand, we know inside of the AddSSTable command itself that we won't overlap with the WriteBatch of the command itself, because we know to not throw random stuff into the WriteBatch in cmd_add_sstable.go.

Also, what's the reason for using a blacklist? I'd much rather deal with commands accidentally not getting batched as opposed to adding something new, forgetting to add it here, and batching it in with unknown results. But either way, there needs to be a test (maybe you wrote it already) that fails whenever a new field is added, and forces an update to the test as a sign-off on awareness of isTrivial.


pkg/storage/replica_application.go, line 1395 at r5 (raw file):

}

func (b *entryApplicationBatch) reset() {

Can you write this as

appStates := b.appStates
appStates.clear()
*b = entryApplicationBatch{appStates: appStates}

so I don't have to be nervous that a field isn't cleared now or after any future change?


pkg/storage/entry_application_state_buf_test.go, line 20 at r5 (raw file):

)

// TestApplicationStateBuf is an overly simplistic test of the

👀 does that imply that you're replacing this with a less overly simplistic one?

Copy link
Member

@tbg tbg left a comment


Oh, and I meant to say: great work! Delighted with how this is all shaping up.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner)

@ajwerner
Copy link
Contributor Author

Still, obviously the change is quite complex and it's hard for me to come away from the PR feeling comfortable that everything is chipper. My most salient suggestion among all of the comments is that of being more conservative w.r.t what to batch: right now it's "stop batch when you see something nontrivial", I'd prefer moving to a "nontrivial gets their own batch" model. I think I also found a place (AddSSTable) in which the former is incorrect, but even if that were the case, we just shouldn't be batching anything that doesn't move the perf needle so that a) we get easier semantics and b) the mistakes we will certainly make in the future will be less likely to actually manifest in incorrect behavior, though I'd also like to see some testing for b) that strikes when we add side effects without updating the black/whitelists/the test.

:) should have stopped and let you have a say before I went on polishing this. The initial version of this PR did exactly this, the buffer had a mechanism to store an extra entry as a sort of "lookahead".

From Nathan:

I'm wondering if this is overly restrictive. My intuition would be that we could probably get away with having non-trivial commands be part of a batch as long as they are the last command in the batch. This construction would dramatically simplify the code structure here because it would turn the pushing and unpushing process into a simple process of: 1. batch as much as possible, 2. flush after each non-trivial command is added to the batch, 3. flush again at the end. It would also allow us to think of each batch as having n "trivial side effects" and one "complex side effect", which is easy to understand and to structure in the code.

I'm happy to go back to the previous approach but let's reach consensus before I make the change.

@tbg
Copy link
Member

tbg commented Jul 10, 2019

Heh, oops, sorry about that. @nvanbenschoten what do you think? I never saw the version of the code I suggested, you have, so see what you think about the safety argument. The AddSSTable thing would've probably slipped in had I not spent some time thinking about it, and maybe it would've never mattered in practice, but it just seems scary since we'll be using it more and more on "liver" data as time goes by.

@nvanbenschoten
Copy link
Member

The AddSSTable example you brought up does seem bad - nice catch. I don't think it would ever cause any issues in practice because above-Raft latches would protect us from ever mixing any entries that aren't allowed to commute, but it's concerning that entries would have the potential to no longer be atomic.

I think I'm ok with moving back to the "non-trivial entries must live alone" style, but I hope we don't regress to the lookahead mechanism we had before, along with the highly-specialized entryApplicationStateBuf interface to accommodate it. The goal with that approach was to avoid decoding entries multiple times, which is a nice property. What made it a little clunky is that we were trying to emplace into the entryApplicationStateBuf while decoding and could only determine whether the newly decoded entry was trivial after this. We'd then try to ignore the last entry in the buffer until coming back around the loop the next time.

I think we can do better by turning decoding into a two-stage process. We could first decode into a temporary entryApplicationState that's not part of the entryApplicationStateBuf. We then check whether this is trivial. If it is, we push it into the buffer and continue decoding. If not, we process the buffer as is, without the new entry. We then begin the loop by checking whether this entry is empty and pushing it into the buffer if not.

Getting the current style to work also allowed us to get rid of the `tmpBatch`es used in split and merge pre-apply hooks. I hope we can keep that change.

@ajwerner
Copy link
Contributor Author

Getting the current style to work also allowed us to get rid of the `tmpBatch`es used in split and merge pre-apply hooks. I hope we can keep that change.

I think that this change is orthogonal. It raises a question about why the tmpBatch is there in the first place if we could avoid it by just applying the write batch first.

I think I'm ok with moving back to the "non-trivial entries must live alone" style, but I hope we don't regress to the lookahead mechanism we had before

Reasonable, let me see what I can do.

@ajwerner ajwerner force-pushed the ajwerner/batch-command-application branch from d4c6d91 to 7716922 Compare July 10, 2019 22:55
Copy link
Contributor Author

@ajwerner ajwerner left a comment


Thanks for the review! I suspect there's at least one more round left to deal with my renaming choices and the triviality of log truncation. Maybe also some cleanup on decodeBuf and decodedRaftEntry.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


pkg/storage/entry_application_state_buf.go, line 43 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

first

Done.


pkg/storage/entry_application_state_buf.go, line 71 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I'm surprised that you keep one node around and manually remove it in destroy(), that feels icky. If there's a reason add a comment, otherwise simplify.

Done. This was a vestige from when the buf had a lookahead feature.


pkg/storage/entry_application_state_buf.go, line 85 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Add brief comments on all the things from here on down, and in particular on the return value of init.

Done.


pkg/storage/entry_application_state_buf.go, line 91 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Comment on idx and offset. Isn't offset just idx % entryApplicationStateBufNodeSize? If so consider simplifying, it'll be easier to understand with just one counter.

Done. A lot of this code got simpler over time, the nodes used to be ring buffers.


pkg/storage/replica_application.go, line 38 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Is this commands or batches? We should be able to tell from these stats whether there's a frequent command that is breaking up these batches (i.e. isn't isTrivial()). The caller already knows how many commands there were (len(committedEntries) so we don't necessarily need that), but the number of batches is probably interesting, along with the number of nontrivial commands (which is mostly known from the number of batches, but not completely). At a high level it's worth keeping both because we want to be able to see batches/sec as well as whether there's a significant source of "batch breaker" commands.

Added more fields.


pkg/storage/replica_application.go, line 71 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

stages:

Done.


pkg/storage/replica_application.go, line 77 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

s/have/has/

Done.


pkg/storage/replica_application.go, line 78 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

s/their/its/

Done.


pkg/storage/replica_application.go, line 79 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

, its side effects on the Replica's state is applied, and the clients acked with the respective results.

Done.


pkg/storage/replica_application.go, line 80 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Good job on the comment!

Thanks!


pkg/storage/replica_application.go, line 80 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Mention that errors from this method are fatal. (They still are, right)

Done.


pkg/storage/replica_application.go, line 85 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

At this point it's neither clear to me why this receives a refreshReason but even less why it returns a new one. Can you say something about this in a comment? There's probably a canonical situation in which this is necessary and it'll be helpful to prime readers for it.

I'm seeing below that this is only relevant when we see an empty entry, so how do you feel about just returning a bool or even just adding a numEmptyEntries field in the stats, so that the refresh reason mangling can live at the caller. I feel that this would remove an unnecessary distraction from this method.

Done.


pkg/storage/replica_application.go, line 125 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Handle this right after err is populated, pretty sure that's only unintentionally not the case right now.

Done.


pkg/storage/replica_application.go, line 133 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

When is ok initially false?

If the batch were empty. I commented the method. It probably is a distraction as the batch shouldn't be empty.


pkg/storage/replica_application.go, line 134 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

state is pretty opaque, do you think we can come up with something better? This is basically a "context for applying an entry". Maybe `entryApplicationState` could become `applicationContext` (or even just `application`) and the local variable here could be `app`. I don't know how I feel about the type renames I'm suggesting, maybe renaming `s` to `app` is enough here. Or even just `cmd`, since you're passing it to `stageRaftCommand`. The reason not to go with that would be if `cmd` is too ambiguous, but I quite like it. Or `appCmd` to make it clear that we're applying?

I did a big rename, `entryApplicationState.*` -> `cmdAppCtx.*` and `state()` -> `cur()`


pkg/storage/replica_application.go, line 135 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

it.isLast() /* writeAppliedState */

Done.


pkg/storage/replica_application.go, line 136 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Put some words about why this is handled here, looks sort of one-offy and there's a good reason I'm sure.

Added some, could add more.


pkg/storage/replica_application.go, line 155 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I was initially wondering why you did this iteration twice, but you're doing it because this is under the replica mutex.

Ack.


pkg/storage/replica_application.go, line 165 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Probably not worth it, but you can hoist this nil check out of the loop.

Done.


pkg/storage/replica_application.go, line 181 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

No error is returned to the caller; it will go to whoever is waiting on the proposal if proposed locally, and is discarded otherwise.

Done.


pkg/storage/replica_application.go, line 190 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…
The standard way in which this is triggered

This doesn't add much, consider something like

in this method via a side effect (in the proposal's ReplicatedEvalResult or, for local proposals, LocalEvalResult).

Done.


pkg/storage/replica_application.go, line 268 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This comment needs a touch up.

Done.


pkg/storage/replica_application.go, line 337 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This isn't true any more, see

// *If* the keys in the SST do not conflict with keys currently in this range,
// then adding the stats for this SST to the range stats should yield the
// correct overall stats.
//
// *However*, if the keys in this range *do* overlap with keys already in this
// range, then adding the SST semantically *replaces*, rather than adds, those
// keys, and the net effect on the stats is not so simple.

Updated and added an NB to try to state your concern with other entries in the batch preceding an AddSSTable.


pkg/storage/replica_application.go, line 382 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Am I reviewing this or is this an exact move with adjustments only?

It used to refer to Replica.mu.state and now it takes a ReplicaState argument, otherwise it's unchanged.


pkg/storage/replica_application.go, line 529 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

(which must be considered fatal)

s/is likely/is/


pkg/storage/replica_application.go, line 537 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Maybe not for this PR, but passing a pointer here doesn't make much sense. WriteBatch contains only a []byte, so we can just pass it by value and then these writeBatch != nil checks just turn into len(writeBatch.Data) > 0. Or just pass a []byte in, that makes even more sense probably.

I reworked this method to just take a cmdAppCtx which simplifies some things


pkg/storage/replica_application.go, line 549 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

All of these contend on a mutex (only in the write path), so it might be worth hoisting this out too.

Done.


pkg/storage/replica_application.go, line 578 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

In particular, if we ever got a WriteBatch{} here, we'd fail because (I think) ApplyBatchRepr expects at least the magic header in the WriteBatch, so this check is better written as len(writeBatch.Data) > 0 once writeBatch itself is a value.

I hear you but generally didn't find pulling writeBatchData out to be clearer now that cmd is here.


pkg/storage/replica_application.go, line 665 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This TODO is firmly in the "never gonna happen" department. Remove it.

Done.


pkg/storage/replica_application.go, line 698 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Rename to RaftCommandApplyLatency. I've been wondering what this number is more than once. (Hmm, this will rename the time series, but honestly it's still a good idea to do the rename)

That's tricky given that this chart is in the admin UI and we need to support mixed-version clusters. This is a weird metric that isn't going to be as useful now, so maybe we should change the admin UI to not have this chart and then rename the metric in 20.1? Added a TODO.


pkg/storage/replica_application.go, line 708 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

To me this method gives more ammunition to the argument that isTrivial should be a whitelist. Let's put this and isTrivial next to each other so that they're more obviously in sync, too.

Between those there's enough to pull out yet another file, I think. replica_application_result.go? This file has ~1.4k LOC and the side effects pull out sort of cleanly.

Done. I moved those functions and also carved out the portion of applyBatch (now applyCmdAppBatch) that related to dealing with the final result of a command. See how it makes you feel.


pkg/storage/replica_application.go, line 722 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Is there any real benefit to including this in the "trivial" results? What's in here that happens frequently enough to matter? (I generally think we ought to batch only the stuff that makes us fast, not any bit more than that).

Only truncated state. Nathan and I also discussed making truncated state commands non-trivial and I'm still happy to do it. If we want to do it I'll do it on the next pass.


pkg/storage/replica_application.go, line 775 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

That second sentence is ... unexpected. Also, doesn't seem like the extra level of indentation below is needed.

Ugh, I hoisted most of that comment above but didn't clean up this trailing sentence.


pkg/storage/replica_application.go, line 810 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I'd rewrite this section as

combinedDelta := -sideloadTruncationDelta
for /* ... */ {
  combinedDelta += s.replicatedResult().RaftLogDelta
  s.replicatedResult().RaftLogDelta =  0
}
if combinedDelta != 0 {
  /* inline the closure */
}

You were careful here because you didn't want mixing between the two contributions? That should be OK. If we trust our log size (and only then is it used for truncation decisions) the clamping oughtn't change the outcome.

Done.


pkg/storage/replica_application.go, line 850 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

You could collect the suggestions in the RaftLogDelta iteration below. Not sure that's any better. It definitely isn't any clearer.

I moved it to the loop below instead.


pkg/storage/replica_application.go, line 866 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This method should also sit next to isTrivial and the other one I commented about (stageTrivialReplicatedEvalResult)?

Done.


pkg/storage/replica_application.go, line 877 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

If we move to a more conservative batching model, we may want to have another whitelist on what to assert. Though maybe this is OK. We should keep track of how many assertStateLocked calls we have (on stats).

Ack.


pkg/storage/replica_application.go, line 916 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

result in

Done.


pkg/storage/replica_application.go, line 917 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

nit: *replicaResult.

Check this in race builds at the caller somewhere.

huh?


pkg/storage/replica_application.go, line 1045 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Comment is stale, the task is now built into the queues.

Done.


pkg/storage/replica_application.go, line 1187 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Move this and its receivers to replica_application_state.go?

It's now in cmd_app_ctx.go however I feel less amazing about that filename.


pkg/storage/replica_application.go, line 1288 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Ditto replica_application_batch.go

Done.


pkg/storage/replica_application.go, line 1370 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

From what I'm seeing below steady-state traffic is always trivial, and I assume that'll stay this way no matter what we add, so it won't matter for performance and we should lower the cognitive burden as much as possible. Can we change it so that !isTrivial() implies "alone in batch", not just "last in batch"? It'll be nice to have the guarantee that no mixing whatsoever will occur.

For example, look at AddSSTable. We add the SST to the LSM before applying the command (we have to, the other way around doesn't work at all). But it means that the previous cmds in the batch shouldn't touch anything that the SST also touches, or the SST will lose precedence: the SST will be shadowed by writes in earlier commands. That's incorrect. On the other hand, we know inside of the AddSSTable command itself that we won't overlap with the WriteBatch of the command itself, because we know to not throw random stuff into the WriteBatch in cmd_add_sstable.go.

Also, what's the reason for using a blacklist? I'd much rather deal with commands accidentally not getting batched as opposed to adding something new, forgetting to add it here, and batching it in with unknown results. But either way, there needs to be a test (maybe you wrote it already) that fails whenever a new field is added, and forces an update to the test as a sign-off on awareness of isTrivial.

Done.


pkg/storage/replica_application.go, line 1395 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Can you write this as

appStates := b.appStates
appStates.clear()
*b = entryApplicationBatch{appStates: appStates}

so I don't have to be nervous that a field isn't cleared now or after any future change?

Simplified but I need to hold on to the new decodeBuf.


pkg/storage/entry_application_state_buf_test.go, line 20 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

👀 does that imply that you're replacing this with a less overly simplistic one?

When I started typing it the data structure was more complicated than it ultimately needed to be. This test, small as it is, covers 100% of entry_application_state_buf.go. I added some more comments to make it seem less silly.

@ajwerner ajwerner force-pushed the ajwerner/batch-command-application branch 2 times, most recently from 380c745 to a335ffb Compare July 10, 2019 23:16
Copy link
Contributor Author

@ajwerner ajwerner left a comment


Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


pkg/storage/replica_application.go, line 38 at r5 (raw file):

Previously, ajwerner wrote…

Added more fields.

Though I didn't do anything with them. Probably I should add more metrics. Added a TODO.


pkg/storage/replica_application.go, line 633 at r6 (raw file):

```go
			cmd.replicatedResult().State.TruncatedState = nil
			cmd.replicatedResult().RaftLogDelta = 0
			// TODO(ajwerner): consider moving this code.
```

@tbg I'm curious to hear how you feel about this here. It's the only place we really reach down and muck with the Replica. If we hoisted this and then passed the raftMu.stateLoader this wouldn't even need to be a method on Replica.

@ajwerner ajwerner force-pushed the ajwerner/batch-command-application branch from a335ffb to da52af6 Compare July 11, 2019 02:22
Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm_strong:

Reviewed 5 of 10 files at r6, 4 of 4 files at r7.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @ajwerner and @tbg)


pkg/storage/cmd_app_batch.go, line 83 at r7 (raw file):

toProcess


pkg/storage/cmd_app_batch.go, line 89 at r7 (raw file):

In that case, the returned decodedCmdRequiresFlush is true to indicate ...


pkg/storage/cmd_app_batch.go, line 105 at r7 (raw file):

```go
	for len(toProcess) > 0 {
		e := &toProcess[0]
		// If we haven't already decoded this entry then we decode it into the
```

Why "If we haven't already"? We're always decoding into the scratch space here.


pkg/storage/cmd_app_ctx.go, line 70 at r7 (raw file):

```go
	idKey   storagebase.CmdIDKey
	raftCmd storagepb.RaftCommand
	cc      raftpb.ConfChange // only populated for conf changes
```

nit: cc and ccCtx aren't trivial structs, might be worth having them be pointers since they're barely ever populated. It probably isn't worth it.


pkg/storage/cmd_app_ctx_buf.go, line 38 at r7 (raw file):

```go
}

var entryApplicationStateBufNodeSyncPool = sync.Pool{
```

Rename


pkg/storage/replica_application.go, line 722 at r5 (raw file):

Previously, ajwerner wrote…

Only truncated state. Nathan and I also discussed making truncated state commands non-trivial and I'm still happy to do it. If we want to do it I'll do it on the next pass.

Raft truncation is pretty frequent, I'd worry that that'll backfire somehow. Mild preference to keep it trivial then. Maybe we'll manage to restructure the log truncation so that it's less invasive, too (i.e. not synced through a Raft command in the first place).


pkg/storage/replica_application.go, line 633 at r6 (raw file):

Previously, ajwerner wrote…

@tbg I'm curious to hear how you feel about this here. It's the only place we really reach down and muck with the Replica. If we hoisted this and then passed the raftMu.stateLoader this wouldn't even need to be a method on Replica.

I like that, but more as a follow-up than eagerly - this PR should land and it's becoming quite difficult to trace the correctness of these smaller changes through the diffs.


pkg/storage/replica_application.go, line 98 at r7 (raw file):

```go
) (stats handleCommittedEntriesStats, errExpl string, err error) {
	var (
		decodeBuf        decodedRaftEntry
```

This escapes, right?


pkg/storage/replica_application.go, line 104 at r7 (raw file):

// The batch is always empty at the beginning of this loop. In each iteration, we either process a chunk of trivial entries or an individual nontrivial entry. The haveDecodedEntry branch always corresponds to a non-trivial entry, while the else branch processes either a single nontrivial entry or any number of trivial entries.

This might get a little more streamlined as far as the explanation goes if instead of handling a nontrivial command that is encountered first in the else branch eagerly, you defer it via haveDecodedEntry. Then you could say

The haveDecodedEntry branch corresponds to a nontrivial entry, while the else branch populates b with any number (possibly zero) of trivial entries.

I think that's worth doing, but I'll defer to your judgment.
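The loop shape being discussed can be sketched standalone: batch consecutive trivial entries into one commit, and give every non-trivial entry a commit of its own. The names here are illustrative, not the real replica_application.go code:

```go
package main

import "fmt"

type entry struct {
	index   int
	trivial bool
}

// applyAll groups runs of trivial entries into a single commit and
// commits each non-trivial entry alone; it returns the commit batches
// (slices of entry indexes) in application order.
func applyAll(entries []entry) [][]int {
	var commits [][]int
	var batch []int
	flush := func() {
		if len(batch) > 0 {
			commits = append(commits, batch)
			batch = nil
		}
	}
	for _, e := range entries {
		if e.trivial {
			batch = append(batch, e.index)
			continue
		}
		flush()                                   // pending trivial entries commit first
		commits = append(commits, []int{e.index}) // non-trivial entry commits alone
	}
	flush()
	return commits
}

func main() {
	es := []entry{{1, true}, {2, true}, {3, false}, {4, true}}
	fmt.Println(applyAll(es)) // [[1 2] [3] [4]]
}
```

Ordering is preserved: the non-trivial entry at index 3 forces the pending trivial batch to commit before it, matching the "commit alone" invariant.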


pkg/storage/replica_application.go, line 141 at r7 (raw file):

retrieve


pkg/storage/replica_application.go, line 181 at r7 (raw file):

```go
// proposal's content is wiped and we apply an empty log entry instead. If an
// error occurs and the command was proposed locally, it will be communicated to
// a waiting proposer. The two typical cases errors occur are lease mismatch (in
```

Grammar off


pkg/storage/replica_application.go, line 575 at r7 (raw file):

```go
	//
	// TODO(ajwerner): explore the costs and benefits of this use of Distinct().
	// This call requires flushing the batch's writes but
```

the suspense!


pkg/storage/replica_application.go, line 637 at r7 (raw file):

```go
		cmd.replicatedResult().State.TruncatedState != nil
	if haveTruncatedState {
		apply, err := handleTruncatedStateBelowRaft(ctx, replicaState.TruncatedState,
```

Hmm, now I'm not sure this batches correctly. handleTruncatedStateBelowRaft loads from the engine, which now fails to reflect any prior truncations in the same batch. This is all probably fine since we're throwing away all but the last truncation (and don't do anything with the information in the ones we throw away prior to tossing them). Add some commentary.


pkg/storage/replica_application_result.go, line 92 at r7 (raw file):

```go
	truncatedStateUpdated = haveState && replicatedResult.State.TruncatedState != nil
	if truncatedStateUpdated {
		replicaState.TruncatedState = replicatedResult.State.TruncatedState
```

This technically means that when multiple truncations are batched together, we'll just use the last one, right? I think this is OK since the actual log deletion is a side effect of whichever truncated state survives. Just making sure I understand this correctly.


pkg/storage/replica_application_result.go, line 131 at r7 (raw file):

```go
			replicatedResult.State.UsingAppliedStateKey = false
		}
		// ReplicaState.Stats was previously non-nullable which caused nodes to
```

Is this still current? Sounds like something we can just remove at this point (fine as a follow up).


pkg/storage/replica_application_result.go, line 150 at r7 (raw file):

```go
// applied to the Replica's in-memory state. This method deals with applying
// non-trivial side effects, notifying waiting clients, releasing latches,
// stepping config changes, and asserting hard state as required.
```

nit: technically we're not stepping config changes. You could say "informing Raft about applied config changes"


pkg/storage/store.go, line 3613 at r7 (raw file):

```go
	// remote replicas will likely start campaigning.
	if elapsed >= defaultReplicaRaftMuWarnThreshold {
		log.Warningf(ctx, "handle raft ready: %.1fs [processed=%d]",
```

print some other numbers here, it'll be interesting.

@ajwerner ajwerner force-pushed the ajwerner/batch-command-application branch from da52af6 to 6cbe344 Compare July 11, 2019 20:40
Copy link
Contributor Author

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR! @tbg can you give the change for the truncated state a glance before I bors this.

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @tbg)


pkg/storage/cmd_app_batch.go, line 83 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

toProcess

Done.


pkg/storage/cmd_app_batch.go, line 89 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

In that case, the returned decodedCmdRequiresFlush is true to indicate ...

Cleaned up the logic as suggested.


pkg/storage/cmd_app_batch.go, line 105 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Why "If we haven't already"? We're always decoding into the scratch space here.

Stale.


pkg/storage/cmd_app_ctx.go, line 70 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

nit: cc and ccCtx aren't trivial structs, might be worth having them be pointers since they're barely ever populated. It probably isn't worth it.

Done.


pkg/storage/cmd_app_ctx_buf.go, line 38 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Rename

Done.


pkg/storage/replica_application.go, line 722 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Raft truncation is pretty frequent, I'd worry that that'll backfire somehow. Mild preference to keep it trivial then. Maybe we'll manage to restructure the log truncation so that it's less invasive, too (i.e. not synced through a Raft command in the first place).

Ack.


pkg/storage/replica_application.go, line 98 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This escapes, right?

Indeed it did. Moved to the appCtxBatch struct.


pkg/storage/replica_application.go, line 104 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

// The batch is always empty at the beginning of this loop. In each iteration, we either process a chunk of trivial entries or an individual nontrivial entry. The haveDecodedEntry branch always corresponds to a non-trivial entry, while the else branch processes either a single nontrivial entry or any number of trivial entries.

This might get a little more streamlined as far as the explanation goes if instead of handling a nontrivial command that is encountered first in the else branch eagerly, you defer it via haveDecodedEntry. Then you could say

The haveDecodedEntry branch corresponds to a nontrivial entry, while the else branch populates b with any number (possibly zero) of trivial entries.

I think that's worth doing, but I'll defer to your judgment.

Done.


pkg/storage/replica_application.go, line 141 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

retrieve

I have a serious problem with these diphthong vowel orderings.


pkg/storage/replica_application.go, line 181 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Grammar off

I'm not totally sure about the grammar qualms. I tried to tighten up the grammar, but my changes in this area were mostly just speculation.


pkg/storage/replica_application.go, line 575 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

the suspense!

Done.


pkg/storage/replica_application.go, line 637 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Hmm, now I'm not sure this batches correctly. handleTruncatedStateBelowRaft loads from the engine, which now fails to reflect any prior truncations in the same batch. This is all probably fine since we're throwing away all but the last truncation (and don't do anything with the information in the ones we throw away prior to tossing them). Add some commentary.

If we don't use the Distinct() batch it will be safe, right? I changed it to use the batch rather than the Distinct() writer. I'd like your approval before I merge this.


pkg/storage/replica_application_result.go, line 92 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This technically means that when multiple truncations are batched together, we'll just use the last one, right? I think this is OK since the actual log deletion is a side effect of whichever truncated state survives. Just making sure I understand this correctly.

Correct, comment added.


pkg/storage/replica_application_result.go, line 131 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Is this still current? Sounds like something we can just remove at this point (fine as a follow up).

Added a TODO.


pkg/storage/store.go, line 3613 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

print some other numbers here, it'll be interesting.

Done.

@ajwerner ajwerner force-pushed the ajwerner/batch-command-application branch from 6cbe344 to b7eff72 Compare July 11, 2019 22:22
Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

bors it like it's hot

Reviewed 7 of 7 files at r8.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @ajwerner)


pkg/storage/cmd_app_batch.go, line 90 at r8 (raw file):

```go
// commands must always be in their own batch. If a non-trivial command is
// encountered the batch is returned immediately without adding the newly
// decoded command to the batch of removing it from remaining.
```

s/of/or/


pkg/storage/cmd_app_ctx_buf.go, line 38 at r7 (raw file):

Previously, ajwerner wrote…

Done.

Ping


pkg/storage/replica_application.go, line 917 at r5 (raw file):

Previously, ajwerner wrote…

huh?

Wish I could tell you.


pkg/storage/replica_application.go, line 181 at r7 (raw file):

Previously, ajwerner wrote…

I'm not totally sure about the grammar qualms. I tried to tighten up the grammar but most of my changes in this area but my changes were mostly just speculation.

Yeah the missing "in which" threw me off, my brain doesn't think that that's valid (regardless of whether it is), thanks for fixing.


pkg/storage/replica_application.go, line 637 at r7 (raw file):

Previously, ajwerner wrote…

If we don't use the Distinct() batch it will be safe, right? I changed it to use the batch rather than the Distinct() writer. I'd like your approval before I merge I merge this.

Yes, that checks out.
Can you rename distinctEng in handleTruncatedStateBelowRaft to just batch?

I don't know if there's a penalty for using the batch after having used Distinct(), but we probably do that already anyway. And for truncations, the right thing in the long run is to avoid having them triggered by Raft commands, so this particular use we're hoping to remove at some point.

Also, lol at the semantics of Distinct(). I remember now that I wasted time on this in the past.


pkg/storage/replica_application.go, line 111 at r8 (raw file):

into

This commit batches raft command application where possible. The basic approach
is to notice that many commands only "trivially" update the replica state
machine. Trivial commands can be processed in a single batch by acting on a
copy of the replica state. Non-trivial commands share the same logic but always
commit alone as they for one reason or another rely on having a view of the
replica or storage engine as of a specific log index.

This commit also sneaks in another optimization which batching enables. Each
command mutates a portion of replica state called the applied state which
tracks a combination of the log index which has been applied and the MVCC stats
of the range as of that application. Before this change each entry would update
this applied state and each of those writes will end up in the WAL and
mem-table just to be compacted away in L0. Now that commands are being applied
to the storage engine in a single batch it is easy to only update the applied
state for the last entry in the batch.

For sequential writes this patch shows a considerable performance win. The
below benchmark was run on a 3-node c5d.4xlarge (16 vCPU) cluster with
concurrency 128.

```
name            old ops/s   new ops/s   delta
KV0-throughput  22.1k ± 1%  32.8k ± 1%  +48.59%  (p=0.029 n=4+4)

name            old ms/s    new ms/s    delta
KV0-P50          7.15 ± 2%   6.00 ± 0%  -16.08%  (p=0.029 n=4+4)
KV0-Avg          5.80 ± 0%   3.80 ± 0%  -34.48%  (p=0.029 n=4+4)
```

Due to the re-organization of logic in the change, the Replica.mu does not need
to be acquired as many times during the application of a batch. In the common
case it is now acquired exactly twice in the process of applying a batch
whereas before it was acquired more than twice per entry. This should hopefully
improve performance on large machines which experience mutex contention for a
single range.

This effect is visible on large machines. Below are results from running
a normal KV0 workload on c5d.18xlarge machines (72 vCPU machines) with
concurrency 1024 and 16 initial splits.

```
name            old ops/s   new ops/s    delta
KV0-throughput  78.1k ± 1%  116.8k ± 5%  +49.42%  (p=0.029 n=4+4)

name            old ms/s    new ms/s     delta
KV0-P50          24.4 ± 3%    19.7 ± 7%  -19.28%  (p=0.029 n=4+4)
KV0-Avg          12.6 ± 0%     7.5 ± 9%  -40.87%  (p=0.029 n=4+4)
```

Fixes cockroachdb#37426.

Release note (performance improvement): Batch raft entry application and
coalesce writes to applied state for the batch.
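The applied-state coalescing the commit message describes — write every entry's data but persist the applied-state key only once per batch — can be sketched with a toy write batch. The types and names here are illustrative stand-ins, not the real storage code:

```go
package main

import "fmt"

// appliedState is an illustrative stand-in for the real applied-state
// value (Raft applied index plus MVCC stats).
type appliedState struct{ RaftAppliedIndex uint64 }

// writeBatch counts writes to stand in for a storage engine batch.
type writeBatch struct{ writes int }

func (b *writeBatch) put(string) { b.writes++ }

// applyEntries stages each entry's data writes but persists the
// applied-state key once, for the final entry, rather than per entry.
func applyEntries(b *writeBatch, states []appliedState) {
	for range states {
		b.put("data-key") // per-entry data writes still happen
	}
	if n := len(states); n > 0 {
		_ = states[n-1]        // only the last index/stats survive
		b.put("applied-state") // one coalesced write instead of n
	}
}

func main() {
	var b writeBatch
	applyEntries(&b, []appliedState{{5}, {6}, {7}})
	fmt.Println(b.writes) // 4 = 3 data writes + 1 applied-state write
}
```

Before this change the toy batch would have done six writes (one applied-state write per entry), each destined for the WAL and mem-table only to be compacted away.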
@ajwerner ajwerner force-pushed the ajwerner/batch-command-application branch from b7eff72 to e4ce717 Compare July 12, 2019 13:26
@ajwerner
Copy link
Contributor Author

TFTR!

bors r=nvanbenschoten,tbg

craig bot pushed a commit that referenced this pull request Jul 12, 2019
38568: storage: batch command application and coalesce applied state per batch  r=nvanbenschoten,tbg a=ajwerner

This commit batches raft command application where possible. The basic
approach is to notice that many commands only "trivially" update the
replica state machine. Trivial commands can be processed in a single
batch by acting on a copy of the replica state. Non-trivial commands
share the same logic but always commit alone as they for one reason or
another rely on having a view of the replica or storage engine as of a
specific log index.

This commit also sneaks in another optimization which batching enables.
Each command mutates a portion of replica state called the applied state
which tracks a combination of the log index which has been applied and the
MVCC stats of the range as of that application. Before this change each
entry would update this applied state and each of those writes will end up
in the WAL and mem-table just to be compacted away in L0. Now that
commands are being applied to the storage engine in a single batch it is easy
to only update the applied state for the last entry in the batch.

For sequential writes this patch shows a considerable performance win. The below
benchmark was run on a 3-node c5d.4xlarge (16 vCPU) cluster with concurrency 128.

```
name            old ops/s   new ops/s   delta
KV0-throughput  22.1k ± 1%  32.8k ± 1%  +48.59%  (p=0.029 n=4+4)

name            old ms/s    new ms/s    delta
KV0-P50          7.15 ± 2%   6.00 ± 0%  -16.08%  (p=0.029 n=4+4)
KV0-Avg          5.80 ± 0%   3.80 ± 0%  -34.48%  (p=0.029 n=4+4)
```

Due to the re-organization of logic in the change, the Replica.mu does not need
to be acquired as many times during the application of a batch. In the common
case it is now acquired exactly twice in the process of applying a batch whereas
before it was acquired more than twice per entry. This should hopefully improve
performance on large machines which experience mutex contention for a single
range.

These optimizations combine for a big win on large machines. Below are results 
from running a normal KV0 workload on c5d.18xlarge machines (72 vCPU machines) with
concurrency 1024 and 16 initial splits.

```
name            old ops/s   new ops/s    delta
KV0-throughput  78.1k ± 1%  116.8k ± 5%  +49.42%  (p=0.029 n=4+4)

name            old ms/s    new ms/s     delta
KV0-P50          24.4 ± 3%    19.7 ± 7%  -19.28%  (p=0.029 n=4+4)
KV0-Avg          12.6 ± 0%     7.5 ± 9%  -40.87%  (p=0.029 n=4+4)
```

Fixes #37426.

Release note (performance improvement): Batch raft entry application and
coalesce writes to applied state for the batch.

Co-authored-by: Andrew Werner <[email protected]>
@craig
Copy link
Contributor

craig bot commented Jul 12, 2019

Build succeeded

Development

Successfully merging this pull request may close these issues.

storage: batch Raft entry application across ready struct
4 participants