Merge #32437

32437: storage: truncate aggressively only after 4mb of logs r=nvanbenschoten,petermattis a=tbg

cc @nvanbenschoten. I'm going to run some kv95 experiments in which I vary the 64kb threshold in both directions to see if there are any effects on performance in doing so.

----

Whenever the "max raft log size" is exceeded, log truncations become more aggressive in that they aim at the quorum commit index, potentially cutting off followers (which then need Raft snapshots).

The effective threshold log size is 4mb for replicas larger than 4mb and the replica size otherwise. This latter case can be problematic since replicas can be persistently small despite having steady log progress (for example, range 4 receives node status updates which are large inline puts). If in such a range a follower falls behind just slightly, it'll need a snapshot. This isn't in itself the biggest deal since the snapshot is fairly rare (the required log entries are usually already in transit to the follower) and would be small, but it's not ideal.

Always use a 4mb threshold instead. Note that we also truncate the log to the minimum replicated index if the log size is above 64kb. This is similarly aggressive but respects followers (until they fall behind by 4mb or more).

My expectation is that this will not functionally change anything. It might leave behind a little bit more Raft log on quiescent ranges, but I think the solution here is performing "one last truncation" for ranges that are quiescent to make sure they shed the remainder of their Raft log.

Touches #32046.

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
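The two truncation modes described in the message can be modeled roughly as below. This is an illustrative sketch, not the code in raft_log_queue.go: `chooseTruncationIndex`, `staleSize`, and `truncationThreshold` are hypothetical names, the 64kb and 4mb values are simply the ones quoted above, and the entry-count trigger mentioned in the diff comment is left out.

```go
package main

import "fmt"

// Hypothetical constants; the values are the ones quoted in the commit message.
const (
	staleSize           = 64 << 10 // 64kb: above this, truncate cooperatively
	truncationThreshold = 4 << 20  // 4mb: above this, truncate aggressively
)

// chooseTruncationIndex is a simplified model of the decision. It returns the
// index to truncate to and whether any truncation should happen at all.
//   - Above 4mb the target is the quorum commit index, which may cut off
//     lagging followers (they then need a Raft snapshot).
//   - Above 64kb the target is the minimum replicated index, which keeps
//     every follower able to catch up from the log.
func chooseTruncationIndex(logSize int64, minReplicatedIndex, quorumIndex uint64) (uint64, bool) {
	switch {
	case logSize > truncationThreshold:
		return quorumIndex, true
	case logSize > staleSize:
		return minReplicatedIndex, true
	default:
		return 0, false
	}
}

func main() {
	// 100kb of log: cooperative truncation, the slowest follower is respected.
	idx, ok := chooseTruncationIndex(100<<10, 40, 50)
	fmt.Println(idx, ok)

	// 5mb of log: aggressive truncation at the quorum commit index.
	idx, ok = chooseTruncationIndex(5<<20, 40, 50)
	fmt.Println(idx, ok)
}
```

In this model a follower at index 40 is only cut off once the log has grown past 4mb, which is the behavior the message argues for.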
cockroachdb · Nov 19, 2018 · d52576a
2 parents 7c2fd9d + 53cecf1
commit d52576a
Showing 1 changed file with 13 additions and 11 deletions.
diff --git a/pkg/storage/raft_log_queue.go b/pkg/storage/raft_log_queue.go
@@ -99,20 +99,22 @@ func newTruncateDecision(ctx context.Context, r *Replica) (*truncateDecision, er
 
 	r.mu.Lock()
 	raftLogSize := r.mu.raftLogSize
-	// We target the raft log size at the size of the replicated data. When
-	// writing to a replica, it is common for the raft log to become larger than
-	// the replicated data as the raft log contains the overhead of the
-	// BatchRequest which includes the full transaction state as well as begin
-	// and end transaction operations. If the estimated raft log size becomes
-	// larger than the replica size, we're better off recovering the replica
-	// using a snapshot.
-	targetSize := r.mu.state.Stats.Total()
+	// A "cooperative" truncation (i.e. one that does not cut off followers from
+	// the log) takes place whenever there are more than
+	// RaftLogQueueStaleThreshold entries or the log's estimated size is above
+	// RaftLogQueueStaleSize bytes. This is fairly aggressive, so under normal
+	// conditions, the log is very small.
+	//
+	// If followers start falling behind, at some point the logs still need to
+	// be truncated. We do this when the size of the log exceeds
+	// RaftLogTruncationThreshold (or, in eccentric configurations, the zone's
+	// RangeMaxBytes). This captures the heuristic that at some point, it's more
+	// efficient to catch up via a snapshot than via applying a long tail of log
+	// entries.
+	targetSize := r.store.cfg.RaftLogTruncationThreshold
 	if targetSize > *r.mu.zone.RangeMaxBytes {
 		targetSize = *r.mu.zone.RangeMaxBytes
 	}
-	if targetSize > r.store.cfg.RaftLogTruncationThreshold {
-		targetSize = r.store.cfg.RaftLogTruncationThreshold
-	}
 	raftStatus := r.raftStatusRLocked()
 
 	firstIndex, err := r.raftFirstIndexLocked()
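
To make the effect of this hunk concrete, here is a small standalone sketch comparing the removed and the added threshold computation. The helper names `oldTargetSize` and `newTargetSize` and their plain `int64` parameters are assumptions made for illustration; in the diff the values come from `r.mu.state.Stats.Total()`, `*r.mu.zone.RangeMaxBytes`, and `r.store.cfg.RaftLogTruncationThreshold`.

```go
package main

import "fmt"

// oldTargetSize models the removed logic: start from the replica's size and
// clamp it by the zone's RangeMaxBytes and by the truncation threshold.
func oldTargetSize(replicaSize, rangeMaxBytes, truncationThreshold int64) int64 {
	target := replicaSize
	if target > rangeMaxBytes {
		target = rangeMaxBytes
	}
	if target > truncationThreshold {
		target = truncationThreshold
	}
	return target
}

// newTargetSize models the added logic: start from the truncation threshold
// and clamp it only by the zone's RangeMaxBytes.
func newTargetSize(rangeMaxBytes, truncationThreshold int64) int64 {
	target := truncationThreshold
	if target > rangeMaxBytes {
		target = rangeMaxBytes
	}
	return target
}

func main() {
	const (
		replicaSize = 100 << 10 // a persistently small replica (100kb)
		rangeMax    = 64 << 20  // illustrative zone setting, not taken from the source
		threshold   = 4 << 20   // the 4mb threshold from the commit message
	)
	fmt.Println("old:", oldTargetSize(replicaSize, rangeMax, threshold))
	fmt.Println("new:", newTargetSize(rangeMax, threshold))
}
```

For the small replica the old computation yields the replica size (102400 bytes) as the aggressive-truncation threshold, while the new one yields the full 4mb (4194304 bytes), so a follower that lags slightly no longer falls off the log.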