storage: truncate aggressively only after 4mb of logs
Whenever the "max raft log size" is exceeded, log truncations become
more aggressive in that they aim at the quorum commit index, potentially
cutting off followers (which then need Raft snapshots).

The effective threshold log size is 4mb for replicas larger than 4mb and
the replica size otherwise. This latter case can be problematic since
replicas can be persistently small despite having steady log progress
(for example, range 4 receives node status updates which are large
inline puts). If in such a range a follower falls behind just slightly,
it'll need a snapshot. This isn't in itself the biggest deal since the
snapshot is fairly rare (the required log entries are usually already
in transit to the follower) and would be small, but it's not ideal.
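
To make the problem concrete, here is a minimal sketch of the pre-change threshold computation as described above. The function name, the constant name and the sample sizes are hypothetical and chosen for illustration only; the real logic lives in newTruncateDecision and reads these values from the replica and store config.

```go
package main

import "fmt"

// Hypothetical stand-in for the store's RaftLogTruncationThreshold (4mb).
const raftLogTruncationThreshold int64 = 4 << 20

// oldTargetSize sketches the pre-change behavior: start from the replica
// size, then cap it by the zone's max range size and the 4mb threshold.
func oldTargetSize(replicaSize, rangeMaxBytes int64) int64 {
	target := replicaSize
	if target > rangeMaxBytes {
		target = rangeMaxBytes
	}
	if target > raftLogTruncationThreshold {
		target = raftLogTruncationThreshold
	}
	return target
}

func main() {
	const rangeMaxBytes = 64 << 20 // assumed zone setting, for illustration

	// A persistently small replica (100 KiB here) gets a threshold equal
	// to its own size, so even a modest log triggers aggressive truncation.
	fmt.Println(oldTargetSize(100<<10, rangeMaxBytes)) // 102400

	// A large replica is capped at the 4mb threshold.
	fmt.Println(oldTargetSize(32<<20, rangeMaxBytes)) // 4194304
}
```

With these assumed numbers, a follower on the small range only has to fall behind by about 100 KiB of log before it is cut off, which is the case this commit is meant to avoid.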

Always use a 4mb threshold instead. Note that we also truncate the log
to the minimum replicated index if the log size is above 64kb. This is
similarly aggressive but respects followers (until they fall behind by
4mb or more).
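
The resulting two-tier behavior can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in raft_log_queue.go: chooseTruncation, truncationKind and the constant names are hypothetical, and the sketch only looks at log size (the real queue also considers the number of entries).

```go
package main

import "fmt"

const (
	staleSize           int64 = 64 << 10 // 64kb: cooperative truncation kicks in
	truncationThreshold int64 = 4 << 20  // 4mb: aggressive truncation kicks in
)

type truncationKind int

const (
	noTruncation truncationKind = iota
	cooperative                 // truncate to the minimum replicated index
	aggressive                  // truncate to the quorum commit index
)

// chooseTruncation returns the kind of truncation and the index it aims at.
func chooseTruncation(
	logSize, rangeMaxBytes int64, minReplicatedIndex, quorumIndex uint64,
) (truncationKind, uint64) {
	// After this commit the target no longer depends on the replica size.
	target := truncationThreshold
	if target > rangeMaxBytes {
		target = rangeMaxBytes
	}
	switch {
	case logSize > target:
		// May cut off slow followers, which then need a Raft snapshot.
		return aggressive, quorumIndex
	case logSize > staleSize:
		// Respects all followers.
		return cooperative, minReplicatedIndex
	default:
		return noTruncation, 0
	}
}

func main() {
	const rangeMaxBytes = 64 << 20 // assumed zone setting, for illustration

	kind, idx := chooseTruncation(1<<20, rangeMaxBytes, 90, 100)
	fmt.Println(kind == cooperative, idx) // true 90

	kind, idx = chooseTruncation(8<<20, rangeMaxBytes, 90, 100)
	fmt.Println(kind == aggressive, idx) // true 100
}
```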

My expectation is that this will not functionally change anything. It
might leave behind a little bit more Raft log on quiescent ranges, but I
think the solution here is performing "one last truncation" for ranges
that are quiescent to make sure they shed the remainder of their Raft
log.

Touches #32046.

Release note: None
tbg committed Nov 19, 2018
1 parent b49107e commit 53cecf1
Showing 1 changed file with 13 additions and 11 deletions.
24 changes: 13 additions & 11 deletions pkg/storage/raft_log_queue.go
@@ -99,20 +99,22 @@ func newTruncateDecision(ctx context.Context, r *Replica) (*truncateDecision, er

 	r.mu.Lock()
 	raftLogSize := r.mu.raftLogSize
-	// We target the raft log size at the size of the replicated data. When
-	// writing to a replica, it is common for the raft log to become larger than
-	// the replicated data as the raft log contains the overhead of the
-	// BatchRequest which includes the full transaction state as well as begin
-	// and end transaction operations. If the estimated raft log size becomes
-	// larger than the replica size, we're better off recovering the replica
-	// using a snapshot.
-	targetSize := r.mu.state.Stats.Total()
+	// A "cooperative" truncation (i.e. one that does not cut off followers from
+	// the log) takes place whenever there are more than
+	// RaftLogQueueStaleThreshold entries or the log's estimated size is above
+	// RaftLogQueueStaleSize bytes. This is fairly aggressive, so under normal
+	// conditions, the log is very small.
+	//
+	// If followers start falling behind, at some point the logs still need to
+	// be truncated. We do this either when the size of the log exceeds
+	// RaftLogTruncationThreshold (or, in eccentric configurations, the zone's
+	// RangeMaxBytes). This captures the heuristic that at some point, it's more
+	// efficient to catch up via a snapshot than via applying a long tail of log
+	// entries.
+	targetSize := r.store.cfg.RaftLogTruncationThreshold
 	if targetSize > *r.mu.zone.RangeMaxBytes {
 		targetSize = *r.mu.zone.RangeMaxBytes
 	}
-	if targetSize > r.store.cfg.RaftLogTruncationThreshold {
-		targetSize = r.store.cfg.RaftLogTruncationThreshold
-	}
 	raftStatus := r.raftStatusRLocked()

 	firstIndex, err := r.raftFirstIndexLocked()