storage: raft_log_queue should be more aggressive #6012

Closed
bdarnell opened this issue Apr 12, 2016 · 12 comments

@bdarnell
Contributor

Currently, raft_log_queue will never truncate log entries that are still needed by the replica that is farthest behind. This leads to pathological cases in which a single struggling replica can cause unbounded log growth, as in these logs from #5970:

node4.log:W160406 14:17:15.637214 storage/raft_log_queue.go:116  storage/raft_log_queue.go:101: raft log's oldest index (0) is less than the first index (243127) for range 17
node4.log:W160406 14:17:35.639891 storage/raft_log_queue.go:116  storage/raft_log_queue.go:101: raft log's oldest index (0) is less than the first index (56364) for range 76
node4.log:W160406 14:23:50.864039 storage/raft_log_queue.go:116  storage/raft_log_queue.go:101: raft log's oldest index (0) is less than the first index (108869) for range 42
node4.log:W160406 14:23:55.867391 storage/raft_log_queue.go:116  storage/raft_log_queue.go:101: raft log's oldest index (0) is less than the first index (64298) for range 112

To prevent the log from getting too big, we should be willing to truncate applied log entries even if a lagging replica still needs them. The raft protocol handles this gracefully: it will fall back to sending a snapshot to help the lagging replica recover. We should still prefer to keep log entries around until all replicas are caught up, because that's the most efficient path, but truncating the log is better than letting it grow without bound.
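
As a rough illustration of this policy (not the actual raft_log_queue code; the function name and the 1000-entry cap below are invented for the example), the truncation decision might look something like:

```go
package main

import "fmt"

// maxLogEntries is a hypothetical cap on how far the log may grow beyond
// the slowest replica before we truncate anyway.
const maxLogEntries = 1000

// truncationIndex returns the index up to which the raft log can be
// truncated. Normally we keep everything the slowest replica still needs,
// but once the log has grown more than maxLogEntries past that replica,
// we truncate up to the quorum commit index and let raft fall back to
// sending the lagging replica a snapshot.
func truncationIndex(quorumCommit uint64, replicaProgress []uint64, lastIndex uint64) uint64 {
	oldestNeeded := lastIndex
	for _, p := range replicaProgress {
		if p < oldestNeeded {
			oldestNeeded = p
		}
	}
	if lastIndex-oldestNeeded > maxLogEntries {
		return quorumCommit
	}
	return oldestNeeded
}

func main() {
	// Two replicas are caught up at index 243127; one is stuck at 0, as in
	// the range 17 log lines above. The stuck replica no longer holds the
	// truncation back.
	fmt.Println(truncationIndex(243127, []uint64{243127, 243127, 0}, 243127))
}
```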

@petermattis
Collaborator

I'm going to take a look at this. Any ideas on what heuristic to use? Simplest would be to cap the growth at a certain number of entries. Capping based on the size (in bytes) of the raft log would be more intuitive in some ways (e.g. cap the raft log to 1/4 the range size), though more difficult to implement. Or we could implement some time-based mechanism where we keep entries for at least X minutes. Thoughts?

@tbg
Member

tbg commented Jun 3, 2016

I think for the time being (and since this is an active issue), simple does it. You could refine the computation of the oldest truncatable index a little bit: as long as the group is active, there's a quorum, so only #quorum nodes are expected to matter for progress. Throw away the (n - #quorum) smallest progress updates, and the result is what you can reasonably expect to have to keep. Pad that with a generous offset (so that in the normal case this logic wouldn't trigger) and hopefully that'd do well in practice.

The upshot in the given example would be that the one failed node's progress is discarded and you use only the data of the two that are actually running.
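
A minimal sketch of that heuristic, assuming made-up names (`progress`, `padding`) rather than the real raft types:

```go
package main

import (
	"fmt"
	"sort"
)

// quorumTruncationIndex drops the (n - quorum) smallest progress values,
// takes the smallest of what remains (the lowest index that some member of
// a quorum still needs), and subtracts a generous padding so that in the
// normal case a briefly-lagging replica can still catch up from the log.
func quorumTruncationIndex(progress []uint64, padding uint64) uint64 {
	sorted := append([]uint64(nil), progress...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	quorum := len(sorted)/2 + 1
	idx := sorted[len(sorted)-quorum] // smallest index within some quorum
	if idx < padding {
		return 0
	}
	return idx - padding
}

func main() {
	// The one failed node's progress (0) is discarded; the two live nodes
	// at 56364 drive the truncation decision.
	fmt.Println(quorumTruncationIndex([]uint64{0, 56364, 56364}, 1000))
}
```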

@petermattis
Collaborator

Yes about the quorum; I thought that was implicit in whatever we do here. The question is what number of additional entries to keep beyond what the quorum has executed. 100? 1000?

@tbg
Member

tbg commented Jun 3, 2016

1000 sounds like a good starting point. Easy enough to adjust if Ben disagrees.

petermattis added a commit to petermattis/cockroach that referenced this issue Jun 3, 2016
When deciding where to truncate the raft log, use the quorum applied
index (minus `raftLogPadding`) instead of the oldest index. The oldest
index can get excessively (and arbitrarily) old if a replica goes down
for a significant period of time.

See cockroachdb#6012.
@petermattis
Collaborator

Note that #7040 is not a complete fix for raft log growth. After that change, we can still see excessive raft log growth due to the replica scanner interval, which defaults to 10min; hundreds of thousands of entries can be added to a replica's raft log between scans. That makes me wonder why we're triggering raft log truncation from a scanner queue at all. An alternative approach would be to trigger truncation after adding to the raft log. Thoughts?

@bdarnell
Contributor Author

bdarnell commented Jun 3, 2016

Bytes are the best metric to track here. The logs should be truncated at around 64MB, because at that point it becomes cheaper to catch up with a snapshot than to transfer the log. Time-based truncation could serve as a fallback, but it's unlikely to matter in practice (the repair queue should kick in before a time-based truncation would). The number of log entries is useful only as a proxy for byte size, so 1000 seems low; maybe 64000?

Agreed that a scanner queue is not ideal; we should add ranges to the raft log queue when a command is applied, just like with maybeAddToSplitQueue.
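
Purely for illustration, a stripped-down version of what such an on-apply hook could look like; `raftLogQueue`, `maybeAdd`, and the field names are stand-ins for whatever the real implementation ends up using:

```go
package main

import "fmt"

// Past roughly this many bytes, catching up via snapshot is cheaper than
// replaying the log (the ~64MB figure from the comment above).
const raftLogSizeThreshold = 64 << 20

type replica struct {
	rangeID     int64
	raftLogSize int64 // approximate size of the raft log in bytes
}

type raftLogQueue struct {
	pending []int64 // range IDs queued for truncation
}

// maybeAdd queues the replica for log truncation if its raft log has grown
// past the threshold, instead of waiting for a periodic scanner pass.
func (q *raftLogQueue) maybeAdd(r *replica) {
	if r.raftLogSize > raftLogSizeThreshold {
		q.pending = append(q.pending, r.rangeID)
	}
}

func main() {
	q := &raftLogQueue{}
	r := &replica{rangeID: 17, raftLogSize: 80 << 20}
	q.maybeAdd(r) // would be called right after a command is applied
	fmt.Println(q.pending)
}
```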

@petermattis
Collaborator

Thanks for the pointer to maybeAddToSplitQueue. Yes, that's what I was imagining, and something like that should be done for raft log truncation. Agreed that log byte size is the right metric to base truncation on, though I'm not sure we'd want to allow the log to grow to the same size as the range itself before truncating.

My plan is to get #7040 merged first, since it improves the heuristic for which index to truncate at over the current state; then make a change to check for truncation at the end of write ops, similar to maybeAddToSplitQueue; and lastly change the truncation heuristic to be based on the log size in bytes.

petermattis added a commit to petermattis/cockroach that referenced this issue Jun 4, 2016
When deciding where to truncate the raft log, use the quorum applied
index (minus `raftLogPadding`) instead of the oldest index. The oldest
index can get excessively (and arbitrarily) old if a replica goes down
for a significant period of time.

See cockroachdb#6012.
@petermattis
Collaborator

Rather than adding replicas to the raft-log-queue after processing a request, perhaps we should be appending TruncateLogRequests to write batches before they are sent into raft. In normal operation (i.e. all replicas up), this would amortize the cost of truncating log entries and would piggyback the truncate-log requests on existing RPCs.
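
A hedged sketch of how such piggybacking might look, including a minimum-entries gate so tiny truncations don't get proposed; every type and name here (`writeBatch`, `truncateLogRequest`, `minEntriesToTruncate`) is invented for illustration:

```go
package main

import "fmt"

// Don't bother proposing a truncation for only a handful of entries.
const minEntriesToTruncate = 100

type request interface{ name() string }

type putRequest struct{ key string }

func (putRequest) name() string { return "Put" }

type truncateLogRequest struct{ index uint64 }

func (truncateLogRequest) name() string { return "TruncateLog" }

type writeBatch struct{ requests []request }

// maybeAppendTruncation piggybacks a TruncateLog request on an existing
// batch when enough entries are eligible, so the truncation rides along on
// an RPC we were sending anyway.
func maybeAppendTruncation(b *writeBatch, firstIndex, truncatableIndex uint64) {
	if truncatableIndex > firstIndex && truncatableIndex-firstIndex >= minEntriesToTruncate {
		b.requests = append(b.requests, truncateLogRequest{index: truncatableIndex})
	}
}

func main() {
	b := &writeBatch{requests: []request{putRequest{key: "a"}}}
	maybeAppendTruncation(b, 100, 350)
	for _, r := range b.requests {
		fmt.Println(r.name()) // Put, then TruncateLog
	}
}
```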

@bdarnell
Contributor Author

bdarnell commented Jun 4, 2016

Truncating the logs invalidates some caches, so it's probably not a win to do it on every write. It also requires information that is only present on the raft leader, so we may not be able to do it when we generate the batch. When we move raft logs out of RocksDB, truncation is likely to become more coarse-grained.

@petermattis
Collaborator

Which caches does truncating raft logs invalidate?

I thought the raft leader and range leader were usually synonymous. We wouldn't have to truncate at every opportunity. For example, we could have a heuristic where we only send a truncation request when there are at least 100 entries to truncate.

Do you see moving raft logs out of RocksDB as something that will happen soon? Regardless, it seems desirable that the code path generating the raft logs also assumes some responsibility for cleaning them up. This is similar to memory garbage collection, where some GC systems will "steal" an allocating thread in order to perform GC work (Go does this).

@bdarnell
Contributor Author

bdarnell commented Jun 5, 2016

Which caches does truncating raft logs invalidate?

r.mu.truncatedState

I thought the raft leader and range leader were usually synonymous.

We now make an effort to do this, but we haven't yet measured how well it works in practice. I wouldn't be surprised if the raft leadership transfer gets aborted with some frequency.

Do you see moving raft logs out of RocksDB as something that will happen soon?

I don't, but Spencer is pretty keen on doing it.

Regardless, it seems desirable that the code path generating the raft logs also assumes some responsibility for cleaning them up.

Agreed. This may avoid the need for separate backpressure based on the raft log size.

petermattis added a commit to petermattis/cockroach that referenced this issue Jun 5, 2016
When deciding where to truncate the raft log, use the quorum matched
index (minus `raftLogPadding`) instead of the oldest index. The oldest
index can get excessively (and arbitrarily) old if a replica goes down
for a significant period of time.

See cockroachdb#6012.
@petermattis
Collaborator

r.mu.truncatedState doesn't need to be cleared on a TruncateLogRequest. We're writing the proto to disk, so it seems reasonable that we could update the in-memory cache instead. The complexity is that we want to perform that update only after the batch commits successfully, but that could be done with a Batch.Defer.
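
A generic sketch of that deferred-update pattern; the `batch` and `replica` types below are stand-ins and don't reproduce the real `Batch.Defer` API:

```go
package main

import "fmt"

type truncatedState struct{ index, term uint64 }

type replica struct {
	cachedTruncatedState truncatedState
}

type batch struct {
	deferred []func() // callbacks to run only after a successful commit
}

func (b *batch) Defer(fn func()) { b.deferred = append(b.deferred, fn) }

func (b *batch) Commit() error {
	// ... write the batch to storage; on success, run the deferred updates.
	for _, fn := range b.deferred {
		fn()
	}
	return nil
}

func main() {
	r := &replica{}
	b := &batch{}
	newState := truncatedState{index: 243127, term: 6}
	// Instead of clearing the cache, update it with what we just wrote,
	// but only once the commit has gone through.
	b.Defer(func() { r.cachedTruncatedState = newState })
	if err := b.Commit(); err == nil {
		fmt.Println(r.cachedTruncatedState)
	}
}
```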

I'm not so keen on moving away from RocksDB for raft logs. That feels more like 2017 or later work.

Btw, if we truncate the raft log quickly enough, we might be able to avoid writing the raft log memtable entries to disk. That is, the raft log entry would get written to the RocksDB WAL and added to the memtable, but if we delete it quickly enough the delete would also hit the memtable. I believe RocksDB optimizes this case and doesn't write a deletion tombstone if KeyMayExist returns false.

petermattis added a commit to petermattis/cockroach that referenced this issue Jun 5, 2016
Instead of clearing the cached `Replica.truncatedState` after committing
a batch which contains a `TruncateLogRequest`, we now update the cached
data based on what was written to disk, using a `Batch.Defer`. This follows
the pattern used for `ChangeFrozenRequest` and `GCRequest`.

See cockroachdb#6012.
petermattis added a commit to petermattis/cockroach that referenced this issue Jun 6, 2016
When deciding where to truncate the raft log, use the quorum matched
index (minus `raftLogPadding`) instead of the oldest index. The oldest
index can get excessively (and arbitrarily) old if a replica goes down
for a significant period of time.

See cockroachdb#6012.
petermattis added nine more commits to petermattis/cockroach that referenced this issue between Jun 9 and Jun 14, 2016