storage: raft_log_queue should be more aggressive #6012

Closed
bdarnell opened this issue Apr 12, 2016 · 12 comments

@bdarnell
Contributor

Currently, raft_log_queue will never truncate log entries that are still needed by the replica that is farthest behind. This leads to pathological cases in which a single struggling replica can cause unbounded log growth, as in these logs from #5970:

node4.log:W160406 14:17:15.637214 storage/raft_log_queue.go:116  storage/raft_log_queue.go:101: raft log's oldest index (0) is less than the first index (243127) for range 17
node4.log:W160406 14:17:35.639891 storage/raft_log_queue.go:116  storage/raft_log_queue.go:101: raft log's oldest index (0) is less than the first index (56364) for range 76
node4.log:W160406 14:23:50.864039 storage/raft_log_queue.go:116  storage/raft_log_queue.go:101: raft log's oldest index (0) is less than the first index (108869) for range 42
node4.log:W160406 14:23:55.867391 storage/raft_log_queue.go:116  storage/raft_log_queue.go:101: raft log's oldest index (0) is less than the first index (64298) for range 112

To prevent the log from getting too big, we should be willing to truncate applied log entries even if a lagging replica still needs them. The raft protocol handles this gracefully: it will fall back to sending a snapshot to help the lagging replica recover. We should still prefer to keep log entries around until all replicas are caught up, because that's the most efficient path, but truncating the log is better than letting it grow without bound.
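
As a rough illustration of this policy (not the actual raft_log_queue code; the function name and the 1000-entry cap below are invented for the example), the truncation decision might look something like:

```go
package main

import "fmt"

// maxLogEntries is a hypothetical cap on how far the log may grow beyond
// the slowest replica before we truncate anyway.
const maxLogEntries = 1000

// truncationIndex returns the index up to which the raft log can be
// truncated. Normally we keep everything the slowest replica still needs,
// but once the log has grown more than maxLogEntries past that replica,
// we truncate up to the quorum commit index and let raft fall back to
// sending the lagging replica a snapshot.
func truncationIndex(quorumCommit uint64, replicaProgress []uint64, lastIndex uint64) uint64 {
	oldestNeeded := lastIndex
	for _, p := range replicaProgress {
		if p < oldestNeeded {
			oldestNeeded = p
		}
	}
	if lastIndex-oldestNeeded > maxLogEntries {
		return quorumCommit
	}
	return oldestNeeded
}

func main() {
	// Two replicas are caught up at index 243127; one is stuck at 0, as in
	// the range 17 log lines above. The stuck replica no longer holds the
	// truncation back.
	fmt.Println(truncationIndex(243127, []uint64{243127, 243127, 0}, 243127))
}
```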

@petermattis
Collaborator

I'm going to take a look at this. Any ideas on what heuristic to use? Simplest would be to cap the growth at a certain number of entries. Capping based on the size (in bytes) of the raft log would be more intuitive in some ways (e.g. cap the raft log to 1/4 the range size), though more difficult to implement. Or we could implement some time-based mechanism where we keep entries for at least X minutes. Thoughts?

@tbg
Member

tbg commented Jun 3, 2016

I think for the time being (and since this is an active issue), simple does it. You could refine the computation of the oldest truncatable index a little bit: as long as the group is active, there's a quorum, so only #quorum nodes are expected to matter for progress. Throw away the (n - #quorum) smallest progress updates, and the result is what you can reasonably expect to have to keep. Pad that with a generous offset (so that in the normal case this logic wouldn't trigger) and hopefully that'd do well in practice.

The upshot in the given example would be that the one failed node's progress is discarded and you use only the data of the two that are actually running.
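
A minimal sketch of that heuristic, assuming made-up names (`progress`, `padding`) rather than the real raft types:

```go
package main

import (
	"fmt"
	"sort"
)

// quorumTruncationIndex drops the (n - quorum) smallest progress values,
// takes the smallest of what remains (the lowest index that some member of
// a quorum still needs), and subtracts a generous padding so that in the
// normal case a briefly-lagging replica can still catch up from the log.
func quorumTruncationIndex(progress []uint64, padding uint64) uint64 {
	sorted := append([]uint64(nil), progress...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	quorum := len(sorted)/2 + 1
	idx := sorted[len(sorted)-quorum] // smallest index within some quorum
	if idx < padding {
		return 0
	}
	return idx - padding
}

func main() {
	// The one failed node's progress (0) is discarded; the two live nodes
	// at 56364 drive the truncation decision.
	fmt.Println(quorumTruncationIndex([]uint64{0, 56364, 56364}, 1000))
}
```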

@petermattis
Collaborator

Yes about the quorum; I thought that was implicit in whatever we do here. The question is what number of additional entries to keep beyond what the quorum has executed. 100? 1000?

@tbg
Member

tbg commented Jun 3, 2016

1000 sounds like a good starting point. Easy enough to adjust if Ben disagrees.

petermattis added a commit to petermattis/cockroach that referenced this issue Jun 3, 2016
When deciding where to truncate the raft log, use the quorum applied
index (minus `raftLogPadding`) instead of the oldest index. The oldest
index can get excessively (and arbitrarily) old if a replica goes down
for a significant period of time.

See cockroachdb#6012.
@petermattis
Collaborator

Note that #7040 is not a complete fix for raft log growth. After that change, we can still see excessive raft log growth due to the replica scanner interval, which defaults to 10min; hundreds of thousands of entries can be added to a replica's raft log between scans. That makes me wonder why we're triggering raft log truncation from a scanner queue at all. An alternative approach would be to trigger truncation after adding to the raft log. Thoughts?

@bdarnell
Contributor Author

bdarnell commented Jun 3, 2016

Bytes are the best metric to track here. The logs should be truncated at around 64MB, because at that point it becomes cheaper to catch up with a snapshot than to transfer the log. Time-based truncation could serve as a fallback, but it's unlikely to matter in practice (the repair queue should kick in before a time-based truncation would). The number of log entries is useful only as a proxy for byte size, so 1000 seems low; maybe 64000?

Agreed that a scanner queue is not ideal; we should add ranges to the raft log queue when a command is applied, just like with maybeAddToSplitQueue.
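
Purely for illustration, a stripped-down version of what such an on-apply hook could look like; `raftLogQueue`, `maybeAdd`, and the field names are stand-ins for whatever the real implementation ends up using:

```go
package main

import "fmt"

// Past roughly this many bytes, catching up via snapshot is cheaper than
// replaying the log (the ~64MB figure from the comment above).
const raftLogSizeThreshold = 64 << 20

type replica struct {
	rangeID     int64
	raftLogSize int64 // approximate size of the raft log in bytes
}

type raftLogQueue struct {
	pending []int64 // range IDs queued for truncation
}

// maybeAdd queues the replica for log truncation if its raft log has grown
// past the threshold, instead of waiting for a periodic scanner pass.
func (q *raftLogQueue) maybeAdd(r *replica) {
	if r.raftLogSize > raftLogSizeThreshold {
		q.pending = append(q.pending, r.rangeID)
	}
}

func main() {
	q := &raftLogQueue{}
	r := &replica{rangeID: 17, raftLogSize: 80 << 20}
	q.maybeAdd(r) // would be called right after a command is applied
	fmt.Println(q.pending)
}
```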

@petermattis
Collaborator

Thanks for the pointer to maybeAddToSplitQueue. Yes, that's what I was imagining, and something like that should be done for raft log truncation. Agreed that log byte size is the right metric to base truncation on, though I'm not sure we'd want to allow the log to grow to the same size as the range itself before truncating.

My plan is to get #7040 merged first, since it improves the heuristic for which index to truncate at over the current state; then make a change to check for truncation at the end of write ops, similar to maybeAddToSplitQueue; and lastly change the truncation heuristic to be based on the log size in bytes.

petermattis added a commit to petermattis/cockroach that referenced this issue Jun 4, 2016
When deciding where to truncate the raft log, use the quorum applied
index (minus `raftLogPadding`) instead of the oldest index. The oldest
index can get excessively (and arbitrarily) old if a replica goes down
for a significant period of time.

See cockroachdb#6012.
@petermattis
Collaborator

Rather than adding replicas to the raft-log-queue after processing a request, perhaps we should be appending TruncateLogRequests to write batches before they are sent into raft. In normal operation (i.e. all replicas up), this would amortize the cost of truncating log entries and would piggyback the truncate-log requests on existing RPCs.
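
A hedged sketch of how such piggybacking might look, including a minimum-entries gate so tiny truncations don't get proposed; every type and name here (`writeBatch`, `truncateLogRequest`, `minEntriesToTruncate`) is invented for illustration:

```go
package main

import "fmt"

// Don't bother proposing a truncation for only a handful of entries.
const minEntriesToTruncate = 100

type request interface{ name() string }

type putRequest struct{ key string }

func (putRequest) name() string { return "Put" }

type truncateLogRequest struct{ index uint64 }

func (truncateLogRequest) name() string { return "TruncateLog" }

type writeBatch struct{ requests []request }

// maybeAppendTruncation piggybacks a TruncateLog request on an existing
// batch when enough entries are eligible, so the truncation rides along on
// an RPC we were sending anyway.
func maybeAppendTruncation(b *writeBatch, firstIndex, truncatableIndex uint64) {
	if truncatableIndex > firstIndex && truncatableIndex-firstIndex >= minEntriesToTruncate {
		b.requests = append(b.requests, truncateLogRequest{index: truncatableIndex})
	}
}

func main() {
	b := &writeBatch{requests: []request{putRequest{key: "a"}}}
	maybeAppendTruncation(b, 100, 350)
	for _, r := range b.requests {
		fmt.Println(r.name()) // Put, then TruncateLog
	}
}
```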

@bdarnell
Contributor Author

bdarnell commented Jun 4, 2016

Truncating the logs invalidates some caches, so it's probably not a win to do it on every write. It also requires information that is only present on the raft leader, so we may not be able to do it when we generate the batch. When we move raft logs out of RocksDB, truncation is likely to become more coarse-grained.

@petermattis
Collaborator

Which caches does truncating raft logs invalidate?

I thought the raft leader and range leader were usually synonymous. We wouldn't have to truncate at every opportunity. For example, we could have a heuristic where we only send a truncation request when there are at least 100 entries to truncate.

Do you see moving raft logs out of RocksDB as something that will happen soon? Regardless, it seems desirable that the code path generating the raft logs also assumes some responsibility for cleaning them up. This is similar to memory garbage collection, where some GC systems will "steal" an allocating thread in order to perform GC work (Go does this).

@bdarnell
Contributor Author

bdarnell commented Jun 5, 2016

Which caches does truncating raft logs invalidate?

r.mu.truncatedState

I thought the raft leader and range leader were usually synonymous.

We now make an effort to do this, but we haven't yet measured how well it works in practice. I wouldn't be surprised if the raft leadership transfer gets aborted with some frequency.

Do you see moving raft logs out of RocksDB as something that will happen soon?

I don't, but Spencer is pretty keen on doing it.

Regardless, it seems desirable that the code path generating the raft logs also assumes some responsibility for cleaning them up.

Agreed. This may avoid the need for separate backpressure based on the raft log size.

petermattis added a commit to petermattis/cockroach that referenced this issue Jun 5, 2016
When deciding where to truncate the raft log, use the quorum matched
index (minus `raftLogPadding`) instead of the oldest index. The oldest
index can get excessively (and arbitrarily) old if a replica goes down
for a significant period of time.

See cockroachdb#6012.
@petermattis
Collaborator

r.mu.truncatedState doesn't need to be cleared on a TruncateLogRequest. We're writing the proto to disk, so it seems reasonable that we could update the in-memory cache instead. The complexity is that we want to perform that update only after the batch commits successfully, but that could be done with a Batch.Defer.
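
A generic sketch of that deferred-update pattern; the `batch` and `replica` types below are stand-ins and don't reproduce the real `Batch.Defer` API:

```go
package main

import "fmt"

type truncatedState struct{ index, term uint64 }

type replica struct {
	cachedTruncatedState truncatedState
}

type batch struct {
	deferred []func() // callbacks to run only after a successful commit
}

func (b *batch) Defer(fn func()) { b.deferred = append(b.deferred, fn) }

func (b *batch) Commit() error {
	// ... write the batch to storage; on success, run the deferred updates.
	for _, fn := range b.deferred {
		fn()
	}
	return nil
}

func main() {
	r := &replica{}
	b := &batch{}
	newState := truncatedState{index: 243127, term: 6}
	// Instead of clearing the cache, update it with what we just wrote,
	// but only once the commit has gone through.
	b.Defer(func() { r.cachedTruncatedState = newState })
	if err := b.Commit(); err == nil {
		fmt.Println(r.cachedTruncatedState)
	}
}
```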

I'm not so keen on moving away from RocksDB for raft logs. That feels more like 2017 or later work.

Btw, if we truncate the raft log quickly enough, we might be able to avoid writing the raft log memtable entries to disk. That is, the raft log entry would get written to the RocksDB WAL and added to the memtable, but if we delete it quickly enough the delete would also hit the memtable. I believe RocksDB optimizes this case and doesn't write a deletion tombstone if KeyMayExist returns false.

petermattis added a commit to petermattis/cockroach that referenced this issue Jun 5, 2016
Instead of clearing the cached `Replica.truncatedState` after committing
a batch which contains a `TruncateLogRequest`, we now update the cached
data based on what was written to disk, using a `Batch.Defer`. This follows
the pattern used for `ChangeFrozenRequest` and `GCRequest`.

See cockroachdb#6012.
petermattis added a commit to petermattis/cockroach that referenced this issue Jun 6, 2016
When deciding where to truncate the raft log, use the quorum matched
index (minus `raftLogPadding`) instead of the oldest index. The oldest
index can get excessively (and arbitrarily) old if a replica goes down
for a significant period of time.

See cockroachdb#6012.
petermattis added nine more commits to petermattis/cockroach that referenced this issue between Jun 9 and Jun 14, 2016