storage: raft_log_queue should be more aggressive #6012
I'm going to take a look at this. Any ideas on what heuristic to use? Simplest would be to cap the growth at a certain number of entries. Capping based on the size (in bytes) of the raft log would be more intuitive in some ways (e.g. cap the raft log to 1/4 the range size), though more difficult to implement. Or we could implement some time-based mechanism where we keep entries for at least X minutes. Thoughts?
I think for the time being (and since this is an active issue), simple does it. You could refine the computation of the oldest truncatable index a little bit: as long as the group is active, there's a quorum - so only #quorum nodes are expected to matter for progress. So you throw away the (n-#quorum) smallest progress updates and the result is what you reasonably expect to have to keep. Pad that with a generous offset (so that in the normal case, this logic wouldn't trigger) and hopefully that'd do well in practice. The upshot in the given example would be that the one failed node's progress is discarded and you use only the data of the two that are actually running.
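A minimal Go sketch of this quorum-based heuristic; `quorumMatchedIndex` and `raftLogPadding` are illustrative names for this sketch, not the actual CockroachDB identifiers:

```go
package main

import (
	"fmt"
	"sort"
)

// quorumMatchedIndex returns the highest log index that a quorum of
// replicas has reached, given each replica's reported progress. The
// (n - quorum) smallest progress values are discarded, so a single
// slow or dead replica no longer pins the truncation point.
func quorumMatchedIndex(matched []uint64) uint64 {
	sort.Slice(matched, func(i, j int) bool { return matched[i] < matched[j] })
	quorum := len(matched)/2 + 1
	return matched[len(matched)-quorum]
}

func main() {
	// Three replicas; one has been down for a while and is far behind.
	matched := []uint64{100, 99500, 99800}
	const raftLogPadding = 1000 // generous offset so this rarely triggers in the normal case
	truncateTo := uint64(0)
	if q := quorumMatchedIndex(matched); q > raftLogPadding {
		truncateTo = q - raftLogPadding
	}
	fmt.Println(truncateTo) // 98500: the failed replica's progress of 100 is ignored
}
```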
Yes about the quorum. I thought that was implicit in whatever we do here. Question is about what number of additional entries to keep beyond what the quorum has executed. 100? 1000?
1000 sounds like a good starting point. Easy enough to adjust if needed.
When deciding where to truncate the raft log, use the quorum applied index (minus `raftLogPadding`) instead of the oldest index. The oldest index can get excessively (and arbitrarily) old if a replica goes down for a significant period of time. See cockroachdb#6012.
Note that #7040 is not a complete fix for raft log growth. After that change, we can still see excessive raft log growth due to the replica scanner interval, which defaults to 10min. Hundreds of thousands of entries can be added to the raft log for a replica between scans, which makes me wonder why we're triggering raft log truncation from a scanner queue. An alternative approach would be to trigger truncation after adding to the raft log. Thoughts?
Bytes are the best metric to track here. The logs should be truncated at around 64MB because at that point it will be cheaper to catch up with a snapshot than to transfer the log. Time-based truncation could be useful as a fallback, but it's unlikely to kick in (the repair queue should act before a time-based truncation would). The number of log entries is useful only as a proxy for byte size (so 1000 seems low; maybe 64000?). Agreed that a scanner queue is not ideal; we should add ranges to the raft log queue when a command is applied, just like with maybeAddToSplitQueue.
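A hedged sketch of the size-based rule described here; the 64MB constant and all names below are assumptions for illustration, not the project's actual code:

```go
package storage

// raftLogMaxSize is the point past which catching a follower up via
// snapshot is assumed cheaper than replaying the log (~64MB here).
const raftLogMaxSize = 64 << 20

// computeTruncateIndex prefers to keep entries until every replica has
// caught up, but falls back to the quorum index once the log is large
// enough that a lagging replica is better served by a snapshot.
func computeTruncateIndex(logSizeBytes int64, quorumIndex, behindIndex uint64) uint64 {
	if logSizeBytes > raftLogMaxSize {
		return quorumIndex // raft will send the straggler a snapshot
	}
	return behindIndex // cheapest path: let the straggler replay the log
}
```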
Thanks for the pointer to maybeAddToSplitQueue. My plan is to get #7040 merged, as it improves the heuristic for which index to truncate at over the current state, then make a change to check for truncation at the end of write ops, similar to maybeAddToSplitQueue.
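As a rough sketch of that plan, mirroring the maybeAddToSplitQueue pattern mentioned above; the types and threshold here are simplified stand-ins, not the real Replica/Store plumbing:

```go
package storage

// Simplified stand-ins for the real replica and queue types.
type replica struct {
	appliedIndex   uint64 // highest applied raft log index
	truncatedIndex uint64 // cached index of the last truncated entry
}

type raftLogQueue struct{ pending []*replica }

func (q *raftLogQueue) maybeAdd(r *replica) { q.pending = append(q.pending, r) }

// raftLogQueueStaleThreshold avoids queueing a truncation for every
// write; only queue once a meaningful number of entries has accumulated.
const raftLogQueueStaleThreshold = 100

// maybeAddToRaftLogQueue would be called at the end of a write op,
// making the code path that grows the raft log responsible for
// scheduling its cleanup.
func maybeAddToRaftLogQueue(q *raftLogQueue, r *replica) {
	if r.appliedIndex-r.truncatedIndex >= raftLogQueueStaleThreshold {
		q.maybeAdd(r)
	}
}
```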
Rather than adding replicas to the raft-log-queue after processing a request, perhaps we should be appending
Truncating the logs invalidates some caches, so it's probably not a win to do it on every write. It also requires information that is only present on the raft leader, so we may not be able to do it when we generate the batch. When we move raft logs out of RocksDB, truncation is likely to become more coarse-grained.
Which caches does truncating raft logs invalidate? I thought the raft leader and range leader were usually synonymous. We wouldn't have to truncate at every opportunity. For example, we could have a heuristic where we only send a truncation request when there are at least 100 entries to truncate. Do you see moving raft logs out of RocksDB as something that will happen soon? Regardless, it seems desirable that the code path generating the raft logs also assumes some responsibility for cleaning them up. This is similar to memory garbage collection, where some GC systems will "steal" an allocating thread in order to perform GC work (Go does this).
> I thought the raft leader and range leader were usually synonymous.

We now make an effort to do this but we haven't yet measured how well it works in practice. I wouldn't be surprised if the raft leadership transfer gets aborted with some frequency.

> Do you see moving raft logs out of RocksDB as something that will happen soon?

I don't, but Spencer is pretty keen on doing it.

> it seems desirable that the code path generating the raft logs also assumes some responsibility for cleaning them up.

Agreed. This may avoid the need for separate backpressure based on the raft log size.
When deciding where to truncate the raft log, use the quorum matched index (minus `raftLogPadding`) instead of the oldest index. The oldest index can get excessively (and arbitrarily) old if a replica goes down for a significant period of time. See cockroachdb#6012.
I'm not so keen on moving away from RocksDB for raft logs. That feels more like 2017-or-later work. Btw, if we truncate the raft log quickly enough, we might be able to avoid writing the raft log memtable entries to disk. That is, the raft log entry would get written to the RocksDB WAL and added to the memtable, but if we delete it quickly enough the delete would also hit the memtable. I believe RocksDB optimizes this case and doesn't write a deletion tombstone if the entry being deleted is still in the memtable.
Instead of clearing the cached `Replica.truncatedState` after committing a batch which contains a `TruncateLogRequest`, we now update the cached data based on what was written to disk use a `Batch.Defer`. This follows the pattern used for `ChangeFrozenRequest` and `GCRequest`. See cockroachdb#6012.
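A sketch of the deferred-update pattern that commit describes; the types below are simplified stand-ins for the engine's batch API, not the actual implementation:

```go
package storage

// batch is a simplified stand-in for an engine batch with Defer support.
type batch struct{ defers []func() }

// Defer registers fn to run after the batch commits successfully.
func (b *batch) Defer(fn func()) { b.defers = append(b.defers, fn) }

func (b *batch) Commit() error {
	// ... write the batched data to the underlying engine ...
	// Run deferred updates only after the write has succeeded, so the
	// cached state is set to exactly what was written to disk rather
	// than being invalidated and recomputed later.
	for i := len(b.defers) - 1; i >= 0; i-- {
		b.defers[i]()
	}
	return nil
}

// truncatedStateCache stands in for the cached Replica.truncatedState.
type truncatedStateCache struct{ index uint64 }

// evalTruncateLog stages a truncation and defers the cache update to
// commit time, following the pattern used for ChangeFrozenRequest and
// GCRequest.
func evalTruncateLog(b *batch, cache *truncatedStateCache, newIndex uint64) {
	// ... stage the new TruncatedState in the batch ...
	b.Defer(func() { cache.index = newIndex })
}
```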
Currently, raft_log_queue will never truncate log entries that are older than the progress of the replica that is farthest behind. This leads to pathological cases in which a replica that is having trouble can result in unbounded log growth, as in the logs from #5970.

We should be willing to truncate applied log entries even if a lagging replica still needs them, to prevent the log from getting too big. The raft protocol handles this gracefully and will fall back to sending a snapshot to help the lagging replica recover. We should still prefer to keep log entries around until all replicas are caught up, because that's the most efficient path, but truncating the log is better than letting it grow without bound.