
storage: use raft log size (in bytes) as metric for truncation #7065

Closed
petermattis opened this issue Jun 6, 2016 · 8 comments
@petermattis
Collaborator

Raft log truncation currently uses the number of entries in the raft log as the metric for truncation. Using the size in bytes is preferable as it makes the tradeoff between using a snapshot or the raft log more obvious. The implementation should be fairly straightforward as we write raft log entries in Replica.append and delete them in Replica.append and Replica.TruncateLog. The size of the raft log would be stored under a new unreplicated local range key (keys.RaftLogSizeKey(rangeID)) and cached in memory.

See #6012.
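The bookkeeping described above can be sketched roughly as follows. This is a hypothetical, minimal illustration: `replicaLogState` and the per-entry byte sizes are stand-ins, and the real change would hook into `Replica.append` and `Replica.TruncateLog` and persist the counter under `keys.RaftLogSizeKey`:

```go
package main

import "fmt"

// replicaLogState caches the approximate size in bytes of a range's
// persisted raft log (hypothetical stand-in for the real Replica state).
type replicaLogState struct {
	raftLogSize int64
}

// append bumps the cached size by the encoded length of each new entry.
func (r *replicaLogState) append(entrySizes []int64) {
	for _, sz := range entrySizes {
		r.raftLogSize += sz
	}
}

// truncate decrements the cached size by the length of each dropped entry,
// clamping at zero so an imprecise estimate can't go negative.
func (r *replicaLogState) truncate(droppedSizes []int64) {
	for _, sz := range droppedSizes {
		r.raftLogSize -= sz
	}
	if r.raftLogSize < 0 {
		r.raftLogSize = 0
	}
}

func main() {
	var r replicaLogState
	r.append([]int64{100, 2048, 512})
	fmt.Println(r.raftLogSize) // 2660
	r.truncate([]int64{100, 2048})
	fmt.Println(r.raftLogSize) // 512
}
```

The truncation decision then compares `raftLogSize` against a byte threshold instead of an entry count.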

@d4l3k
Contributor

d4l3k commented Jun 8, 2016

@petermattis Is there any reason that we need to store the size of the raft log anywhere other than in memory? Servers don't restart very often, so we could just do a range scan over the raft log entries to figure out the size at startup.
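The startup scan proposed here might look like the following sketch, with a hypothetical `entry` type standing in for the real encoded raft entry:

```go
package main

import "fmt"

// entry is a hypothetical stand-in for a persisted raft log entry.
type entry struct {
	index uint64
	data  []byte
}

// scanLogSize iterates over a range's raft log entries and sums their
// encoded lengths to recover the log size in bytes after a restart.
func scanLogSize(entries []entry) int64 {
	var size int64
	for _, e := range entries {
		size += int64(len(e.data))
	}
	return size
}

func main() {
	log := []entry{
		{index: 1, data: make([]byte, 128)},
		{index: 2, data: make([]byte, 4096)},
	}
	fmt.Println(scanLogSize(log)) // 4224
}
```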

@petermattis
Collaborator Author

My implementation description followed the approach of other raft-log state such as the last index. While it is true that servers don't restart often, when they do we don't want a time-consuming scan operation. That said, @bdarnell should also comment on this approach; avoiding additional persistent state would be nice if it doesn't come with significant downsides.

@d4l3k
Contributor

d4l3k commented Jun 9, 2016

We're going to need code to do the scan anyway to upgrade old nodes. Might as well have that be the primary mechanism? Though I guess you'd potentially have to scan 64MB of raft log per range, which could be a lot.

@petermattis
Collaborator Author

That's a good point about needing the mechanism for upgrades. My concern is one of scale: if it takes 100ms to scan through the raft log to calculate its size, we've added significant startup time to load 1000 ranges. It seems like you can get started on the scanning mechanism, as we're going to need it, and we can wait for @bdarnell to weigh in on whether we should persistently store the size in bytes.

@bdarnell
Contributor

It's important that we store the last index instead of scanning for it because the range is basically useless if it doesn't know its last index; we would have to block at startup while doing that scan. We could scan for the range size asynchronously (modulo some tricky synchronization with log entries written during the scan); we just wouldn't be able to GC the log until it had completed. So it might be OK to avoid the persistent state and just scan the logs at startup.

On the other hand, consistency of this value is not that important (it's not in the consistency checker's scope) so we could do without the scan even as a one-time upgrade. We could start the counter at zero and GC the log when enough new entries have been written, regardless of how much had been present in the logs from previous runs.

On the third hand, one of the reasons we've wanted to avoid scanning the logs is because we let them grow so large. If we're better about log GC, maybe we can afford to just do the scan and block. This makes me nervous, though. It could set us up for trouble when something else prevents log GC.

Overall, I think my preference is to store the log size persistently, but we can skip the backfill unless it's easy.

@tbg
Member

tbg commented Jun 11, 2016

An option which hasn't been discussed is simply not restoring the in-memory value at all. With servers being fairly long-lived and there being no real need to get the number exactly right, we could just start counting from "morally zero" (realistically, a random value between zero and our threshold, to avoid all logs clashing and growing too much until they do) when a server comes up. After the first truncation, everyone would have the right number. That seems dead simple, and I would think it should perform well in practice, without the need to hit the disk for this sort of thing at all.


@bdarnell
Contributor

Yeah, I had that in mind in my second option but didn't spell it out. Since we're already storing things like the last and applied index, though, I think it's better to move towards storing this as well so that all replicas will make the same decisions about GC (instead of that decision varying based on who is leader).

d4l3k added a commit to d4l3k/cockroach that referenced this issue Jun 13, 2016

The raft log currently truncates based on the number of entries (max 10000). These entries can be of any size, so it makes more sense to use the byte size of the raft log as the metric for truncation. This change will make the raft log be used only when it is smaller than sending a snapshot.

See cockroachdb#7065.
d4l3k added a commit to d4l3k/cockroach that referenced this issue Jun 23, 2016
The raft log currently truncates based on the number of entries (max 10000). These entries can be of any size, so it makes more sense to use the byte size of the raft log as the metric for truncation.

If the raft log is greater than 64MB and a node is behind, it will truncate the log to the quorum committed index.

See cockroachdb#7065.
d4l3k added a commit to d4l3k/cockroach that referenced this issue Jun 29, 2016
The raft log currently truncates based on the number of entries (max 10000). These entries can be of any size, so it makes more sense to use the byte size of the raft log as the metric for truncation.

If the raft log is greater than 64MB and a node is behind, it will truncate the log to the quorum committed index.

The in-memory size is the approximate size in bytes of the persisted raft log. On server restart, this value is assumed to be zero to avoid costly scans of the raft log. After the first raft log truncation it will be correct.

See cockroachdb#7065.
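The truncation rule in this commit message could be sketched as below. `quorumIndex` and `truncateTo` are hypothetical helpers, and the median computation assumes an odd number of replicas:

```go
package main

import (
	"fmt"
	"sort"
)

// Hypothetical 64MB threshold from the commit message.
const raftLogMaxSize = 64 << 20

// quorumIndex returns the highest log index that a majority of replicas
// have reached: the median of the sorted match indexes (odd-sized groups).
func quorumIndex(matchIndexes []uint64) uint64 {
	sorted := append([]uint64(nil), matchIndexes...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	return sorted[(len(sorted)-1)/2]
}

// truncateTo picks a truncation index: normally the slowest replica's index,
// but if the log exceeds 64MB and some replica is behind, the quorum
// committed index, sacrificing the laggard (it will need a snapshot).
func truncateTo(logSize int64, matchIndexes []uint64) (uint64, bool) {
	slowest := matchIndexes[0]
	for _, m := range matchIndexes[1:] {
		if m < slowest {
			slowest = m
		}
	}
	q := quorumIndex(matchIndexes)
	if logSize > raftLogMaxSize && slowest < q {
		return q, true
	}
	return slowest, logSize > raftLogMaxSize
}

func main() {
	// One replica far behind and a 100MB log: truncate to the quorum index.
	idx, ok := truncateTo(100<<20, []uint64{10, 95, 100})
	fmt.Println(idx, ok) // 95 true
}
```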
@petermattis petermattis added this to the Q3 milestone Jul 11, 2016
@petermattis
Collaborator Author

@d4l3k With #7438 merged, can this issue be closed?

@d4l3k d4l3k closed this as completed Jul 11, 2016