
storage: use raft log size (in bytes) as metric for truncation #7065

Closed
petermattis opened this issue Jun 6, 2016 · 8 comments
@petermattis
Collaborator

Raft log truncation currently uses the number of entries in the raft log as the metric for truncation. Using the size in bytes is preferable as it makes the tradeoff between using a snapshot or the raft log more obvious. The implementation should be fairly straightforward as we write raft log entries in Replica.append and delete them in Replica.append and Replica.TruncateLog. The size of the raft log would be stored under a new unreplicated local range key (keys.RaftLogSizeKey(rangeID)) and cached in memory.

See #6012.
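The bookkeeping described above can be sketched roughly as follows. This is a hypothetical, minimal illustration: `replicaLogState` and the per-entry byte sizes are stand-ins, and the real change would hook into `Replica.append` and `Replica.TruncateLog` and persist the counter under `keys.RaftLogSizeKey`:

```go
package main

import "fmt"

// replicaLogState caches the approximate size in bytes of a range's
// persisted raft log (hypothetical stand-in for the real Replica state).
type replicaLogState struct {
	raftLogSize int64
}

// append bumps the cached size by the encoded length of each new entry.
func (r *replicaLogState) append(entrySizes []int64) {
	for _, sz := range entrySizes {
		r.raftLogSize += sz
	}
}

// truncate decrements the cached size by the length of each dropped entry,
// clamping at zero so an imprecise estimate can't go negative.
func (r *replicaLogState) truncate(droppedSizes []int64) {
	for _, sz := range droppedSizes {
		r.raftLogSize -= sz
	}
	if r.raftLogSize < 0 {
		r.raftLogSize = 0
	}
}

func main() {
	var r replicaLogState
	r.append([]int64{100, 2048, 512})
	fmt.Println(r.raftLogSize) // 2660
	r.truncate([]int64{100, 2048})
	fmt.Println(r.raftLogSize) // 512
}
```

The truncation decision then compares `raftLogSize` against a byte threshold instead of an entry count.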

@d4l3k
Contributor

d4l3k commented Jun 8, 2016

@petermattis Is there any reason that we need to store the size of the raft log anywhere other than in memory? Servers don't restart very often, so we could just do a range scan over the raft log entries to figure out the size at startup.
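The startup scan proposed here might look like the following sketch, with a hypothetical `entry` type standing in for the real encoded raft entry:

```go
package main

import "fmt"

// entry is a hypothetical stand-in for a persisted raft log entry.
type entry struct {
	index uint64
	data  []byte
}

// scanLogSize iterates over a range's raft log entries and sums their
// encoded lengths to recover the log size in bytes after a restart.
func scanLogSize(entries []entry) int64 {
	var size int64
	for _, e := range entries {
		size += int64(len(e.data))
	}
	return size
}

func main() {
	log := []entry{
		{index: 1, data: make([]byte, 128)},
		{index: 2, data: make([]byte, 4096)},
	}
	fmt.Println(scanLogSize(log)) // 4224
}
```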

@petermattis
Collaborator Author

My implementation description followed the approach of other raft-log state such as the last index. While it is true that servers don't restart often, when they do we don't want a time-consuming scan operation. That said, @bdarnell should also comment on this approach; avoiding additional persistent state would be nice if it doesn't come with significant downsides.

@d4l3k
Contributor

d4l3k commented Jun 9, 2016

We're going to need code to do the scan anyway to upgrade old nodes. Might as well have that be the primary mechanism? Though I guess you'd potentially have to scan 64MB of raft log per range, which could be a lot.

@petermattis
Collaborator Author

That's a good point about needing the mechanism for upgrades. My concern is one of scale: if it takes 100ms to scan through the raft log to calculate its size, we've added significant startup time to load 1000 ranges. It seems like you can get started on the scanning mechanism, as we're going to need it, and we can wait for @bdarnell to weigh in on whether we should persistently store the size in bytes.

@bdarnell
Contributor

It's important that we store the last index instead of scanning for it because the range is basically useless if it doesn't know its last index; we would have to block at startup while doing that scan. We could scan for the range size asynchronously (modulo some tricky synchronization with log entries written during the scan); we just wouldn't be able to GC the log until it had completed. So it might be OK to avoid the persistent state and just scan the logs at startup.

On the other hand, consistency of this value is not that important (it's not in the consistency checker's scope) so we could do without the scan even as a one-time upgrade. We could start the counter at zero and GC the log when enough new entries have been written, regardless of how much had been present in the logs from previous runs.

On the third hand, one of the reasons we've wanted to avoid scanning the logs is because we let them grow so large. If we're better about log GC, maybe we can afford to just do the scan and block. This makes me nervous, though. It could set us up for trouble when something else prevents log GC.

Overall, I think my preference is to store the log size persistently, but we can skip the backfill unless it's easy.

@tbg
Member

tbg commented Jun 11, 2016

An option which hasn't been discussed is simply not restoring the in-memory value at all. With servers being fairly long-lived and there being no real need to get the number exactly right, we could just start counting from "morally zero" (realistically, a random value between zero and our threshold, to avoid all logs clashing and growing too much until they do) when a server comes up. After the first truncation, everyone would have the right number. That seems dead simple, and I would think it should perform well in practice, without the need to hit the disk for this sort of thing at all.


@bdarnell
Contributor

Yeah, I had that in mind in my second option but didn't spell it out. Since we're already storing things like the last and applied index, though, I think it's better to move towards storing this as well so that all replicas will make the same decisions about GC (instead of that decision varying based on who is leader).

d4l3k added a commit to d4l3k/cockroach that referenced this issue Jun 13, 2016

The raft log currently truncates based on the number of entries (max 10000). These entries can be of any size, so it makes more sense to use the byte size of the raft log as the metric for truncation. This change will make the raft log be used only when it is smaller than sending a snapshot.

See cockroachdb#7065.
d4l3k added a commit to d4l3k/cockroach that referenced this issue Jun 23, 2016
The raft log currently truncates based on the number of entries (max 10000). These entries can be of any size, so it makes more sense to use the byte size of the raft log as the metric for truncation.

If the raft log is greater than 64MB and a node is behind, it will truncate the log to the quorum committed index.

See cockroachdb#7065.
d4l3k added a commit to d4l3k/cockroach that referenced this issue Jun 29, 2016
The raft log currently truncates based on the number of entries (max 10000). These entries can be of any size, so it makes more sense to use the byte size of the raft log as the metric for truncation.

If the raft log is greater than 64MB and a node is behind, it will truncate the log to the quorum committed index.

The in-memory size is the approximate size in bytes of the persisted raft log. On server restart, this value is assumed to be zero to avoid costly scans of the raft log. After the first raft log truncation it will be correct.

See cockroachdb#7065.
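The truncation rule in this commit message could be sketched as below. `quorumIndex` and `truncateTo` are hypothetical helpers, and the median computation assumes an odd number of replicas:

```go
package main

import (
	"fmt"
	"sort"
)

// Hypothetical 64MB threshold from the commit message.
const raftLogMaxSize = 64 << 20

// quorumIndex returns the highest log index that a majority of replicas
// have reached: the median of the sorted match indexes (odd-sized groups).
func quorumIndex(matchIndexes []uint64) uint64 {
	sorted := append([]uint64(nil), matchIndexes...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	return sorted[(len(sorted)-1)/2]
}

// truncateTo picks a truncation index: normally the slowest replica's index,
// but if the log exceeds 64MB and some replica is behind, the quorum
// committed index, sacrificing the laggard (it will need a snapshot).
func truncateTo(logSize int64, matchIndexes []uint64) (uint64, bool) {
	slowest := matchIndexes[0]
	for _, m := range matchIndexes[1:] {
		if m < slowest {
			slowest = m
		}
	}
	q := quorumIndex(matchIndexes)
	if logSize > raftLogMaxSize && slowest < q {
		return q, true
	}
	return slowest, logSize > raftLogMaxSize
}

func main() {
	// One replica far behind and a 100MB log: truncate to the quorum index.
	idx, ok := truncateTo(100<<20, []uint64{10, 95, 100})
	fmt.Println(idx, ok) // 95 true
}
```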
@petermattis petermattis added this to the Q3 milestone Jul 11, 2016
@petermattis
Collaborator Author

@d4l3k With #7438 merged, can this issue be closed?

@d4l3k d4l3k closed this as completed Jul 11, 2016