storage: avoid creating excessively large snapshots #7581

Closed
petermattis opened this issue Jul 1, 2016 · 8 comments
Labels: S-1-stability (Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting)
Milestone: Q3

@petermattis (Collaborator)

We sometimes see out-of-memory errors when creating excessively large snapshots. For example, from #6991:

I160701 14:45:05.143248 storage/replica_raftstorage.go:524  generated snapshot for range 403 at index 3533112 in 31.747833551s. encoded size=1072891308, 6966 KV pairs, 1677671 log entries
fatal error: runtime: out of memory

Notice the huge raft log. Because the raft log is part of the replicated state, we must send all of it with the snapshot in order to avoid divergence of the new replica. (@bdarnell Perhaps the applied portion of the raft log should not be considered during consistency checks.)

It should be possible to add a failsafe to Replica.snapshot() so that if we see that a very large snapshot is being created, we return raft.ErrSnapshotTemporarilyUnavailable and possibly add the replica to the raft-log-gc queue.
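
A minimal sketch of what such a failsafe could look like, assuming a size estimate is available before generating the snapshot (maxSnapshotSizeBytes, estimatedSnapshotSize, and addToRaftLogGCQueue are illustrative names, not the actual CockroachDB API):

```go
// Illustrative failsafe: refuse to generate a snapshot whose estimated size
// exceeds a threshold, queue the replica for raft log truncation, and return
// raft.ErrSnapshotTemporarilyUnavailable so Raft simply retries later.
package storage

import (
	"github.com/coreos/etcd/raft" // etcd raft package; import path varies by version
	"github.com/coreos/etcd/raftpb"
)

const maxSnapshotSizeBytes = 256 << 20 // assumed threshold, for illustration only

type replicaSketch struct {
	estimatedSnapshotSize func() int64                    // placeholder: size estimate
	addToRaftLogGCQueue   func()                          // placeholder: queue log truncation
	generateSnapshot      func() (raftpb.Snapshot, error) // placeholder: existing snapshot path
}

// Snapshot wraps the existing snapshot path with the proposed size check.
func (r *replicaSketch) Snapshot() (raftpb.Snapshot, error) {
	if r.estimatedSnapshotSize() > maxSnapshotSizeBytes {
		// Let the raft-log-gc queue shrink the log first, then have Raft retry.
		r.addToRaftLogGCQueue()
		return raftpb.Snapshot{}, raft.ErrSnapshotTemporarilyUnavailable
	}
	return r.generateSnapshot()
}
```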

@tbg (Member) commented Jul 1, 2016

The Raft log actually isn't part of the replicated state. The part which needs to be replicated is the truncated state (which is used to compute the first index). It seems that we could get away with not sending the large suffix when it is largely uncommitted. We probably can't artificially lower the committed index (which would allow us to remove a tail of committed commands, too), because the receiving side might already have acknowledged entries that we would then be overwriting.

@petermattis (Collaborator, Author)

Oh, I completely misread RaftLogKey. You're correct that the raft-log isn't part of the replicated state. Seems like it should be straightforward to not send the applied portion of the raft log.

@tbg (Member) commented Jul 1, 2016

Hmm, I just checked and we only send the applied log entries. Bummer, doesn't seem like there's a corner we can cut unless we make the truncated state unreplicated (I'd rather not).

@petermattis (Collaborator, Author)

Eh, looks like snapshot() iterates over the raft log from truncState.Index + 1 to appliedIndex + 1. Doesn't that contain all of the applied entries?

@tbg (Member) commented Jul 1, 2016

Yes, all of those that haven't been truncated (which in an ideal world is very few, thanks to aggressive Raft log truncation). My point about truncatedState is that we could easily avoid sending all of those entries if we could modify the truncatedState, but that state is replicated. Maybe we want to change that, but I haven't thought it through. If there's no downside elsewhere, the benefit for snapshots would definitely make it worth considering.

@petermattis (Collaborator, Author)

Got it. So the raft log entries themselves are not part of the replicated state, but the index we've truncated to (the truncated state) is.
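
To make that concrete, here is a rough sketch of the shapes involved, with simplified types (the real code in replica_raftstorage.go differs in detail): the truncated state is small and replicated, while a snapshot still has to ship every log entry in (TruncatedState.Index, appliedIndex].

```go
// Simplified sketch: the truncated state (index/term up to which the log has
// been discarded) is replicated; the log entries themselves are not, but a
// snapshot still includes every entry after the truncation point up to and
// including the applied index, matching the iteration described above.
package storage

type truncatedStateSketch struct {
	Index uint64 // highest log index already discarded from the local log
	Term  uint64
}

type entrySketch struct {
	Index uint64
	Data  []byte
}

// entriesForSnapshot gathers the log entries a snapshot would carry:
// (truncState.Index, appliedIndex], i.e. truncState.Index+1 through appliedIndex.
func entriesForSnapshot(
	truncState truncatedStateSketch, appliedIndex uint64, log map[uint64]entrySketch,
) []entrySketch {
	var out []entrySketch
	for idx := truncState.Index + 1; idx <= appliedIndex; idx++ {
		if e, ok := log[idx]; ok {
			out = append(out, e)
		}
	}
	return out
}
```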

@petermattis (Collaborator, Author)

@dt Adding a failsafe to Replica.snapshot() as mentioned in the original message should be straightforward. As usual, probably the most difficult part will be adding a test.

petermattis added the S-1-stability label Jul 11, 2016
petermattis modified the milestone: Q3 Jul 11, 2016
dt added commits to dt/cockroach that referenced this issue Jul 12 to Jul 18, 2016

If a range needs to be split, return an err rather than attempting to generate a snapshot.
This avoids generating excessively large snapshots.

Suggested in cockroachdb#7581.
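
A sketch of the check described in that commit message; needsSplit, errSplitRequired, and the size fields are illustrative names, not the actual change in the referenced commits:

```go
// Sketch of the approach in the commits above: refuse to generate a snapshot
// for a range that is already due to be split, and surface an error instead
// so the caller retries after the split has reduced the range's size.
package storage

import (
	"errors"

	"github.com/coreos/etcd/raftpb"
)

var errSplitRequired = errors.New("range must be split before snapshotting")

type rangeSketch struct {
	sizeBytes    int64                           // current size of the range's data
	maxRangeSize int64                           // split threshold from the zone config
	snapshot     func() (raftpb.Snapshot, error) // placeholder: existing snapshot path
}

func (r *rangeSketch) needsSplit() bool {
	return r.sizeBytes > r.maxRangeSize
}

func (r *rangeSketch) maybeSnapshot() (raftpb.Snapshot, error) {
	if r.needsSplit() {
		// Generating the snapshot now would be excessively large; bail out.
		return raftpb.Snapshot{}, errSplitRequired
	}
	return r.snapshot()
}
```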
@dt (Member) commented Jul 18, 2016

#7788

dt closed this as completed Jul 18, 2016
nvanbenschoten added commits to nvanbenschoten/cockroach that referenced this issue Dec 9, 2017 through Dec 20, 2017

In a privately reported user issue, we've seen that [our attempts](cockroachdb#7788)
at [preventing large snapshots](cockroachdb#7581)
can result in replica unavailability. Our current approach to limiting
large snapshots assumes that it's ok to block snapshots indefinitely
while waiting for a range to first split. Unfortunately, this can create
a dependency cycle where a range requires a snapshot to split (because it
can't achieve an up-to-date quorum without it) but isn't allowed to perform
a snapshot until its size is reduced below the threshold. This can result
in unavailability even when a majority of replicas remain live.

Currently, we still need this snapshot size limit because unbounded snapshots
can result in OOM errors that crash entire nodes. However, once snapshots
are streamed from disk to disk, never needing to buffer in-memory on the
sending or receiving side, we should be able to remove any snapshot size
limit (see cockroachdb#16954).

As a holdover, this change introduces a `permitLargeSnapshots` flag on a
replica which is set when the replica is too large to snapshot but observes
splits failing. When set, the flag allows snapshots to ignore the size
limit until the snapshot goes through and splits are able to succeed
again.

Release note (bug fix): Fixed a scenario where a range that is too big
to snapshot can lose availability even with a majority of nodes alive.
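
A rough sketch of that holdover: the `permitLargeSnapshots` name comes from the commit message, while the surrounding fields and methods are illustrative only:

```go
// Rough sketch of the holdover described above: once a replica is too large
// to snapshot and we observe splits failing, set permitLargeSnapshots so the
// size limit is bypassed until the range can split again. All names other
// than permitLargeSnapshots are illustrative, not the actual implementation.
package storage

import "sync"

type largeSnapshotSketch struct {
	mu                   sync.Mutex
	permitLargeSnapshots bool
	snapshotSizeBytes    int64
	maxSnapshotSizeBytes int64
}

// onSplitFailed is called when a split of an oversized range fails, e.g.
// because the range cannot reach quorum without first receiving a snapshot.
func (r *largeSnapshotSketch) onSplitFailed() {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.snapshotSizeBytes > r.maxSnapshotSizeBytes {
		r.permitLargeSnapshots = true
	}
}

// snapshotAllowed reports whether a snapshot may be generated right now.
func (r *largeSnapshotSketch) snapshotAllowed() bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.permitLargeSnapshots || r.snapshotSizeBytes <= r.maxSnapshotSizeBytes
}

// onSplitSucceeded clears the escape hatch once splits work again.
func (r *largeSnapshotSketch) onSplitSucceeded() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.permitLargeSnapshots = false
}
```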