storage: avoid creating excessively large snapshots #7581

Closed
petermattis opened this issue Jul 1, 2016 · 8 comments
Labels: S-1-stability (Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting)
Milestone: Q3

@petermattis (Collaborator)

We sometimes see out-of-memory errors when creating excessively large snapshots. For example, from #6991:

I160701 14:45:05.143248 storage/replica_raftstorage.go:524  generated snapshot for range 403 at index 3533112 in 31.747833551s. encoded size=1072891308, 6966 KV pairs, 1677671 log entries
fatal error: runtime: out of memory

Notice the huge raft log. Because the raft log is part of the replicated state, we must send all of it with the snapshot in order to avoid divergence of the new replica. (@bdarnell Perhaps the applied portion of the raft log should not be considered during consistency checks.)

It should be possible to add a failsafe to Replica.snapshot() so that if we see that a very large snapshot is being created, we return raft.ErrSnapshotTemporarilyUnavailable and possibly add the replica to the raft-log-gc queue.
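
A minimal sketch of what such a failsafe could look like, assuming a size estimate is available before generating the snapshot (maxSnapshotSizeBytes, estimatedSnapshotSize, and addToRaftLogGCQueue are illustrative names, not the actual CockroachDB API):

```go
// Illustrative failsafe: refuse to generate a snapshot whose estimated size
// exceeds a threshold, queue the replica for raft log truncation, and return
// raft.ErrSnapshotTemporarilyUnavailable so Raft simply retries later.
package storage

import (
	"github.com/coreos/etcd/raft" // etcd raft package; import path varies by version
	"github.com/coreos/etcd/raftpb"
)

const maxSnapshotSizeBytes = 256 << 20 // assumed threshold, for illustration only

type replicaSketch struct {
	estimatedSnapshotSize func() int64                    // placeholder: size estimate
	addToRaftLogGCQueue   func()                          // placeholder: queue log truncation
	generateSnapshot      func() (raftpb.Snapshot, error) // placeholder: existing snapshot path
}

// Snapshot wraps the existing snapshot path with the proposed size check.
func (r *replicaSketch) Snapshot() (raftpb.Snapshot, error) {
	if r.estimatedSnapshotSize() > maxSnapshotSizeBytes {
		// Let the raft-log-gc queue shrink the log first, then have Raft retry.
		r.addToRaftLogGCQueue()
		return raftpb.Snapshot{}, raft.ErrSnapshotTemporarilyUnavailable
	}
	return r.generateSnapshot()
}
```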

@tbg (Member) commented Jul 1, 2016

The Raft log actually isn't part of the replicated state. The part which needs to be replicated is the truncated state (which is used to compute the first index). It seems that we could get away with not sending the large suffix when it is largely uncommitted. We probably can't artificially lower the committed index (which would allow us to remove a tail of committed commands, too), because the receiving side might already have acknowledged entries that we would then be overwriting.

@petermattis (Collaborator, Author)

Oh, I completely misread RaftLogKey. You're correct that the raft-log isn't part of the replicated state. Seems like it should be straightforward to not send the applied portion of the raft log.

@tbg (Member) commented Jul 1, 2016

Hmm, I just checked and we only send the applied log entries. Bummer, doesn't seem like there's a corner we can cut unless we make the truncated state unreplicated (I'd rather not).

@petermattis (Collaborator, Author)

Eh, looks like snapshot() iterates over the raft log from truncState.Index + 1 to appliedIndex + 1. Doesn't that contain all of the applied entries?

@tbg (Member) commented Jul 1, 2016

Yes, all of those that haven't been truncated (which in an ideal world is very few, thanks to aggressive Raft log truncation). My point about truncatedState is that we could easily avoid sending all of those entries if we could modify the truncatedState, but that state is replicated. Maybe we want to change that, but I haven't thought it through. If there's no downside elsewhere, the benefit for snapshots would definitely make it worth considering.

@petermattis (Collaborator, Author)

Got it. So the raft log entries themselves are not part of the replicated state, but the index we've truncated to (the truncated state) is.
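
To make that concrete, here is a rough sketch of the shapes involved, with simplified types (the real code in replica_raftstorage.go differs in detail): the truncated state is small and replicated, while a snapshot still has to ship every log entry in (TruncatedState.Index, appliedIndex].

```go
// Simplified sketch: the truncated state (index/term up to which the log has
// been discarded) is replicated; the log entries themselves are not, but a
// snapshot still includes every entry after the truncation point up to and
// including the applied index, matching the iteration described above.
package storage

type truncatedStateSketch struct {
	Index uint64 // highest log index already discarded from the local log
	Term  uint64
}

type entrySketch struct {
	Index uint64
	Data  []byte
}

// entriesForSnapshot gathers the log entries a snapshot would carry:
// (truncState.Index, appliedIndex], i.e. truncState.Index+1 through appliedIndex.
func entriesForSnapshot(
	truncState truncatedStateSketch, appliedIndex uint64, log map[uint64]entrySketch,
) []entrySketch {
	var out []entrySketch
	for idx := truncState.Index + 1; idx <= appliedIndex; idx++ {
		if e, ok := log[idx]; ok {
			out = append(out, e)
		}
	}
	return out
}
```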

@petermattis (Collaborator, Author)

@dt Adding a failsafe to Replica.snapshot() as mentioned in the original message should be straightforward. As usual, probably the most difficult part will be adding a test.

petermattis added the S-1-stability label Jul 11, 2016
petermattis modified the milestone: Q3 Jul 11, 2016
dt added commits to dt/cockroach that referenced this issue Jul 12 to Jul 18, 2016

If a range needs to be split, return an err rather than attempting to generate a snapshot.
This avoids generating excessively large snapshots.

Suggested in cockroachdb#7581.
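
A sketch of the check described in that commit message; needsSplit, errSplitRequired, and the size fields are illustrative names, not the actual change in the referenced commits:

```go
// Sketch of the approach in the commits above: refuse to generate a snapshot
// for a range that is already due to be split, and surface an error instead
// so the caller retries after the split has reduced the range's size.
package storage

import (
	"errors"

	"github.com/coreos/etcd/raftpb"
)

var errSplitRequired = errors.New("range must be split before snapshotting")

type rangeSketch struct {
	sizeBytes    int64                           // current size of the range's data
	maxRangeSize int64                           // split threshold from the zone config
	snapshot     func() (raftpb.Snapshot, error) // placeholder: existing snapshot path
}

func (r *rangeSketch) needsSplit() bool {
	return r.sizeBytes > r.maxRangeSize
}

func (r *rangeSketch) maybeSnapshot() (raftpb.Snapshot, error) {
	if r.needsSplit() {
		// Generating the snapshot now would be excessively large; bail out.
		return raftpb.Snapshot{}, errSplitRequired
	}
	return r.snapshot()
}
```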
@dt (Member) commented Jul 18, 2016

#7788

dt closed this as completed Jul 18, 2016
nvanbenschoten added commits to nvanbenschoten/cockroach that referenced this issue Dec 9, 2017 through Dec 20, 2017

In a privately reported user issue, we've seen that [our attempts](cockroachdb#7788)
at [preventing large snapshots](cockroachdb#7581)
can result in replica unavailability. Our current approach to limiting
large snapshots assumes that it's ok to block snapshots indefinitely
while waiting for a range to first split. Unfortunately, this can create
a dependency cycle where a range requires a snapshot to split (because it
can't achieve an up-to-date quorum without it) but isn't allowed to perform
a snapshot until its size is reduced below the threshold. This can result
in unavailability even when a majority of replicas remain live.

Currently, we still need this snapshot size limit because unbounded snapshots
can result in OOM errors that crash entire nodes. However, once snapshots
are streamed from disk to disk, never needing to buffer in-memory on the
sending or receiving side, we should be able to remove any snapshot size
limit (see cockroachdb#16954).

As a holdover, this change introduces a `permitLargeSnapshots` flag on a
replica which is set when the replica is too large to snapshot but observes
splits failing. When set, the flag allows snapshots to ignore the size
limit until the snapshot goes through and splits are able to succeed
again.

Release note (bug fix): Fixed a scenario where a range that is too big
to snapshot can lose availability even with a majority of nodes alive.
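
A rough sketch of that holdover: the `permitLargeSnapshots` name comes from the commit message, while the surrounding fields and methods are illustrative only:

```go
// Rough sketch of the holdover described above: once a replica is too large
// to snapshot and we observe splits failing, set permitLargeSnapshots so the
// size limit is bypassed until the range can split again. All names other
// than permitLargeSnapshots are illustrative, not the actual implementation.
package storage

import "sync"

type largeSnapshotSketch struct {
	mu                   sync.Mutex
	permitLargeSnapshots bool
	snapshotSizeBytes    int64
	maxSnapshotSizeBytes int64
}

// onSplitFailed is called when a split of an oversized range fails, e.g.
// because the range cannot reach quorum without first receiving a snapshot.
func (r *largeSnapshotSketch) onSplitFailed() {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.snapshotSizeBytes > r.maxSnapshotSizeBytes {
		r.permitLargeSnapshots = true
	}
}

// snapshotAllowed reports whether a snapshot may be generated right now.
func (r *largeSnapshotSketch) snapshotAllowed() bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.permitLargeSnapshots || r.snapshotSizeBytes <= r.maxSnapshotSizeBytes
}

// onSplitSucceeded clears the escape hatch once splits work again.
func (r *largeSnapshotSketch) onSplitSucceeded() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.permitLargeSnapshots = false
}
```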