storage: avoid creating excessively large snapshots #7581
Comments
The Raft log actually isn't part of the replicated state. The part which is replicated is the truncated state, i.e. the index up to which the log has been truncated. -- Tobias
Oh, I completely misread
Hmm, I just checked and we only send the applied log entries. Bummer, doesn't seem like there's a corner we can cut unless we make the truncated state unreplicated (I'd rather not).
Eh, looks like
Yes, all of those that haven't been truncated (which in an ideal world is very few, due to aggressive Raft log truncation). My point about truncatedState is that we could easily avoid sending all of those entries if we could modify the truncatedState, but that is replicated. Maybe we want to change that, but I haven't thought it through. If there's no downside to it elsewhere, the benefit for snapshots would definitely make it worth considering.
Got it. So the raft log entries are not part of the replicated state, but the index we've truncated to is.
@dt Adding a failsafe to
If a range needs to be split, return an err rather than attempting to generate a snapshot. This avoids generating excessively large snapshots. Suggested in cockroachdb#7581.
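A minimal sketch of the check this commit describes (the names `replicaSizeInfo`, `maxRangeBytes`, and the sentinel error are illustrative assumptions, not the actual CockroachDB code): if the range has outgrown its split threshold, snapshot generation returns an error instead of building an oversized snapshot.

```go
package storage

import "errors"

// errRangeTooLargeForSnapshot is a hypothetical sentinel returned when a
// range should split before it may be snapshotted.
var errRangeTooLargeForSnapshot = errors.New("range too large; awaiting split before snapshotting")

// replicaSizeInfo is a stand-in for the size bookkeeping a Replica keeps.
type replicaSizeInfo struct {
	totalBytes    int64 // approximate size of the range's replicated data
	maxRangeBytes int64 // split threshold from the zone config
}

// checkSnapshotSize is the failsafe: callers invoke it before materializing
// a snapshot and propagate the error instead of building an oversized one.
func (r replicaSizeInfo) checkSnapshotSize() error {
	if r.totalBytes > r.maxRangeBytes {
		return errRangeTooLargeForSnapshot
	}
	return nil
}
```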
In a privately reported user issue, we've seen that [our attempts](cockroachdb#7788) at [preventing large snapshots](cockroachdb#7581) can result in replica unavailability. Our current approach to limiting large snapshots assumes that it's OK to block snapshots indefinitely while waiting for a range to first split. Unfortunately, this can create a dependency cycle where a range requires a snapshot to split (because it can't achieve an up-to-date quorum without it) but isn't allowed to perform a snapshot until its size is reduced below the threshold. This can result in unavailability even when a majority of replicas remain live. Currently, we still need this snapshot size limit because unbounded snapshots can result in OOM errors that crash entire nodes. However, once snapshots are streamed from disk to disk, never needing to be buffered in memory on the sending or receiving side, we should be able to remove any snapshot size limit (see cockroachdb#16954). As a holdover, this change introduces a `permitLargeSnapshots` flag on a replica, which is set when the replica is too large to snapshot but observes splits failing. When set, the flag allows snapshots to ignore the size limit until the snapshot goes through and splits are able to succeed again. Release note (bug fix): Fixed a scenario where a range that is too big to snapshot can lose availability even with a majority of nodes alive.
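A rough sketch of that holdover under assumed names (`largeSnapshotGate`, `reportSplitFailed`, and the thresholds are illustrative, not the actual Replica implementation): a failed split on an oversized range unlocks large snapshots until splits succeed again, which breaks the split/snapshot dependency cycle described above.

```go
package storage

import "sync"

// largeSnapshotGate sketches the permitLargeSnapshots holdover: it tracks
// whether an oversized range may be snapshotted anyway because splits are
// failing (typically because a follower first needs that very snapshot).
type largeSnapshotGate struct {
	mu                   sync.Mutex
	sizeBytes            int64 // approximate range size
	maxSnapshotBytes     int64 // normal snapshot size limit
	permitLargeSnapshots bool
}

// reportSplitFailed is called when a split of this range fails; if the
// range is over the snapshot limit, large snapshots are temporarily allowed.
func (g *largeSnapshotGate) reportSplitFailed() {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.sizeBytes > g.maxSnapshotBytes {
		g.permitLargeSnapshots = true
	}
}

// reportSplitSucceeded re-arms the size limit once splits work again.
func (g *largeSnapshotGate) reportSplitSucceeded() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.permitLargeSnapshots = false
}

// canSnapshot reports whether a snapshot may be generated right now.
func (g *largeSnapshotGate) canSnapshot() bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	return g.sizeBytes <= g.maxSnapshotBytes || g.permitLargeSnapshots
}
```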
We sometimes see out of memory errors when creating excessively large snapshots. For example, from #6991:
Notice the huge raft log. Because the raft log is part of the replicated state, we must send all of it with the snapshot in order to avoid divergence of the new replica. (@bdarnell: perhaps the applied portion of the raft log should not be considered during consistency checks.)
It should be possible to add a failsafe to `Replica.snapshot()` so that if we see a very large snapshot is being created we return `raft.ErrSnapshotTemporarilyUnavailable` and possibly add the replica to the raft-log-gc queue.
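For illustration, here is roughly how such a failsafe could plug into etcd/raft's snapshot contract. Only `raft.ErrSnapshotTemporarilyUnavailable` is the real etcd/raft sentinel (the raft library treats it as a retryable condition rather than a fatal error); the struct, field names, threshold, and queue hook below are hypothetical stand-ins, not the actual `Replica.snapshot()` code.

```go
package storage

import (
	"github.com/coreos/etcd/raft"
	"github.com/coreos/etcd/raft/raftpb"
)

// raftLogQueue is a stand-in for the store's raft-log-gc queue.
type raftLogQueue interface {
	MaybeAdd(rangeID int64)
}

// snapshotReplica holds the fields this sketch of Replica.snapshot() consults.
type snapshotReplica struct {
	rangeID          int64
	snapshotBytes    int64 // estimated size of the snapshot about to be built
	maxSnapshotBytes int64 // hypothetical safety threshold
	logQueue         raftLogQueue
}

// snapshot declines to build an oversized snapshot: returning
// ErrSnapshotTemporarilyUnavailable tells Raft to retry the snapshot later
// instead of treating the failure as fatal, and nudging the raft-log-gc
// queue gives log truncation a chance to shrink the range in the meantime.
func (r *snapshotReplica) snapshot() (raftpb.Snapshot, error) {
	if r.snapshotBytes > r.maxSnapshotBytes {
		r.logQueue.MaybeAdd(r.rangeID)
		return raftpb.Snapshot{}, raft.ErrSnapshotTemporarilyUnavailable
	}
	// ... build and return the real snapshot here ...
	return raftpb.Snapshot{}, nil
}
```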