storage: Asynchronous snapshots #6204

Closed
bdarnell opened this issue Apr 21, 2016 · 0 comments
@bdarnell
Now that Replica.Snapshot is safely isolated from the rest of the Replica (#6187), we can use etcd/raft's asynchronous snapshot feature. This will avoid blocking the processRaft goroutine during the expensive snapshot generation.

The first time Snapshot() is called, it should start a goroutine and immediately return raft.ErrSnapshotTemporarilyUnavailable. Raft will occasionally poll for results by calling Snapshot again. Until the result is ready, we keep returning ErrSnapshotTemporarilyUnavailable (without starting new goroutines); once it is ready, we return it.

A few subtle points:

  • In rare cases raft may decide it doesn't need the snapshot, so we should be sure to discard snapshots that go unused for too long.
  • When expanding the replication factor (e.g. from 1 to 3 or 3 to 5) we may be able to reuse the same snapshot twice, although it's probably not worth optimizing for this case.
  • After starting the goroutine, it might be best to wait with a short timeout so that we can get the snapshot in a single attempt when it's small. This will be especially important to keep the tests fast.
@bdarnell bdarnell added this to the Q2 milestone Apr 21, 2016
@bdarnell bdarnell self-assigned this Apr 22, 2016
bdarnell added a commit to bdarnell/cockroach that referenced this issue Apr 22, 2016
Blocking the processRaft goroutine for too long is problematic. In
extreme cases it can cause heartbeats to be missed and new elections to
start (a major cause of cockroachdb#5970). This commit moves the work of snapshot
generation to an asynchronous goroutine.

Fixes cockroachdb#6204.
bdarnell added a commit to bdarnell/cockroach that referenced this issue Apr 25, 2016
Blocking the processRaft goroutine for too long is problematic. In
extreme cases it can cause heartbeats to be missed and new elections to
start (a major cause of cockroachdb#5970). This commit moves the work of snapshot
generation to an asynchronous goroutine.

Fixes cockroachdb#6204.