storage: preemptive snapshots #7819

Closed
petermattis opened this issue Jul 13, 2016 · 0 comments
Labels
S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting
Milestone

Comments

@petermattis
Collaborator

Preemptive snapshots were disabled due to raft corruption issues. With #7619 fixed we should revisit.

The lack of preemptive snapshots makes the replica reservation system a lot less useful. We can see this in TestMultinodeCockroach, which usually finishes within a few seconds but sometimes takes 30-40s. The slow runs are caused by a reservation being declined because there are too many outstanding reservations. The default max outstanding reservations is 5 and TestMultinodeCockroach creates 6 ranges. Without preemptive snapshots, replicateQueue.process effectively performs the replication asynchronously: the call to Replica.ChangeReplicas returns before the reservation is fulfilled. The intention is that Replica.ChangeReplicas blocks until the preemptive snapshot has been processed remotely. Without that feedback, we quickly iterate over the replicas creating reservations until we hit defaultMaxReservations, at which point processing fails and we have to wait until the next replica scanner iteration to replicate the range.

When reenabling preemptive snapshots, the sending of the snapshot in Replica.ChangeReplicas should be made synchronous. Currently we send the preemptive snapshot via the normal raft transport mechanism, but this only means the snapshot is queued up at the raft transport. The code then proceeds to issue the ChangeReplicas operation through Raft. Depending on timing, Raft might then request another snapshot. Regardless of this waste, sending the preemptive snapshot asynchronously violates the feedback loop for replication.

See also #6144

@petermattis petermattis added this to the Q3 milestone Jul 15, 2016
@petermattis petermattis added the S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting label Jul 21, 2016
bdarnell added a commit to bdarnell/cockroach that referenced this issue Aug 17, 2016
Introduce a synchronous mode for RaftTransport.

Abort the ChangeReplicas operation if the snapshot cannot be sent. This
required relaxing some of the sanity checks introduced in cockroachdb#7833 in order
to get all the tests passing reliably. Notably, a replica with an active
raft group can now accept preemptive snapshots.

Fixes cockroachdb#7819
Closes cockroachdb#8604

May address cockroachdb#8007, cockroachdb#8056, and cockroachdb#8594