Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: outside-of-raft snapshots put consistency in jeopardy #7619

Closed
tbg opened this issue Jul 5, 2016 · 4 comments
Closed

storage: outside-of-raft snapshots put consistency in jeopardy #7619

tbg opened this issue Jul 5, 2016 · 4 comments
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting
Milestone

Comments

@tbg
Copy link
Member

tbg commented Jul 5, 2016

When we send a snapshot around outside of Raft (preemptive snapshot), we don't check whether the state we're overwriting is ahead of us.
This could lead to a replica being created, acknowledging progress, and then being reset through a snapshot.

Perhaps canApplySnapshot is a good place to check that. The necessary information should be deducible from the respective HardStates.

@tbg tbg added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Jul 5, 2016
@bdarnell
Copy link
Contributor

bdarnell commented Jul 5, 2016

Specifically, there are some checks on the snapshot metadata in raft.restore which we don't do for preemptive snapshots. If there is data on disk we should check the term and index of the snapshot and drop it if it's older than what we already have.

Alternately, we could reorganize the checks in handleRaftMessage so we only call applySnapshot directly when we have no non-trivial local state, and pass the snapshot to r.raftGroup.Step otherwise to let raft make the decision.

@tbg
Copy link
Member Author

tbg commented Jul 6, 2016

I know this is assigned to @tamird, but it ties in with stuff for #7600, so I'm currently poking around in it too (let's avoid crossing streams).

@tamird
Copy link
Contributor

tamird commented Jul 6, 2016

OK. I already have a WIP for this:
master...tamird:preemptive-snap-no-backwards

On Wed, Jul 6, 2016 at 8:46 AM, Tobias Schottdorf [email protected]
wrote:

I know this is assigned to @tamird https://github.com/tamird, but it
ties in with stuff for #7600
#7600, so I'm currently
poking around in it too (let's avoid crossing streams).


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#7619 (comment),
or mute the thread
https://github.com/notifications/unsubscribe/ABdsPBHGYOn8jYtsE1SrIdPxJmMWgIQHks5qS6O9gaJpZM4JFQpb
.

tbg added a commit to tbg/cockroach that referenced this issue Jul 6, 2016
Ensure that a new HardState does not break promises made by an existing one
during snapshot application.
Update the HardState for both preemptive (where they're required both from an
updating and error checking perspective) and Raft-delivered snapshots (where it
must never error out and should never have to update anything since the
supplied HardState is fresh out of Raft).

Fixes cockroachdb#7619.
@tbg tbg assigned tbg and unassigned tamird Jul 6, 2016
@tbg
Copy link
Member Author

tbg commented Jul 6, 2016

Snagging this issue from you as believed fixed in #7598 (won't merge without a test for this added).

bdarnell added a commit to bdarnell/cockroach that referenced this issue Jul 7, 2016
tbg added a commit to tbg/cockroach that referenced this issue Jul 8, 2016
As discovered in
cockroachdb#6991 (comment),
it's possible that we apply a Raft snapshot without writing a corresponding
HardState since we write the snapshot in its own batch first and only then
write a HardState.

If that happens, the server is going to panic on restart: It will have a
nontrivial first index, but a committed index of zero (from the empty
HardState).

This change prevents us from applying a snapshot when there is no HardState
supplied along with it, except when applying a preemptive snapshot (in which
case we synthesize a HardState).

Ensure that a new HardState does not break promises made by an existing one
during preemptive snapshot application.

Fixes cockroachdb#7619.
tbg added a commit to tbg/cockroach that referenced this issue Jul 8, 2016
As discovered in
cockroachdb#6991 (comment),
it's possible that we apply a Raft snapshot without writing a corresponding
HardState since we write the snapshot in its own batch first and only then
write a HardState.

If that happens, the server is going to panic on restart: It will have a
nontrivial first index, but a committed index of zero (from the empty
HardState).

This change prevents us from applying a snapshot when there is no HardState
supplied along with it, except when applying a preemptive snapshot (in which
case we synthesize a HardState).

Ensure that a new HardState does not break promises made by an existing one
during preemptive snapshot application.

Fixes cockroachdb#7619.
tbg added a commit to tbg/cockroach that referenced this issue Jul 8, 2016
As discovered in
cockroachdb#6991 (comment),
it's possible that we apply a Raft snapshot without writing a corresponding
HardState since we write the snapshot in its own batch first and only then
write a HardState.

If that happens, the server is going to panic on restart: It will have a
nontrivial first index, but a committed index of zero (from the empty
HardState).

This change prevents us from applying a snapshot when there is no HardState
supplied along with it, except when applying a preemptive snapshot (in which
case we synthesize a HardState).

Ensure that a new HardState does not break promises made by an existing one
during preemptive snapshot application.

Fixes cockroachdb#7619.
tbg added a commit to tbg/cockroach that referenced this issue Jul 8, 2016
As discovered in
cockroachdb#6991 (comment),
it's possible that we apply a Raft snapshot without writing a corresponding
HardState since we write the snapshot in its own batch first and only then
write a HardState.

If that happens, the server is going to panic on restart: It will have a
nontrivial first index, but a committed index of zero (from the empty
HardState).

This change prevents us from applying a snapshot when there is no HardState
supplied along with it, except when applying a preemptive snapshot (in which
case we synthesize a HardState).

Ensure that a new HardState does not break promises made by an existing one
during preemptive snapshot application.

Fixes cockroachdb#7619.
tbg added a commit to tbg/cockroach that referenced this issue Jul 8, 2016
As discovered in
cockroachdb#6991 (comment),
it's possible that we apply a Raft snapshot without writing a corresponding
HardState since we write the snapshot in its own batch first and only then
write a HardState.

If that happens, the server is going to panic on restart: It will have a
nontrivial first index, but a committed index of zero (from the empty
HardState).

This change prevents us from applying a snapshot when there is no HardState
supplied along with it, except when applying a preemptive snapshot (in which
case we synthesize a HardState).

Ensure that the new HardState and Raft log does not break promises made by an
existing one during preemptive snapshot application.

Fixes cockroachdb#7619.

storage: prevent loss of uncommitted log entries
tbg added a commit to tbg/cockroach that referenced this issue Jul 11, 2016
As discovered in
cockroachdb#6991 (comment),
it's possible that we apply a Raft snapshot without writing a corresponding
HardState since we write the snapshot in its own batch first and only then
write a HardState.

If that happens, the server is going to panic on restart: It will have a
nontrivial first index, but a committed index of zero (from the empty
HardState).

This change prevents us from applying a snapshot when there is no HardState
supplied along with it, except when applying a preemptive snapshot (in which
case we synthesize a HardState).

Ensure that the new HardState and Raft log does not break promises made by an
existing one during preemptive snapshot application.

Fixes cockroachdb#7619.

storage: prevent loss of uncommitted log entries
@petermattis petermattis modified the milestone: Q3 Jul 11, 2016
@petermattis petermattis added the S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting label Jul 11, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting
Projects
None yet
Development

No branches or pull requests

4 participants