storage: fix preemptive snapshots #7468

Merged

Conversation

petermattis
Collaborator

@petermattis commented Jun 24, 2016

Allow Replicas to be created which do not have their Raft replica ID
configured. These Replicas only allow snapshots to be applied. When
applying snapshots to these replicas, we initialize the raft group state
similarly to splitting a range.

Fixes #7372.


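A minimal sketch of the mechanism this description implies, using toy types rather than the actual storage.Replica (the names replica, replicaID, and checkMessage are illustrative): a replica created without a Raft replica ID accepts only snapshot messages.

    package main

    import (
        "fmt"

        "github.com/coreos/etcd/raft/raftpb"
    )

    // replica is a toy stand-in for storage.Replica. replicaID == 0 marks a
    // replica that exists only to receive a preemptive snapshot.
    type replica struct {
        replicaID uint64
    }

    // checkMessage mirrors the gating described above: until the replica has
    // a real replica ID, the only Raft message it may process is a snapshot.
    func (r *replica) checkMessage(typ raftpb.MessageType) error {
        if r.replicaID == 0 && typ != raftpb.MsgSnap {
            return fmt.Errorf("replica not yet part of the raft group; cannot handle %s", typ)
        }
        return nil
    }

    func main() {
        r := &replica{replicaID: 0}
        fmt.Println(r.checkMessage(raftpb.MsgSnap)) // <nil>: snapshots may be applied
        fmt.Println(r.checkMessage(raftpb.MsgApp))  // refused until the replica joins the group
    }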

@petermattis
Collaborator Author

Seems like this mostly works. I still need to track down the TestReplicateAfterTruncation failure. And I also need to figure out whether I can write a test that preemptive snapshots are actually being applied correctly (currently I've only verified that they are via logs).

@tamird
Contributor

tamird commented Jun 24, 2016

I went down a similar road investigating the handling of zero replica IDs -- it's a mess, and this change is making it worse. What are the semantics around the zero ID now? It's allowed, but only on snapshots (and means pre-emptive snapshot), but also it's required when the replica is not already initialized, because...reasons.

I don't want to hold this change up, but we need to clarify this situation. Reading this diff I can't tell that it's correct because it's delaying or suppressing errors, seemingly at random.


Reviewed 3 of 3 files at r1.
Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks pending.


storage/replica_raftstorage.go, line 604 [r1] (raw file):

      ms, err := writeInitialState(batch, s.Stats, desc)
      if err != nil {
          return 0, errors.Errorf("unable to write initial state: %s", err)

errors.Wrap
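
For readers following along, the suggestion preserves the error's cause instead of flattening it into a new string. A self-contained illustration using github.com/pkg/errors (assuming that is the errors package imported here):

    package main

    import (
        "fmt"
        "io"

        "github.com/pkg/errors"
    )

    func main() {
        err := io.ErrUnexpectedEOF

        // Errorf flattens the underlying error into a new string...
        flat := errors.Errorf("unable to write initial state: %s", err)
        // ...while Wrap records it as the cause, recoverable later.
        wrapped := errors.Wrap(err, "unable to write initial state")

        fmt.Println(flat)    // unable to write initial state: unexpected EOF
        fmt.Println(wrapped) // unable to write initial state: unexpected EOF
        fmt.Println(errors.Cause(wrapped) == err) // true
        fmt.Println(errors.Cause(flat) == err)    // false: the cause was lost
    }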


storage/store.go, line 2084 [r1] (raw file):

  r, err := s.getOrCreateReplicaLocked(req.GroupID, req.ToReplica.ReplicaID,
      req.FromReplica)
  uninitialized := err != nil || r.mu.replicaID == 0

i think this is equivalent to uninitialized := r.mu.replicaID == 0



@petermattis
Collaborator Author

I think this change will need to wait for @bdarnell's input regardless, so no worries about holding it up.

Yes, the semantics of the replica ID 0 are that it is allowed only on snapshots and we only apply them preemptively if the replica is not part of the raft group yet.

This isn't a large change. Which errors are you worried about being suppressed? Note that calling `withRaftGroupLocked` when `r.mu.replicaID == 0` previously resulted in a panic. So that wasn't a code path that was occurring before (or we'd be seeing panics in tests and in the beta cluster).

I'm still learning about the intricacies of the raft layer. It is very likely I'm doing something silly here (as witnessed by some of my recent PRs).


Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


storage/replica_raftstorage.go, line 604 [r1] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

errors.Wrap

Done.

storage/store.go, line 2084 [r1] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

i think this is equivalent to uninitialized := r.mu.replicaID == 0

It isn't. If `err != nil` then `r == nil` and we don't want to dereference `r`.
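
The distinction comes down to short-circuit evaluation; a generic illustration with toy names, not the store code:

    package main

    import "fmt"

    type replica struct{ replicaID uint64 }

    func getOrCreate(fail bool) (*replica, error) {
        if fail {
            return nil, fmt.Errorf("no such replica")
        }
        return &replica{}, nil
    }

    func main() {
        r, err := getOrCreate(true) // returns (nil, error)

        // Safe: when err != nil, || short-circuits and r is never dereferenced.
        uninitialized := err != nil || r.replicaID == 0
        fmt.Println(uninitialized) // true

        // The "equivalent" form would panic with a nil pointer dereference:
        // uninitialized = r.replicaID == 0
    }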


@tamird
Contributor

tamird commented Jun 24, 2016

The complication I was referring to is the additional meaning of replica ID zero which was added in 22bbc91.


Review status: all files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.


storage/replica_raftstorage.go, line 604 [r1] (raw file):

Previously, petermattis (Peter Mattis) wrote…

Done.

?

storage/store.go, line 2084 [r1] (raw file):

Previously, petermattis (Peter Mattis) wrote…

It isn't. If err != nil then r == nil and we don't want to dereference r.

i think it should move below the existing error check. looks like it's up here because you wanted to reuse the same log line, but that long line is awkward with a nil error anyway, and the comment explaining it is somewhat misleading.

storage/store.go, line 2104 [r1] (raw file):

  }

  if uninitialized && (req.Message.Type == raftpb.MsgSnap) {

spurious parens



@petermattis force-pushed the pmattis/preemptive-snapshots branch from eb2a486 to e6b7d66 on June 24, 2016 19:00
@petermattis
Collaborator Author

Review status: 2 of 3 files reviewed at latest revision, 3 unresolved discussions, some commit checks pending.


storage/replica_raftstorage.go, line 604 [r1] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

?

Oops, uploaded now.

storage/store.go, line 2084 [r1] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

i think it should move below the existing error check. looks like it's up here because you wanted to reuse the same log line, but that long line is awkward with a nil error anyway, and the comment explaining it is somewhat misleading.

It's up here because I wanted to be holding `r.mu`, but I just noticed I'm not holding `r.mu`. Doh!

Maybe this should be `req.ToReplica.ReplicaID`.


storage/store.go, line 2104 [r1] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

spurious parens

I wasn't aware we had a coding guideline about that.


@petermattis force-pushed the pmattis/preemptive-snapshots branch from e6b7d66 to ec54e07 on June 24, 2016 19:10
@petermattis
Collaborator Author

Review status: 1 of 3 files reviewed at latest revision, 3 unresolved discussions, some commit checks pending.


storage/store.go, line 2104 [r1] (raw file):

Previously, petermattis (Peter Mattis) wrote…

I wasn't aware we had a coding guideline about that.

Done.

I reworked the conditions here. Should be more palatable.



@tamird
Contributor

tamird commented Jun 24, 2016

Reviewed 2 of 2 files at r2.
Review status: all files reviewed at latest revision, all discussions resolved, some commit checks pending.



@petermattis
Collaborator Author

Yes, this changes the semantics of replica ID 0. Definitely need @bdarnell to weigh in on whether this is kosher or whether another approach would be better (e.g. a new field indicating that a preemptive snapshot has been applied).


Review status: all files reviewed at latest revision, all discussions resolved, some commit checks failed.



@tbg
Member

tbg commented Jun 24, 2016

Haven't reviewed closely, but I share @tamird's worry that we're putting more complications and random exceptions into code paths that are already pretty tangled up and hard to reason about for anyone but @bdarnell. That said, I'm glad that we've got ~4 pairs of eyes on this part of the system now. I removed a bunch of complexity in #7310 (at the price of a missed release, sadly) and follow-ups, and I hope that this trend continues and that the complexity which must remain is more explicitly called out and documented. For example, I had to think about whether a snapshot contained a HardState, similar to how during #7310 I had to puzzle together what Raft state is written when, and in which situations. We should document that stuff more, and especially the reasoning behind it.


Review status: all files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


storage/replica_raftstorage.go, line 602 [r2] (raw file):

      // The replica is not part of the Raft group so we need to write the
      // initial raft state for the replica.
      ms, err := writeInitialState(batch, s.Stats, desc)

Discussed with @tamird yesterday but also wanted to throw this out here: There was some subtlety about having on-disk state without an associated in-memory replica (git blame the eager removal of stale on-disk data in the store startup loop for archeology; I don't remember what the problem was right now). Maybe this is negating that effort, maybe not. (Whichever it is, there may be something there that should be documented as well).



@petermattis
Collaborator Author

Review status: all files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


storage/replica_raftstorage.go, line 602 [r2] (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Discussed with @tamird yesterday but also wanted to throw this out here: There was some subtlety about having on-disk state without an associated in-memory replica (git blame the eager removal of stale on-disk data in the store startup loop for archeology; I don't remember what the problem was right now). Maybe this is negating that effort, maybe not. (Whichever it is, there may be something there that should be documented as well).

Something isn't correct here. I still haven't been able to track down the `TestReplicateAfterTruncation` failure, which is due to a discrepancy in the MVCC stats before and after the replication.


@petermattis force-pushed the pmattis/preemptive-snapshots branch 2 times, most recently from f9d4299 to df0526a on June 27, 2016 15:15
@petermattis
Collaborator Author

This is ready for another look. The tests all pass. As far as I can tell from logs the preemptive snapshots are working. I'm taking a look at whether we can unit test that preemptive snapshots are being applied. And I'm going to be reading through the commits @tamird and @tschottdorf mentioned. Definitely still a novice in this area of the code.


Review status: 0 of 4 files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.


storage/replica_raftstorage.go, line 602 [r2] (raw file):

Previously, petermattis (Peter Mattis) wrote…

Something isn't correct here. I still haven't been able to track down the TestReplicateAfterTruncation failure, which is due to a discrepancy in the MVCC stats before and after the replication.

The test failure was caused by an incorrectly synthesized raft hard state. Now I'm synthesizing it correctly (I think). But see the `TODO(bdarnell)`.


@tamird
Contributor

tamird commented Jun 27, 2016

Reviewed 4 of 4 files at r3.
Review status: all files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.



@petermattis
Collaborator Author

Btw, prior to this change the initial replication in a 3 node cluster was taking ~2-3 secs. After this change the initial replication is taking ~1 sec.


Review status: all files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.



@petermattis force-pushed the pmattis/preemptive-snapshots branch from df0526a to 4345dfb on June 27, 2016 16:00
@petermattis
Collaborator Author

I took a look through #4204 and didn't see anything worrisome about the new usage of r.mu.replicaID introduced by this PR. Added a comment to Replica.mu.replicaID.


Review status: 3 of 4 files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.



@tamird
Contributor

tamird commented Jun 27, 2016

To clarify, I wasn't specifically concerned about an interaction with #4204; I was pointing out that that PR introduced different special casing of the zero replica ID, which along with this PR contributes to the overall confusion around this field.

Anyway, :lgtm:


Reviewed 1 of 1 files at r4.
Review status: all files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.



@petermattis
Collaborator Author

Ok. I don't think there is a rush to merge this PR. Let's wait for @bdarnell to weigh in.


Review status: all files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.



@petermattis force-pushed the pmattis/preemptive-snapshots branch 4 times, most recently from 44072b0 to 87468e9 on June 29, 2016 12:06
@bdarnell
Contributor

bdarnell commented Jul 1, 2016

Related: #6144. It may be simpler in the long run to make these incomplete replicas a separate type, although as my comment below points out there are now multiple kinds of incomplete replicas.

:lgtm:


Review status: 2 of 4 files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.


storage/replica_raftstorage.go, line 604 [r5] (raw file):

      // correctly.
      //
      // TODO(bdarnell): Is this koshder? I'm concerned that the synthesized hard

Yes, this is fine. The important thing about the persisted value of HardState.Term is that it is updated whenever the node casts a vote (and never moves backwards), so it knows not to vote in the same term twice.
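
A sketch of how such a HardState might be synthesized under that invariant; the field choices are illustrative, not the PR's actual code:

    package main

    import (
        "fmt"

        "github.com/coreos/etcd/raft/raftpb"
    )

    // synthesizeHardState builds a HardState for a replica that has never cast
    // a vote, so taking the snapshot's term cannot move Term backwards past a
    // recorded vote.
    func synthesizeHardState(snap raftpb.Snapshot) raftpb.HardState {
        return raftpb.HardState{
            Term:   snap.Metadata.Term,  // safe: no votes cast in any earlier term
            Vote:   0,                   // no vote recorded
            Commit: snap.Metadata.Index, // snapshot contents are committed by definition
        }
    }

    func main() {
        snap := raftpb.Snapshot{
            Metadata: raftpb.SnapshotMetadata{Index: 10, Term: 5},
        }
        fmt.Printf("%+v\n", synthesizeHardState(snap))
    }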


storage/store.go, line 2118 [r5] (raw file):

  if req.ToReplica.ReplicaID == 0 {
      if req.Message.Type == raftpb.MsgSnap {
          // Allow snapshots to be applied to uninitialized replicas (i.e. replicas

There are two ways a replica can be uninitialized: it may not know its start and end keys, and it may not have a replica ID. Replica.IsInitialized() means the former, so we shouldn't use the word "initialized" here to mean the latter (unless we can come up with better names for these two states)
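
To make the two states concrete, a toy struct (not the real Replica; field and method names are illustrative):

    package replicasketch

    // Two orthogonal notions of "incomplete" replica, per the comment above.
    type replica struct {
        startKey, endKey []byte // set once a snapshot or split supplies the range bounds
        replicaID        uint64 // nonzero once the replica is a member of the raft group
    }

    // isInitialized is the Replica.IsInitialized() sense: do we know our keys?
    func (r *replica) isInitialized() bool { return len(r.endKey) > 0 }

    // hasReplicaID is the other sense: have we been given a raft replica ID?
    func (r *replica) hasReplicaID() bool { return r.replicaID != 0 }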



@petermattis
Collaborator Author

Review status: 2 of 4 files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.


storage/replica_raftstorage.go, line 604 [r5] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Yes, this is fine. The important thing about the persisted value of HardState.Term is that it is updated whenever the node casts a vote (and never moves backwards), so it knows not to vote in the same term twice.

`koshder`? My spelling has deteriorated.

Ok, I've removed the TODO.


storage/store.go, line 2118 [r5] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

There are two ways a replica can be uninitialized: it may not know its start and end keys, and it may not have a replica ID. Replica.IsInitialized() means the former, so we shouldn't use the word "initialized" here to mean the latter (unless we can come up with better names for these two states)

How about: `Allow snapshots to be applied to replicas before they are members of the raft group (i.e. replicas with an ID of 0). This is the only operation that can be performed on a replica before it is part of the raft group.`


@tbg
Member

tbg commented Jul 1, 2016

Review status: 2 of 4 files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.


storage/replica_raftstorage.go, line 602 [r2] (raw file):

Previously, petermattis (Peter Mattis) wrote…

The test failure was caused by an incorrectly synthesized raft hard state. Now I'm synthesizing it correctly (I think). But see the TODO(bdarnell).

@bdarnell, could you weigh in regarding my comment above? I'm not sure there's a connection, but want to make sure.


@bdarnell
Contributor

bdarnell commented Jul 1, 2016

Review status: 2 of 4 files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


storage/replica_raftstorage.go, line 602 [r2] (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

@bdarnell, could you weigh in regarding my comment above? I'm not sure there's a connection, but want to make sure.

Shouldn't be an issue. The problem before was when data existed on disk without an in-memory `Replica` object, but we have one here when we apply the snapshot.

storage/store.go, line 2118 [r5] (raw file):

Previously, petermattis (Peter Mattis) wrote…

How about: Allow snapshots to be applied to replicas before they are members of the raft group (i.e. replicas with an ID of 0). This is the only operation that can be performed on a replica before it is part of the raft group.

SGTM


@tbg
Member

tbg commented Jul 1, 2016

Review status: 2 of 4 files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


storage/replica_raftstorage.go, line 602 [r2] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Shouldn't be an issue. The problem before was when data existed on disk without an in-memory Replica object, but we have one here when we apply the snapshot.

I see. And if we crash at an awkward moment, the partial state will be removed eagerly when the Store starts, so no issue there and all is well. Right?


@petermattis force-pushed the pmattis/preemptive-snapshots branch from 63b67e6 to 74388f3 on July 1, 2016 20:57
@bdarnell
Contributor

bdarnell commented Jul 1, 2016

Review status: 1 of 4 files reviewed at latest revision, 2 unresolved discussions, some commit checks pending.


storage/replica_raftstorage.go, line 602 [r2] (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I see. And if we crash at an awkward moment, the partial state will be removed eagerly when the Store starts, so no issue there and all is well. Right?

What awkward moment? It's all done in one batch; there shouldn't be partial state.
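
The atomicity being relied on, sketched against a hypothetical key-value batch interface (not CockroachDB's actual engine API):

    package snapshotsketch

    // batch is a hypothetical write batch: everything in it commits atomically,
    // so a crash can never persist the snapshot data without its HardState.
    type batch interface {
        Put(key, value []byte)
        Commit() error
    }

    func applySnapshot(b batch, snapData map[string][]byte, hardState []byte) error {
        for k, v := range snapData {
            b.Put([]byte(k), v)
        }
        // Written in the same batch as the data: there is no window in which
        // one exists without the other.
        b.Put([]byte("raft-hard-state"), hardState)
        return b.Commit() // single atomic commit point
    }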


@petermattis merged commit 9affd25 into cockroachdb:master on Jul 1, 2016
@petermattis deleted the pmattis/preemptive-snapshots branch on July 1, 2016 21:44
@tamird
Contributor

tamird commented Jul 1, 2016

Reviewed 1 of 2 files at r6, 2 of 2 files at r7.
Review status: all files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.



@tbg
Member

tbg commented Jul 1, 2016

I'm very much in favor of #6144. This code is too subtle, and leaving that subtlety implicit much longer can only lead to time wasted debugging.


Reviewed 2 of 4 files at r3, 2 of 2 files at r6, 2 of 2 files at r7.
Review status: all files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


storage/replica_raftstorage.go, line 602 [r2] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

What awkward moment? It's all done in one batch; there shouldn't be partial state.

I meant the awkward moment between receiving the snapshot and actually being part of the replica set.

storage/replica_raftstorage.go, line 603 [r7] (raw file):

      // hard state for the replica in order for the Raft state machine to start
      // correctly.
      if err := updateHardState(batch, s); err != nil {

I think this is related to #6991 (comment). Shouldn't we always be writing a HardState in the same batch?



@petermattis
Collaborator Author

Heh, #6144 asks for an uninitializedReplica which has a raft group but no snapshot. But this PR wants a new unattachedReplica which has a snapshot but is not attached to a raft group. Not sure if these additional replica types will help in practice. Can someone more knowledgeable sketch out how the store would contain maps to these replica types?


Review status: all files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.



@bdarnell
Contributor

bdarnell commented Jul 3, 2016

Yeah, on further reflection #6144 is unrelated. These non-member-replicas do need to be more-or-less full-fledged replicas so they can receive log entries and process their addition to the group. I'm not seeing a good way to clean this up.

I meant the awkward moment between receiving the snapshot and actually being part of the replica set.

That should be OK. The replica data will be there but if we're not a member it will be destroyed on restart. Then if we are still a member, raft will send a (non-preemptive) snapshot to bring the replica back.

@bdarnell mentioned this pull request on Jul 7, 2016