
storage: eager GC for replicas blocking snapshots #10426

Conversation

petermattis
Collaborator

@petermattis petermattis commented Nov 3, 2016

Consider the following situation on a Range X:

  • the range lives on nodes 1,2,3
  • gets rebalanced to nodes 1,2,4 while node 3 is down
  • node 3 restarts, but the (now removed) replica remains quiescent
  • the range splits
  • the right-hand side of the split gets rebalanced to nodes 1,2,3.

In order to receive a snapshot in the last step, node 3 needs to
have garbage-collected its old pre-split replica. If it weren't for node
3's downtime, this would normally happen eagerly, as its peers would inform
it of its fate. In this scenario, however, one would have to wait until
a) a client request creates the Raft group (which is unlikely, as the
replica isn't being addressed any more) or b) a queue picks it up (which
can take a long time).

Instead, when we return an overlap error and the overlapping replica
appears to be inactive, we add it to the GC queue (which in turn activates
the replica). For the situation above, we could also hope to merely create
the Raft group (as that would likely, but not necessarily, bring it into
contact with its ex-peers). However, in larger clusters none of the old
members may still be around, so going to the GC queue directly is the
better option.

Closes #8942
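
To make the mechanism concrete, here is a minimal, self-contained sketch of the idea (toy types and an invented inactivity check, not the actual store.go change): on overlap, an apparently inactive replica is eagerly handed to the GC queue rather than waiting for a client request or a background scan to wake it up.

  package main

  import "fmt"

  // replica and gcQueue are toy stand-ins for the real storage types.
  type replica struct {
      rangeID  int
      lastTick int // toy proxy for "recent Raft activity"
  }

  type gcQueue struct{ pending []*replica }

  func (q *gcQueue) add(r *replica) { q.pending = append(q.pending, r) }

  // handleOverlappingSnapshot mirrors the shape of the change: on overlap,
  // eagerly enqueue an apparently inactive replica for GC and report the
  // rejection to the sender.
  func handleOverlappingSnapshot(q *gcQueue, ex *replica, now int) error {
      msg := "snapshot intersects existing range"
      if ex != nil && now-ex.lastTick > 10 { // toy inactivity check
          q.add(ex)
          msg += "; initiated GC:"
      }
      return fmt.Errorf("%s %v", msg, ex) // ex can be nil
  }

  func main() {
      q := &gcQueue{}
      stale := &replica{rangeID: 1, lastTick: 0}
      fmt.Println(handleOverlappingSnapshot(q, stale, 100))
      fmt.Println("replicas queued for GC:", len(q.pending))
  }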



@petermattis
Collaborator Author

This is based on @tschottdorf's #8944, with the primary difference being the addition of a test exercising the scenario described in the commit message.

@tamird
Contributor

tamird commented Nov 3, 2016

Reviewed 2 of 2 files at r1.
Review status: all files reviewed at latest revision, 7 unresolved discussions, some commit checks failed.


pkg/storage/client_raft_test.go, line 687 at r1 (raw file):

  defer leaktest.AfterTest(t)()

  sc := storage.TestStoreConfig()

this needs a rebase


pkg/storage/client_raft_test.go, line 688 at r1 (raw file):

  sc := storage.TestStoreConfig()
  sc.TestingKnobs.DisableReplicaGCQueue = true

deserves a comment
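
For illustration, the knob line with a comment might read roughly as follows; the comment wording is a guess at intent, not the text that was actually added in the revision:

  sc := storage.TestStoreConfig()
  // Disable the replica GC queue so the stale pre-split replica on the
  // restarted store can't be garbage-collected behind the test's back; the
  // test re-enables the queue once it wants the overlap to resolve.
  sc.TestingKnobs.DisableReplicaGCQueue = true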


pkg/storage/client_raft_test.go, line 692 at r1 (raw file):

  mtc.Start(t, 3)
  defer mtc.Stop()
  rng1, err := mtc.stores[0].GetReplica(1)

s/rng/rep/


pkg/storage/client_raft_test.go, line 745 at r1 (raw file):

  expected := "snapshot intersects existing range"
  if err := replicateRange2(); !testutils.IsError(err, expected) {
      t.Fatalf("expected error, but found %v", err)

"unexpected error %v"?


pkg/storage/client_raft_test.go, line 747 at r1 (raw file):

      t.Fatalf("expected error, but found %v", err)
  }
  mtc.stores[2].SetReplicaGCQueueActive(true)

comment?


pkg/storage/store.go, line 3516 at r1 (raw file):

                  log.Errorf(ctx, "%s: unable to add replica to GC queue: %s", exReplica, err)
              } else {
                  msg += "; initiated GC: "

remove space after the colon


pkg/storage/store.go, line 3520 at r1 (raw file):

          }
      }
      return nil, errors.Errorf("%s %v", msg, exReplica) // exReplica can be nil

verb for exReplica shouldn't have changed, should be %s still as it is elsewhere


Comments from Reviewable

@bdarnell
Contributor

bdarnell commented Nov 3, 2016

:lgtm:


Review status: all files reviewed at latest revision, 7 unresolved discussions, some commit checks failed.


Comments from Reviewable

@tbg
Member

tbg commented Nov 3, 2016

I think you should make this close #8942.
:LGTM:


Review status: all files reviewed at latest revision, 9 unresolved discussions, some commit checks failed.


pkg/storage/client_raft_test.go, line 684 at r1 (raw file):

}

func TestReplicateAfterRemoveAndSplit(t *testing.T) {

Brief synopsis of the test would be good.


pkg/storage/client_raft_test.go, line 715 at r1 (raw file):

  mtc.restartStore(2)

  replicateRange2 := func() error {

Shouldn't you look up the right hand side of the split's RangeID instead of making the assumption that it is going to be 2?
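
Something along these lines, perhaps, assuming a key-based lookup such as Store.LookupReplica is available (the helper name, its signature, and splitKey, which stands for the key the test split at, are assumptions for this sketch; the merged test may differ):

  // Resolve the right-hand side of the split by key instead of
  // hard-coding RangeID 2.
  rhsRepl := mtc.stores[0].LookupReplica(roachpb.RKey(splitKey), nil)
  if rhsRepl == nil {
      t.Fatal("could not find the right-hand side replica after the split")
  }
  rhsRangeID := rhsRepl.RangeID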


Comments from Reviewable

@petermattis petermattis force-pushed the pmattis/eager-gc-replica-blocking-snapshots branch from 6f66401 to 70ef8d5 Compare November 3, 2016 12:37
@petermattis
Collaborator Author

Review status: 0 of 2 files reviewed at latest revision, 9 unresolved discussions.


pkg/storage/client_raft_test.go, line 684 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Brief synopsis of the test would be good.

Done.

pkg/storage/client_raft_test.go, line 687 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

this needs a rebase

Done.

pkg/storage/client_raft_test.go, line 688 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

deserves a comment

Done.

pkg/storage/client_raft_test.go, line 692 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

s/rng/rep/

Done.

pkg/storage/client_raft_test.go, line 715 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Shouldn't you look up the right hand side of the split's RangeID instead of making the assumption that it is going to be 2?

Done.

pkg/storage/client_raft_test.go, line 745 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

"unexpected error %v"?

Done.

pkg/storage/client_raft_test.go, line 747 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

comment?

Done.

pkg/storage/store.go, line 3516 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

remove space after the colon

Done.

pkg/storage/store.go, line 3520 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

verb for exReplica shouldn't have changed, should be %s still as it is elsewhere

As the comment explains, `exReplica` can be `nil` which means we should use `%v`, right?
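
For context, this is the general fmt behavior at play when the argument may be nil, shown with a standalone toy (the concrete type of exReplica in store.go may behave somewhat differently):

  package main

  import "fmt"

  func main() {
      var ex fmt.Stringer       // nil interface, standing in for a possibly-nil value
      fmt.Printf("%s\n", ex)    // prints %!s(<nil>)
      fmt.Printf("%v\n", ex)    // prints <nil>
  }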

Comments from Reviewable

@petermattis petermattis force-pushed the pmattis/eager-gc-replica-blocking-snapshots branch from 70ef8d5 to 6f42385 Compare November 3, 2016 12:38
@tbg
Member

tbg commented Nov 3, 2016

Reviewed 2 of 2 files at r1, 2 of 2 files at r2.
Review status: all files reviewed at latest revision, 7 unresolved discussions, some commit checks pending.


pkg/storage/store.go, line 3520 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

As the comment explains, exReplica can be nil which means we should use %v, right?

Looks right to me.

Comments from Reviewable

@petermattis petermattis force-pushed the pmattis/eager-gc-replica-blocking-snapshots branch from 6f42385 to 809149d Compare November 3, 2016 13:32
@petermattis petermattis merged commit 81eaef6 into cockroachdb:master Nov 3, 2016
@petermattis petermattis deleted the pmattis/eager-gc-replica-blocking-snapshots branch November 3, 2016 14:29
@tamird
Contributor

tamird commented Nov 3, 2016

:lgtm:


Reviewed 1 of 2 files at r2, 1 of 1 files at r3.
Review status: all files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


pkg/storage/store.go, line 3520 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Looks right to me.

Duh. Sorry about that.

Comments from Reviewable

benesch added a commit to benesch/cockroach that referenced this pull request Oct 18, 2018
If a store receives a snapshot that overlaps an existing replica, we
take it as a sign that the local replica may no longer be a member of
its range and queue it for processing in the replica GC queue.

When this code was added (cockroachdb#10426), the replica GC queue was quite
aggressive about processing replicas, and so the implementation was
careful to only queue a replica if it looked "inactive." Unfortunately,
this inactivity check rotted when epoch-based leases were introduced a
month later (cockroachdb#10305). An inactive replica with an epoch-based lease can
incorrectly be considered active, even if all of the other members of
the range have stopped sending it messages, because the epoch-based
lease will continue to be heartbeated by the node itself. (With an
expiration-based lease, the replica's local copy of the lease would
quickly expire if the other members of the range stopped sending it
messages.)

Fixing the inactivity check to work with epoch-based leases is rather
tricky. Quiescent replicas are nearly indistinguishable from
abandoned replicas. This commit just removes the inactivity check and
unconditionally queues replicas for GC if they intersect an incoming
snapshot. The replica GC queue check is relatively cheap (one or two
meta2 lookups), and overlapping snapshot situations are not expected to
last for very long.

Release note: None
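
Sketched very roughly, the simplified overlap path looks like the following (the queue name, Add signature, and priority constant are assumptions, not the actual diff):

  // The inactivity gate is gone: any existing replica that intersects an
  // incoming snapshot is handed to the replica GC queue, which performs the
  // cheap meta2 lookup to decide whether the replica is truly gone.
  if exReplica != nil {
      if _, err := s.replicaGCQueue.Add(exReplica, replicaGCPriorityCandidate); err != nil {
          log.Errorf(ctx, "%s: unable to add replica to GC queue: %s", exReplica, err)
      } else {
          msg += "; initiated GC:"
      }
  }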