
storage: eager GC for replicas blocking snapshots #10426

Conversation

petermattis
Collaborator

@petermattis petermattis commented Nov 3, 2016

Consider the following situation on a Range X:

  • the range lives on nodes 1,2,3
  • gets rebalanced to nodes 1,2,4 while node 3 is down
  • node 3 restarts, but the (now removed) replica remains quiescent
  • the range splits
  • the right-hand side of the split gets rebalanced to nodes 1,2,3.

In order to receive a snapshot in the last step, node 3 needs to
have garbage-collected its old pre-split replica. If it weren't for node
3's downtime, this would normally happen eagerly, as its peers would inform
it of its fate. In this scenario, however, one would have to wait until
a) a client request creates the Raft group (which is unlikely, as the
replica isn't being addressed any more) or b) a queue picks it up (which
can take a long time).

Instead, when we return an overlap error and the overlapping replica
appears to be inactive, we add it to the GC queue (which in turn activates
the replica). For the situation above, we could also hope to merely create
the Raft group (as that would likely, but not necessarily, bring it into
contact with its ex-peers). However, in larger clusters none of the old
members may still be around, so going to the GC queue directly is the
better option.

Closes #8942
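
To make the mechanism concrete, here is a minimal, self-contained sketch of the idea (toy types and an invented inactivity check, not the actual store.go change): on overlap, an apparently inactive replica is eagerly handed to the GC queue rather than waiting for a client request or a background scan to wake it up.

  package main

  import "fmt"

  // replica and gcQueue are toy stand-ins for the real storage types.
  type replica struct {
      rangeID  int
      lastTick int // toy proxy for "recent Raft activity"
  }

  type gcQueue struct{ pending []*replica }

  func (q *gcQueue) add(r *replica) { q.pending = append(q.pending, r) }

  // handleOverlappingSnapshot mirrors the shape of the change: on overlap,
  // eagerly enqueue an apparently inactive replica for GC and report the
  // rejection to the sender.
  func handleOverlappingSnapshot(q *gcQueue, ex *replica, now int) error {
      msg := "snapshot intersects existing range"
      if ex != nil && now-ex.lastTick > 10 { // toy inactivity check
          q.add(ex)
          msg += "; initiated GC:"
      }
      return fmt.Errorf("%s %v", msg, ex) // ex can be nil
  }

  func main() {
      q := &gcQueue{}
      stale := &replica{rangeID: 1, lastTick: 0}
      fmt.Println(handleOverlappingSnapshot(q, stale, 100))
      fmt.Println("replicas queued for GC:", len(q.pending))
  }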



@petermattis
Collaborator Author

This is based on @tschottdorf's #8944, with the primary difference being the addition of a test exercising the scenario described in the commit message.

@tamird
Contributor

tamird commented Nov 3, 2016

Reviewed 2 of 2 files at r1.
Review status: all files reviewed at latest revision, 7 unresolved discussions, some commit checks failed.


pkg/storage/client_raft_test.go, line 687 at r1 (raw file):

  defer leaktest.AfterTest(t)()

  sc := storage.TestStoreConfig()

this needs a rebase


pkg/storage/client_raft_test.go, line 688 at r1 (raw file):

  sc := storage.TestStoreConfig()
  sc.TestingKnobs.DisableReplicaGCQueue = true

deserves a comment
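
For illustration, the knob line with a comment might read roughly as follows; the comment wording is a guess at intent, not the text that was actually added in the revision:

  sc := storage.TestStoreConfig()
  // Disable the replica GC queue so the stale pre-split replica on the
  // restarted store can't be garbage-collected behind the test's back; the
  // test re-enables the queue once it wants the overlap to resolve.
  sc.TestingKnobs.DisableReplicaGCQueue = true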


pkg/storage/client_raft_test.go, line 692 at r1 (raw file):

  mtc.Start(t, 3)
  defer mtc.Stop()
  rng1, err := mtc.stores[0].GetReplica(1)

s/rng/rep/


pkg/storage/client_raft_test.go, line 745 at r1 (raw file):

  expected := "snapshot intersects existing range"
  if err := replicateRange2(); !testutils.IsError(err, expected) {
      t.Fatalf("expected error, but found %v", err)

"unexpected error %v"?


pkg/storage/client_raft_test.go, line 747 at r1 (raw file):

      t.Fatalf("expected error, but found %v", err)
  }
  mtc.stores[2].SetReplicaGCQueueActive(true)

comment?


pkg/storage/store.go, line 3516 at r1 (raw file):

                  log.Errorf(ctx, "%s: unable to add replica to GC queue: %s", exReplica, err)
              } else {
                  msg += "; initiated GC: "

remove space after the colon


pkg/storage/store.go, line 3520 at r1 (raw file):

          }
      }
      return nil, errors.Errorf("%s %v", msg, exReplica) // exReplica can be nil

verb for exReplica shouldn't have changed, should be %s still as it is elsewhere


Comments from Reviewable

@bdarnell
Contributor

bdarnell commented Nov 3, 2016

:lgtm:


Review status: all files reviewed at latest revision, 7 unresolved discussions, some commit checks failed.


Comments from Reviewable

@tbg
Member

tbg commented Nov 3, 2016

I think you should make this close #8942.
:LGTM:


Review status: all files reviewed at latest revision, 9 unresolved discussions, some commit checks failed.


pkg/storage/client_raft_test.go, line 684 at r1 (raw file):

}

func TestReplicateAfterRemoveAndSplit(t *testing.T) {

Brief synopsis of the test would be good.


pkg/storage/client_raft_test.go, line 715 at r1 (raw file):

  mtc.restartStore(2)

  replicateRange2 := func() error {

Shouldn't you look up the right hand side of the split's RangeID instead of making the assumption that it is going to be 2?
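
Something along these lines, perhaps, assuming a key-based lookup such as Store.LookupReplica is available (the helper name, its signature, and splitKey, which stands for the key the test split at, are assumptions for this sketch; the merged test may differ):

  // Resolve the right-hand side of the split by key instead of
  // hard-coding RangeID 2.
  rhsRepl := mtc.stores[0].LookupReplica(roachpb.RKey(splitKey), nil)
  if rhsRepl == nil {
      t.Fatal("could not find the right-hand side replica after the split")
  }
  rhsRangeID := rhsRepl.RangeID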


Comments from Reviewable

@petermattis petermattis force-pushed the pmattis/eager-gc-replica-blocking-snapshots branch from 6f66401 to 70ef8d5 Compare November 3, 2016 12:37
@petermattis
Collaborator Author

Review status: 0 of 2 files reviewed at latest revision, 9 unresolved discussions.


pkg/storage/client_raft_test.go, line 684 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Brief synopsis of the test would be good.

Done.

pkg/storage/client_raft_test.go, line 687 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

this needs a rebase

Done.

pkg/storage/client_raft_test.go, line 688 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

deserves a comment

Done.

pkg/storage/client_raft_test.go, line 692 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

s/rng/rep/

Done.

pkg/storage/client_raft_test.go, line 715 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Shouldn't you look up the right hand side of the split's RangeID instead of making the assumption that it is going to be 2?

Done.

pkg/storage/client_raft_test.go, line 745 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

"unexpected error %v"?

Done.

pkg/storage/client_raft_test.go, line 747 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

comment?

Done.

pkg/storage/store.go, line 3516 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

remove space after the colon

Done.

pkg/storage/store.go, line 3520 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

verb for exReplica shouldn't have changed, should be %s still as it is elsewhere

As the comment explains, `exReplica` can be `nil` which means we should use `%v`, right?
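
For context, this is the general fmt behavior at play when the argument may be nil, shown with a standalone toy (the concrete type of exReplica in store.go may behave somewhat differently):

  package main

  import "fmt"

  func main() {
      var ex fmt.Stringer       // nil interface, standing in for a possibly-nil value
      fmt.Printf("%s\n", ex)    // prints %!s(<nil>)
      fmt.Printf("%v\n", ex)    // prints <nil>
  }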

Comments from Reviewable

@petermattis petermattis force-pushed the pmattis/eager-gc-replica-blocking-snapshots branch from 70ef8d5 to 6f42385 Compare November 3, 2016 12:38
@tbg
Member

tbg commented Nov 3, 2016

Reviewed 2 of 2 files at r1, 2 of 2 files at r2.
Review status: all files reviewed at latest revision, 7 unresolved discussions, some commit checks pending.


pkg/storage/store.go, line 3520 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

As the comment explains, exReplica can be nil which means we should use %v, right?

Looks right to me.

Comments from Reviewable

@petermattis petermattis force-pushed the pmattis/eager-gc-replica-blocking-snapshots branch from 6f42385 to 809149d Compare November 3, 2016 13:32
@petermattis petermattis merged commit 81eaef6 into cockroachdb:master Nov 3, 2016
@petermattis petermattis deleted the pmattis/eager-gc-replica-blocking-snapshots branch November 3, 2016 14:29
@tamird
Contributor

tamird commented Nov 3, 2016

:lgtm:


Reviewed 1 of 2 files at r2, 1 of 1 files at r3.
Review status: all files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


pkg/storage/store.go, line 3520 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Looks right to me.

Duh. Sorry about that.

Comments from Reviewable

benesch added a commit to benesch/cockroach that referenced this pull request Oct 18, 2018
If a store receives a snapshot that overlaps an existing replica, we
take it as a sign that the local replica may no longer be a member of
its range and queue it for processing in the replica GC queue.

When this code was added (cockroachdb#10426), the replica GC queue was quite
aggressive about processing replicas, and so the implementation was
careful to only queue a replica if it looked "inactive." Unfortunately,
this inactivity check rotted when epoch-based leases were introduced a
month later (cockroachdb#10305). An inactive replica with an epoch-based lease can
incorrectly be considered active, even if all of the other members of
the range have stopped sending it messages, because the epoch-based
lease will continue to be heartbeated by the node itself. (With an
expiration-based lease, the replica's local copy of the lease would
quickly expire if the other members of the range stopped sending it
messages.)

Fixing the inactivity check to work with epoch-based leases is rather
tricky. Quiescent replicas are nearly indistinguishable from
abandoned replicas. This commit just removes the inactivity check and
unconditionally queues replicas for GC if they intersect an incoming
snapshot. The replica GC queue check is relatively cheap (one or two
meta2 lookups), and overlapping snapshot situations are not expected to
last for very long.

Release note: None
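
Sketched very roughly, the simplified overlap path looks like the following (the queue name, Add signature, and priority constant are assumptions, not the actual diff):

  // The inactivity gate is gone: any existing replica that intersects an
  // incoming snapshot is handed to the replica GC queue, which performs the
  // cheap meta2 lookup to decide whether the replica is truly gone.
  if exReplica != nil {
      if _, err := s.replicaGCQueue.Add(exReplica, replicaGCPriorityCandidate); err != nil {
          log.Errorf(ctx, "%s: unable to add replica to GC queue: %s", exReplica, err)
      } else {
          msg += "; initiated GC:"
      }
  }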