
server: allow healing ranges whose replicas are all lost #56333

Conversation

TheSamHuang
Contributor

@TheSamHuang TheSamHuang commented Nov 5, 2020

Introduces and implements a ResetQuorumRequest RPC. ResetQuorumRequest takes in the range ID of a range that is unavailable due to lost quorum and makes the range available again, at the cost of losing all of the data in that range. Any existing replica, even one residing on the target node, is irrevocably removed. ResetQuorumRequest first uses meta2 to identify the range descriptor. Then it removes all replicas from the range descriptor and adds a store from the target node as the one designated survivor replica. This change is then written to meta2 and sent as a snapshot to a store local to the target node, so that crdb-internal upreplication and rebalancing mechanisms can create further replicas from this fresh snapshot.

This RPC is meant to be called by the user directly via the CLI command being introduced in #57034. It will not work for ranges that have not lost quorum or for a meta range.
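For reference, a minimal sketch of how the RPC is invoked directly against a node, mirroring the test added in this PR; srv, ctx, and desc are assumed to come from the surrounding test or orchestration code:

// invoke ResetQuorum on a node, passing only the range ID of the unavailable
// range (the StoreID field was dropped from the request during review)
_, err := srv.Node().(*server.Node).ResetQuorum(
	ctx,
	&roachpb.ResetQuorumRequest{
		RangeID: int32(desc.RangeID),
	},
)
require.NoError(t, err)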

@TheSamHuang TheSamHuang requested a review from a team as a code owner November 5, 2020 18:01
@cockroach-teamcity
Member

This change is Reviewable

@TheSamHuang TheSamHuang force-pushed the 41411-allow-healing-ranges-with-all-replicas-lost branch 5 times, most recently from fef63fe to 0a9a71a on November 11, 2020 21:01
@TheSamHuang TheSamHuang requested a review from tbg November 11, 2020 21:01
pkg/server/node.go (outdated; resolved)
Contributor Author

@TheSamHuang TheSamHuang left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @tbg)

a discussion (no related file):
Should update RangeLog (best effort) as well. RangeLog is updated when a range changes (upreplicated, etc.); it's a system of record of operations / an audit log.

Should populate regular LogEntries as well.



pkg/kv/kvserver/store_snapshot.go, line 922 at r1 (raw file):

	eng := storage.NewInMem(ctx, roachpb.Attributes{}, 1<<20 /* cacheSize */)
	defer eng.Close()

Create engine to use as buffer for empty snapshot


pkg/kv/kvserver/store_snapshot.go, line 927 at r1 (raw file):

	var ms enginepb.MVCCStats
	if err := storage.MVCCPutProto(
		ctx, eng, &ms, keys.RangeDescriptorKey(desc.StartKey), now, nil /* txn */, &desc,

Seeding empty range


pkg/kv/kvserver/store_snapshot.go, line 931 at r1 (raw file):

		return err
	}
	ms, err := stateloader.WriteInitialReplicaState(

Seeding empty range


pkg/kv/kvserver/store_snapshot.go, line 944 at r1 (raw file):

	}

	sl := stateloader.Make(desc.RangeID)

Use a stateloader to load the state written to the in-memory engine above.


pkg/kv/kvserver/store_snapshot.go, line 963 at r1 (raw file):

		snapUUID,
		sl,
		SnapshotRequest_RAFT,

We may want a separate SnapshotRequest type for recovery that always goes through, bypassing all throttling so it cannot be declined. We don't want this operation to be held up behind a long-running snapshot; it should go through quickly.


pkg/kv/kvserver/store_snapshot.go, line 966 at r1 (raw file):

		eng,
		desc.RangeID,
		raftentry.NewCache(1),

Won't actually be used.


pkg/kv/kvserver/store_snapshot.go, line 967 at r1 (raw file):

		desc.RangeID,
		raftentry.NewCache(1),
		func(func(SideloadStorage) error) error { return nil },

Used for sstables, but not needed here as there are no logs.


pkg/kv/kvserver/store_snapshot.go, line 975 at r1 (raw file):

	}

	from := to

They're the same because we have to send this from somewhere; it makes it look like we are sending the snapshot from ourselves. We have to send it from a member of the RangeDescriptor, since a snapshot from a non-member would be malformed.


pkg/roachpb/api.proto, line 2201 at r1 (raw file):

}

message UnsafeHealRangeRequest {

TODO: Add comment

Pass in the updated RangeDescriptor (instead of only the RangeID) because this is a lower-level primitive; we are not expecting the user to call this directly.

Caller updates meta2 and then calls this RPC to update the snapshot

Two cases:
A. All replicas are lost
B. One or more replicas remaining, resuscitate range

If this request only took a range ID, we would have to look up the range descriptor. If recovering from an existing replica, we want to know where we are recovering from.

Higher level abstractions could hit meta2 or use a survivor for the range descriptor.

--

node_id and store_id identify the designated survivor, or the designated place where we are recreating the replica.

--
mention somewhere why we are covering both use cases in one RPC

--
Also, work with @tbg to explain how this can be safely used and how the orchestrator around this will work. In particular, restarting with a missing node vs a missing store.


pkg/roachpb/api.proto, line 2256 at r1 (raw file):

  rpc RangeFeed          (RangeFeedRequest)          returns (stream RangeFeedEvent)          {}
  rpc GossipSubscription (GossipSubscriptionRequest) returns (stream GossipSubscriptionEvent) {}
  rpc UnsafeHealRange (UnsafeHealRangeRequest) returns (UnsafeHealRangeResponse) {}

It does not make sense to return a new RangeDescriptor, because the higher-level controller computes it; it should be passed in instead.


pkg/server/node.go, line 1088 at r1 (raw file):

Previously, TheSamHuang (Sam Huang) wrote…

@tbg We discussed how we wanted the higher level abstraction to include the RangeDescriptor in the RPC call, rather than just the RangeID. However, I think we still have to scan through all ranges to find the expValue as we need it for CPut. Do you think it makes sense to have the higher level abstraction also pass in expValue?

The CPut would live at the caller, if necessary.

The low-level RPC should just send the snapshot. We want it to be as specific to the range as possible (nothing distributed); other logic should be done somewhere that has more insight into the state of the cluster.

(Add this comment to protobuf definition)


pkg/server/node.go, line 1109 at r1 (raw file):

rpc.DefaultClass

Use rpc.SystemClass to avoid throttling (add code comment)
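A hedged illustration of that suggestion; the dialer variable and call site are assumptions, the point is only that the connection is dialed on the system class:

// dial the target node on the system class so the recovery RPC is not subject
// to the throttling applied to default-class connections (sketch; dialer and
// nodeID are placeholders for whatever node.go already uses here)
conn, err := dialer.Dial(ctx, nodeID, rpc.SystemClass)
if err != nil {
	return err
}
_ = conn // use the connection as before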

Contributor

@knz knz left a comment

Reviewed 5 of 6 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @tbg, and @TheSamHuang)


pkg/kv/kvserver/raft_transport.go, line 661 at r1 (raw file):

	sent func(),
) error {

nit: remove this empty line


pkg/kv/kvserver/raft_transport.go, line 667 at r1 (raw file):

		}
	}()
	return sendSnapshot(

nit: maybe add one here


pkg/kv/kvserver/store_snapshot.go, line 914 at r1 (raw file):

// Formerly HackSendSnapshot
func SendEmptySnapshot(

request to add a descriptive comment above the function

also comments throughout the body to explain the main phases of the control flow


pkg/server/node.go, line 1058 at r1 (raw file):

}

func (n *Node) UnsafeHealRange(

Add a comment on top of this function.

Also suggestion to add comments throughout the function body, to indicate the general "phases" of the control flow.


pkg/server/node.go, line 1100 at r1 (raw file):

	for _, rd := range dead {
		desc.RemoveReplica(rd.NodeID, rd.StoreID)

nit: avoid stray newlines

@TheSamHuang TheSamHuang force-pushed the 41411-allow-healing-ranges-with-all-replicas-lost branch from 0a9a71a to 19ef42f on November 12, 2020 17:27
@TheSamHuang
Contributor Author

@tbg @knz Added first draft of a unit test for UnsafeHealRange.

make test PKG=./pkg/kv/kvserver TESTS=TestRecoverRangeWithNoReplicas                          
Running make with -j12
GOPATH set to /Users/samhuang/go
mkdir -p lib
rm -f lib/lib{geos,geos_c}.dylib
cp -L /Users/samhuang/go/native/x86_64-apple-darwin19.6.0/geos/lib/lib{geos,geos_c}.dylib lib
GOFLAGS= go test   -mod=vendor -tags ' make x86_64_apple_darwin19.6.0' -ldflags '-X github.com/cockroachdb/cockroach/pkg/build.typ=development -extldflags "" -X "github.com/cockroachdb/cockroach/pkg/build.tag=v20.2.0-alpha.1-4605-g19ef42fc96-dirty" -X "github.com/cockroachdb/cockroach/pkg/build.rev=19ef42fc9698d28eceb44c46ec4565e7b7b7daf1" -X "github.com/cockroachdb/cockroach/pkg/build.cgoTargetTriple=x86_64-apple-darwin19.6.0"  ' -run "TestRecoverRangeWithNoReplicas"  -timeout 30m ./pkg/kv/kvserver 
ok      github.com/cockroachdb/cockroach/pkg/kv/kvserver        12.524s

@tbg tbg requested a review from knz November 12, 2020 22:26
Member

@tbg tbg left a comment

👍 when you get to addressing the comments you posted from our group review, can you make sure to mark as done the ones that should be addressed? That helps your reviewers know what to expect (right now I'm assuming that none of them have been addressed; just check them off as you address them). Thanks!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @knz, @tbg, and @TheSamHuang)


pkg/kv/kvserver/recovery_test.go, line 21 at r3 (raw file):

)

func TestRecoverRangeWithNoReplicas(t *testing.T) {

Add a comment to this test explaining what it does: set up two node cluster, isolate a range to n2, stop n2, check that the range is unavailable, use UnsafeHealRange to resuscitate the range, check that it is available again.
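An abbreviated sketch of that flow, reusing only calls quoted elsewhere in this review (cluster setup, knobs, and the actual recovery call are elided):

// isolate the scratch range to n2, then stop n2
k := tc.ScratchRange(t)
desc, err := tc.AddReplicas(k, tc.Target(n2))
require.NoError(t, err)
require.NoError(t, tc.TransferRangeLease(desc, tc.Target(n2)))
// ... remove n1's replica so only n2 holds the range ...
tc.StopServer(n2)

// sanity check that requests to the scratch range now time out
cCtx, cancel := context.WithTimeout(ctx, 50*time.Millisecond)
defer cancel()
require.Error(t, srv.DB().Put(cCtx, k, "bar"))
require.Equal(t, context.DeadlineExceeded, cCtx.Err())

// resuscitate the range (UnsafeHealRange / ResetQuorum), then verify that a
// fresh write with a non-timed-out context succeeds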


pkg/kv/kvserver/recovery_test.go, line 32 at r3 (raw file):

	k := tc.ScratchRange(t)
	desc, err := tc.AddReplicas(k, tc.Target(1))

You need to update desc to refer to the post-change descriptor as we discussed.


pkg/kv/kvserver/recovery_test.go, line 36 at r3 (raw file):

	require.NoError(t, tc.TransferRangeLease(desc, tc.Target(1)))

	svr := tc.Server(0)

nit: we usually abbreviate srv


pkg/kv/kvserver/recovery_test.go, line 42 at r3 (raw file):

	// Sanity check that requests to the ScratchRange time out.
	cCtx, cancel := context.WithTimeout(ctx, 10000*time.Millisecond)

50*time.Millisecond


pkg/kv/kvserver/recovery_test.go, line 47 at r3 (raw file):

	require.Equal(t, context.DeadlineExceeded, cCtx.Err())

	// TODO: Figure out: is this necessary?

No, this is boilerplate for checking that UnsafeHealRange does a CPut, but it's intentionally no longer doing so


pkg/kv/kvserver/recovery_test.go, line 80 at r3 (raw file):

		return nil
	}); err != nil {
		// TODO: update this

you can require.NoError(t, srv.....)


pkg/kv/kvserver/recovery_test.go, line 85 at r3 (raw file):

	storeID, nodeID := store.Ident.StoreID, store.Ident.NodeID

	// TODO: Figure out: is this necessary?

Yes. UnsafeHealRangeRequest won't update the meta ranges, but that's what's used for routing requests, so you need to play orchestrator here and write them manually. I think we will get the existing code for updating the meta records in better shape so that we can use it here and then later in the orchestrator. It's basically this code:

// updateRangeAddressing overwrites the meta1 and meta2 range addressing
// records for the descriptor.
func updateRangeAddressing(b *kv.Batch, desc *roachpb.RangeDescriptor) error {
	return rangeAddressing(b, desc, putMeta)
}

You can call that here directly, see if that works. (srv has a DB() which can make a txn, and a txn can run a batch).
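A hedged sketch of what that looks like in the test, assuming access to the (unexported) updateRangeAddressing and to the updated descriptor desc:

// play orchestrator: rewrite the meta addressing records for the new
// descriptor in a transaction (srv.DB() makes the txn, the txn runs the batch)
require.NoError(t, srv.DB().Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
	b := txn.NewBatch()
	if err := updateRangeAddressing(b, &desc); err != nil {
		return err
	}
	return txn.Run(ctx, b)
}))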


pkg/kv/kvserver/recovery_test.go, line 97 at r3 (raw file):

	)
	require.NoError(t, err)
	log.Info(ctx, "snapshot applied")

This doesn't help much, just remove it


pkg/kv/kvserver/recovery_test.go, line 99 at r3 (raw file):

	log.Info(ctx, "snapshot applied")

	require.Error(t, svr.DB().Put(cCtx, k, "baz"))

But wait, we want there to be no error (and even if we expected an error, we would want to check the precise error and not just "any")

You're also using the timed-out ctx here, so your error is probably context cancelled

@tbg tbg self-requested a review November 12, 2020 22:26
@TheSamHuang TheSamHuang force-pushed the 41411-allow-healing-ranges-with-all-replicas-lost branch from 90bb033 to 1a714ef on November 16, 2020 18:08
Contributor Author

@TheSamHuang TheSamHuang left a comment

Absolutely! Thank you for the comments, Tobi and Rafael!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @knz, @tbg, and @TheSamHuang)


pkg/kv/kvserver/recovery_test.go, line 21 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Add a comment to this test explaining what it does: set up two node cluster, isolate a range to n2, stop n2, check that the range is unavailable, use UnsafeHealRange to resuscitate the range, check that it is available again.

Done.


pkg/kv/kvserver/recovery_test.go, line 42 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

50*time.Millisecond

Done.


pkg/kv/kvserver/recovery_test.go, line 47 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

No, this is boilerplate for checking that UnsafeHealRange does a CPut, but it's intentionally no longer doing so

Done.


pkg/kv/kvserver/recovery_test.go, line 80 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

you can require.NoError(t, srv.....)

Done.


pkg/kv/kvserver/recovery_test.go, line 97 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This doesn't help much, just remove it

Done.


pkg/kv/kvserver/recovery_test.go, line 99 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

But wait, we want there to be no error (and even if we expected an error, we would want to check the precise error and not just "any")

You're also using the timed-out ctx here, so your error is probably context cancelled

Done.

Contributor Author

@TheSamHuang TheSamHuang left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @knz, @tbg, and @TheSamHuang)


pkg/kv/kvserver/recovery_test.go, line 85 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Yes. UnsafeHealRangeRequest won't update the meta ranges, but that's what's used for routing requests, so you need to play orchestrator here and write them manually. I think we will get the existing code for updating the meta records in better shape so that we can use it here and then later in the orchestrator. It's basically this code:

// updateRangeAddressing overwrites the meta1 and meta2 range addressing
// records for the descriptor.
func updateRangeAddressing(b *kv.Batch, desc *roachpb.RangeDescriptor) error {
	return rangeAddressing(b, desc, putMeta)
}

You can call that here directly, see if that works. (srv has a DB() which can make a txn, and a txn can run a batch).

Got it, I added this in.
ok github.com/cockroachdb/cockroach/pkg/kv/kvserver 8.355s

However, since this test is in the kvserver_test package (putting it in kvserver leads to an import cycle), updateRangeAddressing is inaccessible because it is not exported. What are your thoughts on exporting updateRangeAddressing? I added a temporary workaround for the time being.

Contributor Author

@TheSamHuang TheSamHuang left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @knz, @tbg, and @TheSamHuang)


pkg/kv/kvserver/recovery_test.go, line 32 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

You need to update desc to refer to the post-change descriptor as we discussed.

Done. desc is updated below.

@TheSamHuang TheSamHuang force-pushed the 41411-allow-healing-ranges-with-all-replicas-lost branch 5 times, most recently from 336e58e to 1c0d9f7 on November 17, 2020 19:49
Contributor Author

@TheSamHuang TheSamHuang left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @knz, @tbg, and @TheSamHuang)


pkg/kv/kvserver/store_snapshot.go, line 914 at r1 (raw file):

Previously, knz (kena) wrote…

request to add a descriptive comment above the function

also comments throughout the body to explain the main phases of the control flow

Added some comments, please let me know what you think.


pkg/server/node.go, line 1058 at r1 (raw file):

Previously, knz (kena) wrote…

Add a comment on top of this function.

Also suggestion to add comments throughout the function body, to indicate the general "phases" of the control flow.

Added some comments, please let me know what you think.

@TheSamHuang TheSamHuang force-pushed the 41411-allow-healing-ranges-with-all-replicas-lost branch 2 times, most recently from 23ddc43 to b43412b on November 17, 2020 23:36
@irfansharif irfansharif self-requested a review November 26, 2020 17:09
Member

@tbg tbg left a comment

Looks good. We need to address some of the TODOs but it's getting close to merging.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @knz, @tbg, and @TheSamHuang)


pkg/kv/kvserver/addressing.go, line 60 at r6 (raw file):

// TODO (thesamhuang): Remove this temporary export
func UpdateRangeAddressing(b *kv.Batch, desc *roachpb.RangeDescriptor) error {

You can put this into helpers_test.go, that way it's exported only during tests.


pkg/kv/kvserver/recovery_test.go, line 10 at r9 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Remove this empty line, it'll run into lint failures.

crlfmt is maybe not set up properly; instructions at https://cockroachlabs.slack.com/archives/C4VM697PC/p1540908498001700


pkg/kv/kvserver/recovery_test.go, line 112 at r9 (raw file):

//}

func TestResetQuorum(t *testing.T) {

Write a top-level comment like the other test had.


pkg/kv/kvserver/recovery_test.go, line 171 at r9 (raw file):

	_, err = srv.Node().(*server.Node).ResetQuorum(
		ctx,
		&roachpb.ResetQuorumRequest{

Let's also rename this file to reset_quorum_test.go.


pkg/kv/kvserver/recovery_test.go, line 178 at r9 (raw file):

	require.NoError(t, err)

	t.Logf("resetting quorum complete")

Could you look into adding a few bells and whistles to this test:

  1. if there's an existing replica and we send the recovery snapshot to that replica, we still get the quorum reset and end up with an empty range.
  2. if there's an existing replica and we send the recovery snapshot to another node that doesn't have that replica, we also still get the reset, and replicaGC removes the original survivor when triggered.

I think the best way to write these is to factor out a helper from your existing test which takes as inputs the node count and the nodes that have a replica; a sketch follows below. That helper basically does everything up to (and excluding) the StopServer call in your test here.

Then the current test becomes:

tc := helper(t, 2 /* nodes */, 3 /* only n2 has a replica */ )
  1. becomes helper(t, 2, 1, 2), i.e. two nodes and they both have a replica, and then you'll stop n2 and send to n1
  2. becomes helper(t, 3, 1, 2), i.e. three nodes with n1 and n2 holding a replica; you stop n2 and send to n3. Then check that n1's replica goes away after a short time (you can trigger it by getting to the *Store and calling MustForceReplicaGCScanAndProcess); after that store.LookupReplica should come back empty.

Btw, I'd suggest doing that in a follow-up PR. I'd like to see this one merged soon. But I'd like to see it done, as it gives us good confidence that the resulting cli command will work in a variety of useful situations.
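A rough sketch of what that helper could look like; the name, return values, and the LookupRangeOrFatal call are assumptions, removing replicas from nodes that should not hold one is elided (as in the existing test), and the caller still owns tc.Stopper().Stop:

// setupScratchRange starts a cluster with the given number of nodes and adds
// a replica of the scratch range to each zero-based index in replicaNodes.
func setupScratchRange(
	t *testing.T, nodes int, replicaNodes ...int,
) (*testcluster.TestCluster, roachpb.RangeDescriptor, roachpb.Key) {
	tc := testcluster.StartTestCluster(t, nodes, base.TestClusterArgs{
		ReplicationMode: base.ReplicationManual,
	})
	k := tc.ScratchRange(t)
	desc := tc.LookupRangeOrFatal(t, k) // assumed helper for the initial descriptor
	for _, n := range replicaNodes {
		if n == 0 {
			// The first node already holds a replica after ScratchRange.
			continue
		}
		var err error
		desc, err = tc.AddReplicas(k, tc.Target(n))
		require.NoError(t, err)
	}
	return tc, desc, k
}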


pkg/kv/kvserver/store_snapshot.go, line 963 at r9 (raw file):

Previously, irfansharif (irfan sharif) wrote…

nit: TODO(thesamhuang):

make this (tbg).


pkg/kv/kvserver/store_snapshot.go, line 971 at r9 (raw file):

		eng,
		desc.RangeID,
		raftentry.NewCache(1), // Cache is not used.

nit // cache is not used


pkg/kv/kvserver/store_snapshot.go, line 972 at r9 (raw file):

		desc.RangeID,
		raftentry.NewCache(1), // Cache is not used.
		func(func(SideloadStorage) error) error { return nil }, // This is used for sstables, not needed here as there are no logs.

ditto, inline comments are lower case w/o punctuation (kind of wish this wasn't a rule, but it is)


pkg/server/node.go, line 1040 at r9 (raw file):

	// Update range descriptor and update meta ranges for the descriptor.
	// Remove all (dead) replicas.

We're removing all replicas, not just the dead ones.


pkg/server/node.go, line 1045 at r9 (raw file):

		desc.RemoveReplica(rd.NodeID, rd.StoreID)
	}
	// Add current node as new replica.

this is still using req.StoreID, it should pick a store from the local node.


pkg/server/node.go, line 1049 at r9 (raw file):

	// We should increment the generation so that the various
	// caches will recognize this descriptor as newer. This
	// may not be necessary, but it can't hurt.

Remove this last sentence - it's necessary now. It wasn't that necessary in the test, but in general there are invariants around the descriptor that we need to uphold here, and one of them is that any replication change, split, or merge bumps the generation. Here, we are basically forcing a replication change.


pkg/server/node.go, line 1057 at r9 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Yup, I think it's necessary for safety.

+1
Note that you have to hold on to the original bytes of the descriptor (kvs[i].Value) for that.


pkg/server/node.go, line 1079 at r9 (raw file):

func (n *Node) unsafeHealRange(
	ctx context.Context, nodeID roachpb.NodeID, storeID roachpb.StoreID, desc roachpb.RangeDescriptor, to roachpb.ReplicaDescriptor,
) error {

Honestly what's here isn't even worth it as a separate method, just inline this at the caller.

@tbg tbg self-requested a review November 27, 2020 09:59
Contributor Author

@TheSamHuang TheSamHuang left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @knz, @tbg, and @TheSamHuang)


pkg/kv/kvserver/addressing.go, line 60 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

You can put this into helpers_test.go, that way it's exported only during tests.

I didn't add this, would you still like me to move it? The function I had previously added (and removed since it was no longer necessary) was UpdateRangeAddressing which just exported updateRangeAddressing.


pkg/kv/kvserver/raft_transport.go, line 651 at r9 (raw file):

Previously, irfansharif (irfan sharif) wrote…

This is a somewhat hollowed out helper; all it does is forward its args to sendSnapshot. We could simply inline the definition at the caller.

Done.


pkg/kv/kvserver/recovery_test.go, line 10 at r9 (raw file):

Previously, tbg (Tobias Grieger) wrote…

crlfmt maybe not properly set up, instructions at https://cockroachlabs.slack.com/archives/C4VM697PC/p1540908498001700

Done.


pkg/kv/kvserver/recovery_test.go, line 19 at r9 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Remember to delete this commented out test and repurpose the top-level comment for below.

Done.


pkg/kv/kvserver/recovery_test.go, line 112 at r9 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Nice test!

Thanks!


pkg/kv/kvserver/recovery_test.go, line 112 at r9 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Write a top-level comment like the other test had.

Done.


pkg/kv/kvserver/recovery_test.go, line 124 at r9 (raw file):

Previously, irfansharif (irfan sharif) wrote…

[nit] Define n1, n2 := 0, 1 and use those throughout for readability.

Done.


pkg/kv/kvserver/recovery_test.go, line 171 at r9 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Let's also rename this file to reset_quorum_test.go.

Done.


pkg/kv/kvserver/store_snapshot.go, line 908 at r9 (raw file):

Previously, irfansharif (irfan sharif) wrote…

We could just avoid mentioning the helper and talk in general terms what this function is responsible for.

Done.


pkg/kv/kvserver/store_snapshot.go, line 972 at r9 (raw file):

Previously, tbg (Tobias Grieger) wrote…

ditto, inline comments are lower case w/o punctuation (kind of wish this wasn't a rule, but it is)

Done.


pkg/kv/kvserver/store_snapshot.go, line 1026 at r9 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Add a comment here.

Done. Please let me know if this isn't correct/appropriate.


pkg/roachpb/api.proto, line 2093 at r9 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Rename all references to UnsafeHealRangeRequest.

Done.


pkg/roachpb/api.proto, line 2094 at r9 (raw file):

Previously, irfansharif (irfan sharif) wrote…

This comment needs to be updated. This RPC is now the entry point called by the user directly, and internally it does the heavy lifting around updating meta2 and retrieving store IDs, etc. I think blocks of this comment will find a better home elsewhere.

Done.


pkg/roachpb/api.proto, line 2112 at r9 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Remove this TODO for now, it's already well fleshed out.

Done.


pkg/server/node.go, line 1000 at r9 (raw file):

Previously, irfansharif (irfan sharif) wrote…
// ResetQuorum implements the roachpb.InternalServer interface.

Done.


pkg/server/node.go, line 1032 at r9 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Yup, good idea to have these. We could write super tiny tests for these, invoking this RPC directly and seeing that the right errors are returned.

I added the following but don't believe it's correct:

// Check range has actually lost quorum.
livenessMap := n.storeCfg.NodeLiveness.GetIsLiveMap()
available := desc.Replicas().CanMakeProgress(func(rDesc roachpb.ReplicaDescriptor) bool {
	return livenessMap[rDesc.NodeID].IsLive
})
if available {
	return nil, errors.Errorf("targeted range to recover has not lost quorum.")
}

When running the test case, available is true. Will be working on debugging this now, so please let me know if you have any pointers or suggestions. Thanks!


pkg/server/node.go, line 1039 at r9 (raw file):

Previously, irfansharif (irfan sharif) wrote…
// Update the range descriptor and update meta ranges for the descriptor, removing all (dead) replicas.

Done.


pkg/server/node.go, line 1040 at r9 (raw file):

Previously, tbg (Tobias Grieger) wrote…

We're removing all, not all dead ones.

Done.


pkg/server/node.go, line 1049 at r9 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Remove this last sentence - it's necessary now. It wasn't that necessary in the test, but in general there are invariants around the descriptor that we need to uphold here, and one of them is that any replication change, split, or merge bumps the generation. Here, we are basically forcing a replication change.

Done.


pkg/server/node.go, line 1057 at r9 (raw file):

Previously, tbg (Tobias Grieger) wrote…

+1
Note that you have to hold on to the original bytes of the descriptor (kvs[i].Value) for that.

Changed to CPut but running into an unexpected value error now. Will be trying to work this out.


pkg/server/node.go, line 1063 at r9 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Let's rename this lower level primitive as well, it's not "unsafe" anymore, is it? 😛

Done.


pkg/server/node.go, line 1070 at r9 (raw file):

Previously, irfansharif (irfan sharif) wrote…

There's no need to list out the args. Comments like these get stale quickly as things change, and it's repeating in no more succinct terms what's already visible below.

Done.


pkg/server/node.go, line 1071 at r9 (raw file):

Previously, irfansharif (irfan sharif) wrote…

This block is no longer applicable.

Done.


pkg/server/node.go, line 1076 at r9 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Talk instead about how we then rely on crdb internal upreplication/rebalancing mechanisms to create further replicas from this fresh snapshot.

Done.


pkg/server/node.go, line 1079 at r9 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Honestly what's here isn't even worth it as a separate method, just inline this at the caller.

Done.

@TheSamHuang TheSamHuang force-pushed the 41411-allow-healing-ranges-with-all-replicas-lost branch 2 times, most recently from e279e9b to 30cf688 on November 30, 2020 21:27
Contributor Author

@TheSamHuang TheSamHuang left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @knz, @tbg, and @TheSamHuang)


pkg/server/node.go, line 1032 at r9 (raw file):

Previously, TheSamHuang (Sam Huang) wrote…

I added the following but don't believe it's correct:

// Check range has actually lost quorum.
livenessMap := n.storeCfg.NodeLiveness.GetIsLiveMap()
available := desc.Replicas().CanMakeProgress(func(rDesc roachpb.ReplicaDescriptor) bool {
	return livenessMap[rDesc.NodeID].IsLive
})
if available {
	return nil, errors.Errorf("targeted range to recover has not lost quorum.")
}

When running the test case, available is true. Will be working on debugging this now, so please let me know if you have any pointers or suggestions. Thanks!

Decreasing LivenessDuration in the TestClusterArgs and waiting for n2's liveness to expire allows the test case to pass.


pkg/server/node.go, line 1057 at r9 (raw file):

Previously, TheSamHuang (Sam Huang) wrote…

Changed to CPut but running into an unexpected value error now. Will be trying to work this out.

Using expValue.TagAndDataBytes() instead seems to have solved the problem.
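For the record, a hedged sketch of the working CPut, assuming expValue is the roachpb.Value previously read from meta2 and that metaKey, desc, and txn are as in the surrounding function:

// compare-and-put the updated descriptor against the raw bytes we read
// earlier, so the write fails if meta2 changed underneath us
if err := txn.CPut(ctx, metaKey, &desc, expValue.TagAndDataBytes()); err != nil {
	return err
}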

Contributor

@irfansharif irfansharif left a comment

We still have to remove the inclusion of StoreID from the ResetQuorum request type. Tobi's prototyped it here, let's pick it up.

Should update RangeLog (best effort) as well. RangeLog is updated when a range is updated (upreplicated, etc.), it's a system of record of operations / audit log. Should populate regular LogEntries as well.

We're not planning to do these in this PR. Let's instead leave a TODO (addressed to myself or Tobi) so we don't forget.

Reviewed 2 of 4 files at r8, 1 of 6 files at r9, 6 of 7 files at r10, 2 of 2 files at r11.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @tbg, and @TheSamHuang)


pkg/kv/kvserver/reset_quorum_test.go, line 33 at r11 (raw file):

			Knobs: base.TestingKnobs{
				NodeLiveness: kvserver.NodeLivenessTestingKnobs{
					LivenessDuration: 3000 * time.Millisecond, // set duration so n2 liveness expires shortly after stopping.

[nit + arbitrary, questionable convention alert] we don't want a period after an inline comment. Could also simply elevate the comment to a proper one.


pkg/kv/kvserver/reset_quorum_test.go, line 65 at r11 (raw file):

	tc.StopServer(n2)
	// Wait for n2 liveness to expire.
	time.Sleep(3001 * time.Millisecond)

We can simply use the same constant (and let's define it explicitly).


pkg/kv/kvserver/store_snapshot.go, line 1026 at r9 (raw file):

Previously, TheSamHuang (Sam Huang) wrote…

Done. Please let me know if this isn't correct/appropriate.

// noopStorePool is a hollowed out StorePool that does not throttle. It's used in recovery scenarios
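A hedged sketch of what such a no-op pool might look like, assuming the only hook sendSnapshot needs from the store pool is a throttle callback:

// noopStorePool is a hollowed out StorePool that does not throttle. It's used
// in recovery scenarios.
type noopStorePool struct{}

// throttle implements the store-pool interface assumed by sendSnapshot; doing
// nothing here means recovery snapshots are never delayed or declined.
func (noopStorePool) throttle(throttleReason, string, roachpb.StoreID) {}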


pkg/kv/kvserver/store_snapshot.go, line 918 at r11 (raw file):

) error {
	// Create an engine to use as a buffer for the empty snapshot.
	eng := storage.NewInMem(ctx, roachpb.Attributes{}, 1<<20 /* cacheSize */)

eng := storage.NewDefaultInMem()


pkg/server/node.go, line 1039 at r9 (raw file):

Previously, TheSamHuang (Sam Huang) wrote…

Done.

We want a space after the "//"


pkg/server/node.go, line 1035 at r10 (raw file):

	}

	// Check range has actually lost quorum.

"Check that we've actually lost quorum."
"Check that we're not a metaX range"

@tbg tbg requested a review from irfansharif December 1, 2020 10:49
Member

@tbg tbg left a comment

Store ID still needs to be removed and the test updated to that effect, but this is pretty much there other than that!

Dismissed @knz from 2 discussions.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @TheSamHuang)


pkg/kv/kvserver/addressing.go, line 60 at r6 (raw file):

Previously, TheSamHuang (Sam Huang) wrote…

I didn't add this, would you still like me to move it? The function I had previously added (and removed since it was no longer necessary) was UpdateRangeAddressing which just exported updateRangeAddressing.

No, this is fine


pkg/kv/kvserver/recovery_test.go, line 178 at r9 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Could you look into adding a few bells and whistles to this test:

  1. if there's an existing replica and we send the recovery snapshot to that replica, we still get the quorum reset and end up with an empty range.
  2. if there's an existing replica and we send the recovery snapshot to another node that doesn't have that replica, we also still get the reset, and replicaGC removes the original survivor when triggered.

I think the best way to write these is to factor out a helper from your existing test which takes as inputs the node count and the nodes that have a replica. That helper basically does everything up to (and excluding) the StopServer call in your test here.

Then the current test becomes:

tc := helper(t, 2 /* nodes */, 3 /* only n2 has a replica */ )
  1. becomes helper(t, 2, 1, 2), i.e. two nodes and they both have a replica, and then you'll stop n2 and send to n1
  2. becomes helper(t, 3, 1, 2), i.e. three nodes with n1 and n2 holding a replica; you stop n2 and send to n3. Then check that n1's replica goes away after a short time (you can trigger it by getting to the *Store and calling MustForceReplicaGCScanAndProcess); after that store.LookupReplica should come back empty.

Btw, I'd suggest doing that in a follow-up PR. I'd like to see this one merged soon. But I'd like to see it done, as it gives us good confidence that the resulting cli command will work in a variety of useful situations.

You can close this comment if you add TODOs to your test.


pkg/kv/kvserver/reset_quorum_test.go, line 20 at r11 (raw file):

// TestResetQuorum tests the ResetQuorum method to restore quorum for a
// range whose replicas are all lost. It starts a cluster with two nodes,
// n1 and n2 with a range isolated to n2. Then, it stops n2 and checks

n1 and n2,


pkg/kv/kvserver/reset_quorum_test.go, line 21 at r11 (raw file):

// range whose replicas are all lost. It starts a cluster with two nodes,
// n1 and n2 with a range isolated to n2. Then, it stops n2 and checks
// the range is unavailable. Next, it updates the meta2 range descriptor

checks that the range


pkg/kv/kvserver/reset_quorum_test.go, line 22 at r11 (raw file):

// n1 and n2 with a range isolated to n2. Then, it stops n2 and checks
// the range is unavailable. Next, it updates the meta2 range descriptor
// entry and uses ResetQuorum to make the range accessible again.

It doesn't update meta2 any more; this is now done in ResetQuorum. But let's make this test read meta2 back and check the result.
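A hedged sketch of that read-back, reusing the metaKey already computed in the test; the exact assertions on the descriptor are left to the test:

// read the meta2 entry back after ResetQuorum and decode the descriptor; the
// test can then assert it lists exactly one replica, on the surviving store
kvRes, err := srv.DB().Get(ctx, metaKey)
require.NoError(t, err)
var updatedDesc roachpb.RangeDescriptor
require.NoError(t, kvRes.ValueProto(&updatedDesc))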


pkg/kv/kvserver/reset_quorum_test.go, line 57 at r11 (raw file):

	metaKey := keys.RangeMetaKey(desc.EndKey).AsRawKey()
	// Read the meta2 which we are about to update below while n2

which ResetQuorum will be updating


pkg/kv/kvserver/reset_quorum_test.go, line 65 at r11 (raw file):

Previously, irfansharif (irfan sharif) wrote…

We can simply use the same constant (and let's define it explicitly).

Can you avoid messing with the liveness duration + sleeps if you call, on n1, IncrementEpoch?


pkg/kv/kvserver/reset_quorum_test.go, line 97 at r11 (raw file):

		&roachpb.ResetQuorumRequest{
			RangeID: int32(desc.RangeID),
			StoreID: int32(store.StoreID()),

This shouldn't be here any more.


pkg/roachpb/api.proto, line 2095 at r11 (raw file):

// ResetQuorumRequest makes a range that is unavailable due to lost quorum
// available again. ResetQuorumRequest first uses meta2 to identify the

available again, at the cost of losing all of the data in the range. Any existing replica, even one residing on the target node, will irrevocably be removed.


pkg/roachpb/api.proto, line 2097 at r11 (raw file):

// available again. ResetQuorumRequest first uses meta2 to identify the
// range descriptor. Then, it removes all replicas from the range descriptor
// and adds the current node as the one designated survivor replica. This

a store from the target node


pkg/roachpb/api.proto, line 2098 at r11 (raw file):

// range descriptor. Then, it removes all replicas from the range descriptor
// and adds the current node as the one designated survivor replica. This
// change is then written to meta2 and sent as a snapshot to the current

to a store local to the target node.


pkg/server/node.go, line 1006 at r11 (raw file):

	defer func() {
		if r := recover(); r != nil {
			rErr = errors.Errorf("%v", r)

?

Contributor Author

@TheSamHuang TheSamHuang left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif, @tbg, and @TheSamHuang)


pkg/kv/kvserver/reset_quorum_test.go, line 20 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

n1 and n2,

Done.


pkg/kv/kvserver/reset_quorum_test.go, line 21 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

checks that the range

Done.


pkg/kv/kvserver/reset_quorum_test.go, line 33 at r11 (raw file):

Previously, irfansharif (irfan sharif) wrote…

[nit + arbitrary, questionable convention alert] we don't want a period after an inline comment. Could also simply elevate the comment to a proper one.

Done.


pkg/kv/kvserver/reset_quorum_test.go, line 57 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

which ResetQuorum will be updating

Done.


pkg/kv/kvserver/reset_quorum_test.go, line 97 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This shouldn't be here any more.

Done.


pkg/kv/kvserver/store_snapshot.go, line 1026 at r9 (raw file):

Previously, irfansharif (irfan sharif) wrote…

// noopStorePool is a hollowed out StorePool that does not throttle. It's used in recovery scenarios

Done.


pkg/kv/kvserver/store_snapshot.go, line 918 at r11 (raw file):

Previously, irfansharif (irfan sharif) wrote…

eng := storage.NewDefaultInMem()

Done.


pkg/roachpb/api.proto, line 2117 at r9 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Reminder to remove including a store_id parameter here.

Done.


pkg/roachpb/api.proto, line 2095 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

available again, at the cost of losing all of the data in the range. Any existing replica, even one residing on the target node, will irrevocably be removed.

Done.


pkg/roachpb/api.proto, line 2097 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

a store from the target node

Done.


pkg/roachpb/api.proto, line 2098 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

to a store local to the target node.

Done.


pkg/server/node.go, line 1039 at r9 (raw file):

Previously, irfansharif (irfan sharif) wrote…

We want a space after the "//"

Done.


pkg/server/node.go, line 1045 at r9 (raw file):

Previously, tbg (Tobias Grieger) wrote…

this is still using req.StoreID, it should pick a store from the local node.

Done.


pkg/server/node.go, line 1035 at r10 (raw file):

Previously, irfansharif (irfan sharif) wrote…

"Check that we've actually lost quorum."
"Check that we're not a metaX range"

Done.

@TheSamHuang TheSamHuang force-pushed the 41411-allow-healing-ranges-with-all-replicas-lost branch from 30cf688 to fe62de7 on December 1, 2020 18:17
Contributor

@irfansharif irfansharif left a comment

:lgtm: mod removing some debugging code you've added and adding the testing TODO. Thanks for taking it over the finish line; I'm pretty happy with where it all ended up!

Reviewed 5 of 5 files at r12.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @tbg and @TheSamHuang)


pkg/kv/kvserver/reset_quorum_test.go, line 65 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Can you avoid messing with the liveness duration + sleeps if you call, on n1, IncrementEpoch?

I don't think so; the IncrementEpoch operation also checks to see that the target node is non-live:

if liveness.IsLive(nl.clock.Now().GoTime()) {
	return errors.Errorf("cannot increment epoch on live node: %+v", liveness)
}


pkg/server/node.go, line 1006 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

?

Let's remove this debugging code.

@TheSamHuang TheSamHuang force-pushed the 41411-allow-healing-ranges-with-all-replicas-lost branch 3 times, most recently from 501da0c to e9f915e on December 1, 2020 22:55
Contributor Author

@TheSamHuang TheSamHuang left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif and @tbg)


pkg/kv/kvserver/recovery_test.go, line 178 at r9 (raw file):

Previously, tbg (Tobias Grieger) wrote…

You can close this comment if you add TODOs to your test.

Done.


pkg/kv/kvserver/reset_quorum_test.go, line 22 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

It doesn't update meta2 any more, this is now done in ResetQuorum. But let's make this test read meta2 and check the result.

Done.


pkg/server/node.go, line 1006 at r11 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Let's remove this debugging code.

Done.

@TheSamHuang TheSamHuang force-pushed the 41411-allow-healing-ranges-with-all-replicas-lost branch 5 times, most recently from fe765df to 2ed0016 on December 2, 2020 06:47
Introduces and implements a `ResetQuorumRequest` RPC. `ResetQuorumRequest`
takes in the range ID of a range that is unavailable due to lost quorum
and makes it available again, at the cost of losing all of the data in that
range. Any existing replica, even one residing on the target node, will
irrevocably be removed. ResetQuorumRequest first uses meta2 to identify the
range descriptor. Then, it removes all replicas from the range descriptor and
adds a store from the target node as the one designated survivor replica. This
change is then written to meta2 and sent as a snapshot to a store local to the
target node in order to use crdb internal upreplication and rebalancing
mechanisms to create further replicas from this fresh snapshot.

This RPC is meant to be called by the user directly. It will not work for
ranges that have not lost quorum or for a meta range.

Release note: None
@TheSamHuang TheSamHuang force-pushed the 41411-allow-healing-ranges-with-all-replicas-lost branch from 2ed0016 to c3c4c66 on December 2, 2020 07:04
@TheSamHuang
Contributor Author

bors r=irfansharif

@tbg
Member

tbg commented Dec 2, 2020

wohoo!

@craig
Contributor

craig bot commented Dec 2, 2020

Build succeeded:

@craig craig bot merged commit dbe2bc4 into cockroachdb:master Dec 2, 2020
craig bot pushed a commit that referenced this pull request Dec 3, 2020
56379: workload/schemachange: improve error screening r=ajwerner a=jayshrivastava

workload/schemachange: improve error screening

Several ops will be updated to screen for errors (#56119):
- dropTable
- dropView
- createSequence
- renameSequence
- dropSequence
- insertRow

Sequences were also updated to have more interesting cases: non-existing sequences can randomly be returned and sequences can be owned by columns.

Furthermore, this change fixes a bug where non-existing tables were getting returned instead of existing ones.

Finally, error screening logic is updated to ignore transaction retry errors. This fixes one of the issues in #56230.

57034: cli: add reset-quorum command r=knz a=TheSamHuang

This adds `reset-quorum` as a command in the debug CLI. This command utilizes the functionality added in #56333 to recover unavailable ranges where all replicas are lost. 

57272: roachtest: Add missing gossiped StoreDescriptor to expected relocate errors r=nvanbenschoten a=nvanbenschoten

Closes #57191.

This seems possible shortly after the cluster starts up, if gossip hasn't propagated by the time the `ALTER TABLE _ EXPERIMENTAL_RELOCATE _` statement is run.

57274: roachtest: remove initial disk space check from drop test r=nvanbenschoten a=nvanbenschoten

Fixes #56040.

Will need to be backported.

This assertion seemed like a good idea, but it was repeatedly (dec148a)
a source of flakiness when fixtures changed. Now that we're using IMPORT
for TPC-C, the check is even harder to get right without making it so
small as to be useless. It doesn't seem to be worth the instability, so
remove it.

57454: util/log: de-flake TestSecondaryGC and some SQL tests r=irfansharif a=knz

Fixes #57442
Fixes #57444 

See individual commits for details.

Co-authored-by: Jayant Shrivastava <[email protected]>
Co-authored-by: Sam Huang <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
Successfully merging this pull request may close these issues.

kvserver: provide an online mechanism to recover unavailable ranges