Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] add mayPromote check #19

Merged
merged 1 commit into from
May 7, 2019
Merged

Conversation

WIZARD-CXY
Copy link

@WIZARD-CXY WIZARD-CXY commented Apr 16, 2019

Below is a WIP mayPromote check. It now compares leader's Match and learner's Match to see if learner has caught with the leader. It implements the design now, but it may have some flaws:

1 Since we plan to do mayPromote check in etcdserver, so it needs the id exists in the cluster, but we check if member which user provides exists later in the validation apply phase. Ideally it should check member if exists, then if member is a learner, finally check if it is ready. But the order now is wrong.

2 IsReadyToPromoteMember borrows idea from IsReadyToAddMember. Do we need more check?

3 Does check Match ok?

Looking for your suggestion @jingyih @xiang90

@jingyih
Copy link
Owner

jingyih commented Apr 16, 2019

Some thoughts:
1.

Since we plan to do mayPromote check in etcdserver, so it needs the id exists in the cluster, but we check if member which user provides exists later in the validation apply phase. Ideally it should check member if exists, then if member is a learner, finally check if it is ready. But the order now is wrong.

Looks like we need to check if member exist, if member is learner and its progress before sending the request to raft. This will cause race, and maybe we will have to live with it. I do not think we can do the progress check in apply stage, because A) only leader has the information, and B) whether or not a config change is applied should NOT depend on storage state.

IsReadyToPromoteMember borrows idea from IsReadyToAddMember. Do we need more check?

IsReadyToAddMember and other similar checks under StrictReconfigCheck needs to be updated since we added learner concept in etcd. See #17. I will update them hopefully later this week. IsReadyToPromoteMember is probably similar to the current IsReadyToAddMember, because promoting is adding a voting member.

Does check Match ok?

Conceptually we want to compare learner's committed index with leader's committed index. Learner's committed index is basically the Match. I am not sure if / how leader's own Match is updated. Alternatively, use

func (s *EtcdServer) CommittedIndex() uint64 { return s.getCommittedIndex() }

@WIZARD-CXY
Copy link
Author

WIZARD-CXY commented Apr 25, 2019

@jingyih I talked with xiang90 about our questions. He said ok about this implementation now.Below is update:
1 I changed the implementaion of mayPromote check. Now it let memberNotExist and memberNotLearner pass through, they will be checked later in the apply validation phase as the already are. I think this might just work:
A promote non-exist member, can check it out
B promote non-learner member, can checkout it out
C promote not ready learner, can checkout it out
D add a learner and soon promote it, can checkou it out

2 we can have a seperate pr for the IsReadyToPromoteMember check.

Copy link
Owner

@jingyih jingyih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the new changes! Given that etcd only allows one pending config change at a time, I think in most cases this will be fine.

etcdserver/server.go Outdated Show resolved Hide resolved
etcdserver/server.go Outdated Show resolved Hide resolved
@jingyih
Copy link
Owner

jingyih commented Apr 25, 2019

Let's also add some helper function on client side to find leader before sending the promote request. Do you want me to work on this?

@WIZARD-CXY
Copy link
Author

Let's also add some helper function on client side to find leader before sending the promote request. Do you want me to work on this?

How about client loop through all the endpoints and retry until we find a endpoint that do not return notleader error? it is low cost and simple.

@jingyih
Copy link
Owner

jingyih commented Apr 26, 2019

Let's also add some helper function on client side to find leader before sending the promote request. Do you want me to work on this?

How about client loop through all the endpoints and retry until we find a endpoint that do not return notleader error? it is low cost and simple.

Correct me if I am wrong:

  1. Based on my reading, looks like once the clientv3 is created (with endpoints as config parameter), you can call clientv3.MemberPromote() but have no control over which endpoint it will use to send the request?

  2. Also, not returning NotLeader does not necessarily mean the endpoint is leader? I think there are other types of errors that can be returned before the request reaches mayPromoteMember. Any error in transport layer (such as TLS errors) and gRPC layer (such as any error returned in this function [1]) can be returned before the request reaches mayPromoteMember. Therefore, getting an error that is not 'NotLeader' does not mean the endpoint is leader.

Ref:
[1]

func newUnaryInterceptor(s *etcdserver.EtcdServer) grpc.UnaryServerInterceptor {

@WIZARD-CXY
Copy link
Author

hmm... you plan to do it on the server side ?

@jingyih
Copy link
Owner

jingyih commented Apr 26, 2019

hmm... you plan to do it on the server side ?

I do not have a full solution yet. I am think about fixing etcdctl member promote first. At least there we can call endpoint status to find the leader, then create clientv3 with leader endpoint and send promote request. Maybe this is good enough for v3.4. I can send a quick PR to gather some feedback.

@jingyih
Copy link
Owner

jingyih commented Apr 26, 2019

BTW, with this change, we should also update TestMemberPromoteForLearner.

@WIZARD-CXY WIZARD-CXY force-pushed the mayPromote branch 2 times, most recently from fa08273 to 7e0f2be Compare April 27, 2019 08:47
@WIZARD-CXY
Copy link
Author

WIZARD-CXY commented Apr 27, 2019

@jingyih changed, ptal.
I changed the TestMemberPromoteForLearner to test not ready learner for now. I don't know how to start the new added learner node so it can catch up with leader.

@WIZARD-CXY WIZARD-CXY force-pushed the mayPromote branch 4 times, most recently from 1a9ebf9 to 43d85f8 Compare April 28, 2019 08:30
etcdserver/server.go Outdated Show resolved Hide resolved
etcdserver/server.go Show resolved Hide resolved
clientv3/integration/cluster_test.go Outdated Show resolved Hide resolved
@jingyih
Copy link
Owner

jingyih commented Apr 29, 2019

@jpbetz Could you also take a look?

Copy link
Owner

@jingyih jingyih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is almost ready to merge. Once this is merged, I will merge #31. Then we completed the solution for learner progress check.

etcdserver/server.go Show resolved Hide resolved
@WIZARD-CXY
Copy link
Author

@jingyih a integration test failed

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x60 pc=0xb0184d]
goroutine 23513 [running]:
go.etcd.io/etcd/vendor/google.golang.org/grpc.(*csAttempt).sendMsg(0xc00018eb00, 0x14c9c00, 0xc0016fa468, 0xc0016fa47a, 0x5, 0x5, 0xc0016fa470, 0xa, 0xa, 0xc0016fa470, ...)
	/go/src/go.etcd.io/etcd/vendor/google.golang.org/grpc/stream.go:713 +0x14d
go.etcd.io/etcd/vendor/google.golang.org/grpc.(*clientStream).SendMsg.func2(0xc00018eb00, 0x48, 0x48)
	/go/src/go.etcd.io/etcd/vendor/google.golang.org/grpc/stream.go:637 +0x186
go.etcd.io/etcd/vendor/google.golang.org/grpc.(*clientStream).withRetry(0xc0001a27e0, 0xc003c2d400, 0xc002431d40, 0xc0016fa47a, 0x5)
	/go/src/go.etcd.io/etcd/vendor/google.golang.org/grpc/stream.go:530 +0x431
go.etcd.io/etcd/vendor/google.golang.org/grpc.(*clientStream).SendMsg(0xc0001a27e0, 0x14c9c00, 0xc0016fa468, 0x0, 0x0)
	/go/src/go.etcd.io/etcd/vendor/google.golang.org/grpc/stream.go:643 +0x80c
go.etcd.io/etcd/etcdserver/etcdserverpb.(*leaseLeaseKeepAliveClient).Send(0xc001900de0, 0xc0016fa468, 0x0, 0x0)
	/go/src/go.etcd.io/etcd/etcdserver/etcdserverpb/rpc.pb.go:3820 +0x6b
go.etcd.io/etcd/clientv3.(*lessor).sendKeepAliveLoop(0xc0007125a0, 0x17e0000, 0xc001900de0)
	/go/src/go.etcd.io/etcd/clientv3/lease.go:573 +0x2fb
created by go.etcd.io/etcd/clientv3.(*lessor).resetRecv
	/go/src/go.etcd.io/etcd/clientv3/lease.go:489 +0x34b
FAIL	go.etcd.io/etcd/clientv3/integration	133.028s

@jingyih
Copy link
Owner

jingyih commented May 6, 2019

It is a flaky test, unrelated to this PR. See etcd-io#10700 for reference.

@WIZARD-CXY
Copy link
Author

@jingyih I plan to merge this and add more test in next pr. WDYT

@WIZARD-CXY
Copy link
Author

On second thought, we must figure out how to fake the raftStatus(), or it will be hard to test the ready or not ready learner

@jingyih
Copy link
Owner

jingyih commented May 7, 2019

Merging this PR. We'll figure out how to fake raftStatus() in server unit test later.

@jingyih jingyih merged commit bd7ad1a into jingyih:learner May 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants