Detect leader health and automatically do failover #6403

Closed
nolouch opened this issue May 4, 2023 · 1 comment · Fixed by #6447
Labels
affects-6.5 This bug affects the 6.5.x(LTS) versions. affects-7.1 This bug affects the 7.1.x(LTS) versions. severity/major type/bug The issue is confirmed as a bug. type/enhancement The issue or PR belongs to an enhancement.

Comments

@nolouch
Contributor

nolouch commented May 4, 2023

Feature Request

Describe your feature request related problem

PD binds the PD leader to the etcd leader to reduce what users need to understand. Given this behavior, we hit a problem where the PD leader lease is lost but the etcd leader is not, and the previous PD leader cannot be elected again because of problems in the leader election. We can simulate the problem by dropping all packets coming from this connection:

sudo iptables -A INPUT -p tcp --sport 44438 -j DROP

While the etcd raft heartbeat that keeps etcd leadership goes to other nodes, the PD leader lease keepalive goes directly to the local peer advertise address, so these are completely different connections. In this case the PD leader is lost, but the followers cannot elect a new leader because the etcd leader is still the old node. PD then cannot serve, and the cluster is unavailable for a long time, until the etcd leader changes.

Describe the feature you'd like

Reduce the unavailable time of the cluster.

Describe alternatives solutions you've considered

PD Leader health detect

All followers watch the PD leader's key, so every member already knows who the leader is. We can store the leader's member ID and the update time in the memory of all members. Once the leader key's lease is lost, the leader key is deleted because the lease expired, and all members learn this by watching the key. They then clear the leader ID and record the update time, and reset the record once a new leader is elected.

The leader record struct looks like:

type leaderEvent struct {
    leader      *pdpb.Member
    updatedTime time.Time
}

Members can watch the leader key and handle changes accordingly here:

pd/server/server.go

Lines 1428 to 1430 in 46fdd96

log.Info("start to watch pd leader", zap.Stringer("pd-leader", leader))
// WatchLeader will keep looping and never return unless the PD leader has changed.
leader.Watch(s.serverLoopCtx)

Resign etcd leader if no pd leader for a long time

Once they know the PD leader and the update time, PD members can decide to fail over by having etcd hold a new election, based on how long the leader has been lost. This logic can go here:

pd/server/server.go

Lines 1567 to 1582 in 46fdd96

func (s *Server) etcdLeaderLoop() {
    defer logutil.LogPanic()
    defer s.serverLoopWg.Done()
    ctx, cancel := context.WithCancel(s.serverLoopCtx)
    defer cancel()
    for {
        select {
        case <-time.After(s.cfg.LeaderPriorityCheckInterval.Duration):
            s.member.CheckPriority(ctx)
        case <-ctx.Done():
            log.Info("server is closed, exit etcd leader loop")
            return
        }
    }
}

Once we detect that there is an etcd leader but no PD leader for a long time (such as 10 * etcdElectionTimeout), we can let the first follower member, sorted by member ID, trigger an etcd re-election. The interface to use is:

func (m *EmbeddedEtcdMember) MoveEtcdLeader(ctx context.Context, old, new uint64) error {

ETA

One week; fix on master.

@nolouch nolouch added the type/feature-request Categorizes issue or PR as related to a new feature. label May 4, 2023
@nolouch nolouch self-assigned this May 5, 2023
@nolouch nolouch added type/bug The issue is confirmed as a bug. type/enhancement The issue or PR belongs to an enhancement. and removed type/feature-request Categorizes issue or PR as related to a new feature. labels May 5, 2023
@nolouch nolouch added affects-6.5 This bug affects the 6.5.x(LTS) versions. affects-7.1 This bug affects the 7.1.x(LTS) versions. severity/major labels May 5, 2023
ti-chi-bot bot pushed a commit that referenced this issue May 12, 2023
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this issue May 12, 2023
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this issue May 12, 2023
ti-chi-bot bot added a commit that referenced this issue May 12, 2023
…d leader intact (#6447)

close #6403

server: fix the leader cannot election after pd leader lost while etcd leader intact

Signed-off-by: nolouch <[email protected]>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this issue May 12, 2023
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this issue May 12, 2023
ti-chi-bot bot added a commit that referenced this issue May 15, 2023
…d leader intact (#6447) (#6461)

close #6403, ref #6447

server: fix the leader cannot election after pd leader lost while etcd leader intact

Signed-off-by: ti-chi-bot <[email protected]>
Signed-off-by: nolouch <[email protected]>

Co-authored-by: ShuNing <[email protected]>
Co-authored-by: nolouch <[email protected]>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ti-chi-bot bot added a commit that referenced this issue May 15, 2023
ref #6403, ref #6409

Signed-off-by: ti-chi-bot <[email protected]>
Signed-off-by: Ryan Leung <[email protected]>

Co-authored-by: Ryan Leung <[email protected]>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ti-chi-bot bot added a commit that referenced this issue May 24, 2023
…d leader intact (#6447) (#6460)

close #6403, ref #6447

server: fix the leader cannot election after pd leader lost while etcd leader intact

Signed-off-by: ti-chi-bot <[email protected]>
Signed-off-by: nolouch <[email protected]>

Co-authored-by: ShuNing <[email protected]>
Co-authored-by: nolouch <[email protected]>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ti-chi-bot bot added a commit that referenced this issue May 24, 2023
ref #6403, ref #6409

Signed-off-by: ti-chi-bot <[email protected]>
Signed-off-by: Ryan Leung <[email protected]>

Co-authored-by: Ryan Leung <[email protected]>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
@nolouch
Contributor Author

nolouch commented Aug 25, 2023

ref #6403, ref #6556
