Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The watch leader loop should exit ASAP when etcd leader changes #8823

Closed
rleungx opened this issue Nov 19, 2024 · 0 comments · Fixed by #8824
Closed

The watch leader loop should exit ASAP when etcd leader changes #8823

rleungx opened this issue Nov 19, 2024 · 0 comments · Fixed by #8824
Labels
affects-5.4 This bug affects the 5.4.x(LTS) versions. affects-6.1 This bug affects the 6.1.x(LTS) versions. affects-6.5 This bug affects the 6.5.x(LTS) versions. affects-7.1 This bug affects the 7.1.x(LTS) versions. affects-7.5 This bug affects the 7.5.x(LTS) versions. affects-8.1 This bug affects the 8.1.x(LTS) versions. affects-8.5 This bug affects the 8.5.x(LTS) versions. report/customer Customers have encountered this bug. severity/major type/bug The issue is confirmed as a bug.

Comments

@rleungx
Copy link
Member

rleungx commented Nov 19, 2024

Bug Report

pd/pkg/syncer/client.go

Lines 149 to 169 in 40fe83e

if err != nil {
if ev, ok := status.FromError(err); ok {
if ev.Code() == codes.Canceled {
return
}
}
log.Error("server failed to establish sync stream with leader", zap.String("server", s.server.Name()), zap.String("leader", s.server.GetLeader().GetName()), errs.ZapError(err))
time.Sleep(time.Second)
continue
}
log.Info("server starts to synchronize with leader", zap.String("server", s.server.Name()), zap.String("leader", s.server.GetLeader().GetName()), zap.Uint64("request-index", s.history.getNextIndex()))
for {
resp, err := stream.Recv()
if err != nil {
s.streamingRunning.Store(false)
log.Error("region sync with leader meet error", errs.ZapError(errs.ErrGRPCRecv, err))
if err = stream.CloseSend(); err != nil {
log.Error("failed to terminate client stream", errs.ZapError(errs.ErrGRPCCloseSend, err))
}
time.Sleep(time.Second)
break

Right now, we use sleep to prevent retrying to sync with the leader too frequently. However, it will indeed affect the PD election time if the etcd leader changes. The follower must wait for 1s to exit the loop before starting the election. See both line 156 and line 168. We can use the context to make it exit ASAP.

Here is the case that slows down the leader election:

  "[2024/11/14 18:56:49.246 +00:00] [INFO] [raft.go:1348] [\"ec5a786a755b23bd [term 74] received MsgTimeoutNow from 81a4031a31407385 and starts an election to get leadership.\"]",
  "[2024/11/14 18:56:49.246 +00:00] [INFO] [raft.go:719] [\"ec5a786a755b23bd became candidate at term 75\"]",
  "[2024/11/14 18:56:49.246 +00:00] [INFO] [raft.go:830] [\"ec5a786a755b23bd received MsgVoteResp from ec5a786a755b23bd at term 75\"]",
  "[2024/11/14 18:56:49.246 +00:00] [INFO] [raft.go:817] [\"ec5a786a755b23bd [logterm: 74, index: 671118265] sent MsgVote request to 6b3fd9f5a8c74ff3 at term 75\"]",
  "[2024/11/14 18:56:49.246 +00:00] [INFO] [raft.go:817] [\"ec5a786a755b23bd [logterm: 74, index: 671118265] sent MsgVote request to 81a4031a31407385 at term 75\"]",
  "[2024/11/14 18:56:49.246 +00:00] [INFO] [raft.go:817] [\"ec5a786a755b23bd [logterm: 74, index: 671118265] sent MsgVote request to ad4c4e35a9bd4181 at term 75\"]",
  "[2024/11/14 18:56:49.246 +00:00] [INFO] [raft.go:817] [\"ec5a786a755b23bd [logterm: 74, index: 671118265] sent MsgVote request to f587050a2df8fa3f at term 75\"]",
  "[2024/11/14 18:56:49.246 +00:00] [INFO] [node.go:333] [\"raft.node: ec5a786a755b23bd lost leader 81a4031a31407385 at term 75\"]",
  "[2024/11/14 18:56:49.249 +00:00] [INFO] [raft.go:830] [\"ec5a786a755b23bd received MsgVoteResp from 81a4031a31407385 at term 75\"]",
  "[2024/11/14 18:56:49.249 +00:00] [INFO] [raft.go:1295] [\"ec5a786a755b23bd has received 2 MsgVoteResp votes and 0 vote rejections\"]",
  "[2024/11/14 18:56:49.250 +00:00] [INFO] [raft.go:830] [\"ec5a786a755b23bd received MsgVoteResp from 6b3fd9f5a8c74ff3 at term 75\"]",
  "[2024/11/14 18:56:49.250 +00:00] [INFO] [raft.go:1295] [\"ec5a786a755b23bd has received 3 MsgVoteResp votes and 0 vote rejections\"]",
  "[2024/11/14 18:56:49.250 +00:00] [INFO] [raft.go:771] [\"ec5a786a755b23bd became leader at term 75\"]",
  "[2024/11/14 18:56:49.250 +00:00] [INFO] [node.go:327] [\"raft.node: ec5a786a755b23bd elected leader ec5a786a755b23bd at term 75\"]",
  "[2024/11/14 18:56:49.284 +00:00] [INFO] [leadership.go:375] [\"current leadership is deleted\"] [revision=647767897] [leader-key=/pd/7366397889337882103/leader] [purpose=\"leader election\"]",
  "[2024/11/14 18:56:49.284 +00:00] [ERROR] [client.go:161] [\"region sync with leader meet error\"] [error=\"[PD:grpc:ErrGRPCRecv]receive response error: rpc error: code = Canceled desc = context canceled\"]",
  "[2024/11/14 18:56:50.284 +00:00] [INFO] [server.go:1669] [\"pd leader has changed, try to re-campaign a pd leader\"]",
  "[2024/11/14 18:56:50.284 +00:00] [INFO] [server.go:1706] [\"start to campaign PD leader\"] [campaign-leader-name=xxx.pd-1]",
  "[2024/11/14 18:56:50.286 +00:00] [INFO] [lease.go:66] [\"lease granted\"] [lease-id=4720215488460578587] [lease-timeout=3] [purpose=\"leader election\"]",
  "[2024/11/14 18:56:50.288 +00:00] [INFO] [leadership.go:182] [\"check campaign resp\"] [resp=\"{\\\"header\\\":{\\\"cluster_id\\\":8535908639037064845,\\\"member_id\\\":17031057339485725629,\\\"revision\\\":64776
7925,\\\"raft_term\\\":75},\\\"succeeded\\\":true,\\\"responses\\\":[{\\\"Response\\\":{\\\"ResponsePut\\\":{\\\"header\\\":{\\\"revision\\\":647767925}}}}]}\"]",
  "[2024/11/14 18:56:50.288 +00:00] [INFO] [leadership.go:191] [\"write leaderData to leaderPath ok\"] [leader-key=/pd/7366397889337882103/leader] [purpose=\"leader election\"]",
  "[2024/11/14 18:56:50.288 +00:00] [INFO] [server.go:1732] [\"campaign PD leader ok\"] [campaign-leader-name=xxx.pd-1]",
@rleungx rleungx added the type/bug The issue is confirmed as a bug. label Nov 19, 2024
@ti-chi-bot ti-chi-bot bot added may-affects-5.4 may-affects-6.1 may-affects-6.5 may-affects-7.1 may-affects-7.5 may-affects-8.1 may-affects-8.5 affects-6.5 This bug affects the 6.5.x(LTS) versions. affects-8.5 This bug affects the 8.5.x(LTS) versions. affects-8.1 This bug affects the 8.1.x(LTS) versions. affects-7.5 This bug affects the 7.5.x(LTS) versions. affects-7.1 This bug affects the 7.1.x(LTS) versions. affects-6.1 This bug affects the 6.1.x(LTS) versions. affects-5.4 This bug affects the 5.4.x(LTS) versions. and removed may-affects-6.5 may-affects-8.5 may-affects-8.1 may-affects-7.5 may-affects-7.1 may-affects-6.1 may-affects-5.4 labels Nov 20, 2024
@ti-chi-bot ti-chi-bot bot closed this as completed in 41ec8dc Nov 20, 2024
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this issue Nov 20, 2024
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this issue Nov 20, 2024
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this issue Nov 20, 2024
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this issue Nov 20, 2024
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this issue Nov 20, 2024
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this issue Nov 20, 2024
ti-chi-bot bot pushed a commit that referenced this issue Dec 3, 2024
close #8823

Signed-off-by: ti-chi-bot <[email protected]>
Signed-off-by: Ryan Leung <[email protected]>

Co-authored-by: Ryan Leung <[email protected]>
@ti-chi-bot ti-chi-bot bot added the report/customer Customers have encountered this bug. label Dec 4, 2024
ti-chi-bot bot pushed a commit that referenced this issue Dec 6, 2024
rleungx added a commit to ti-chi-bot/pd that referenced this issue Dec 6, 2024
ti-chi-bot bot added a commit that referenced this issue Dec 6, 2024
close #8823

Signed-off-by: Ryan Leung <[email protected]>

Co-authored-by: Ryan Leung <[email protected]>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
rleungx added a commit to rleungx/pd that referenced this issue Dec 6, 2024
ti-chi-bot bot pushed a commit that referenced this issue Dec 6, 2024
rleungx added a commit to ti-chi-bot/pd that referenced this issue Dec 11, 2024
ti-chi-bot bot added a commit that referenced this issue Dec 11, 2024
close #8823

Signed-off-by: ti-chi-bot <[email protected]>
Signed-off-by: Ryan Leung <[email protected]>

Co-authored-by: Ryan Leung <[email protected]>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects-5.4 This bug affects the 5.4.x(LTS) versions. affects-6.1 This bug affects the 6.1.x(LTS) versions. affects-6.5 This bug affects the 6.5.x(LTS) versions. affects-7.1 This bug affects the 7.1.x(LTS) versions. affects-7.5 This bug affects the 7.5.x(LTS) versions. affects-8.1 This bug affects the 8.1.x(LTS) versions. affects-8.5 This bug affects the 8.5.x(LTS) versions. report/customer Customers have encountered this bug. severity/major type/bug The issue is confirmed as a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant