Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

qps drop to zero during pd leader io hang with error code 9001 #6889

Closed
Lily2025 opened this issue Aug 3, 2023 · 5 comments · Fixed by #6961
Closed

qps drop to zero during pd leader io hang with error code 9001 #6889

Lily2025 opened this issue Aug 3, 2023 · 5 comments · Fixed by #6961
Assignees
Labels
severity/critical type/bug The issue is confirmed as a bug.

Comments

@Lily2025
Copy link

Lily2025 commented Aug 3, 2023

Bug Report

What did you do?

1、run tpcc with 1000 warehouse,10 thread
2、inject pd leader io hang last for 10mins
inject fault:2023-08-02 20:13:14
recover fault:2023-08-02 20:18:14

What did you expect to see?

qps can recover whitin 2mins

What did you see instead?

qps drop to zero during pd leader io hang with error code 9001
image

What version of PD are you using (pd-server -V)?

./pd-server -V
Release Version: v7.3.0-alpha
Edition: Community
Git Commit Hash: 0c537bb
Git Branch: heads/refs/tags/v7.3.0-alpha
UTC Build Time: 2023-08-01 11:37:27
2023-08-02T18:02:53.544+0800

@Lily2025 Lily2025 added the type/bug The issue is confirmed as a bug. label Aug 3, 2023
@Lily2025
Copy link
Author

Lily2025 commented Aug 3, 2023

/type bug
/severity critical
/assign lhy1024

@lhy1024
Copy link
Contributor

lhy1024 commented Aug 3, 2023

pd client always try to connect the pd server which has been io hang, I will investigate it.

[2023/08/02 20:13:32.895 +08:00] [ERROR] [tso_dispatcher.go:487] ["[tso] getTS error"] [dc-location=global] [stream-addr=http://tc-pd-1.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379] [error="[PD:client:ErrClientGetTSO]get TSO failed, after processing requests"]
[2023/08/02 20:13:32.895 +08:00] [WARN] [txn.go:693] ["wait tso failed"] [error="rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster"]
[2023/08/02 20:13:32.896 +08:00] [ERROR] [tso_dispatcher.go:487] ["[tso] getTS error"] [dc-location=global] [stream-addr=http://tc-pd-1.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379] [error="[PD:client:ErrClientGetTSO]get TSO failed, after processing requests"]
[2023/08/02 20:13:32.919 +08:00] [INFO] [resource_manager_client.go:362] ["[resource_manager] token request error"] [error="rpc error: code = Unavailable desc = not leader"]
[2023/08/02 20:13:32.919 +08:00] [INFO] [controller.go:443] ["[resource group controller] token bucket rpc error: rpc error: code = Unavailable desc = not leader"]
[2023/08/02 20:13:32.961 +08:00] [ERROR] [tso_dispatcher.go:487] ["[tso] getTS error"] [dc-location=global] [stream-addr=http://tc-pd-1.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379] [error="[PD:client:ErrClientGetTSO]get TSO failed, after processing requests"]
[2023/08/02 20:13:32.974 +08:00] [ERROR] [tso_dispatcher.go:487] ["[tso] getTS error"] [dc-location=global] [stream-addr=http://tc-pd-1.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379] [error="[PD:client:ErrClientGetTSO]get TSO failed, after processing requests"]

image
Image

@lhy1024
Copy link
Contributor

lhy1024 commented Aug 10, 2023

  1. pd transfer leader successfully

pd-1: leader->follower

[2023/08/02 20:13:21.139 +08:00] [INFO] [raft.go:706] ["46d11cd648086bca became follower at term 7"]
[2023/08/02 20:13:31.145 +08:00] [INFO] [server.go:1644] ["no longer a leader because lease has expired, pd leader will step down"]
[2023/08/02 20:13:42.145 +08:00] [INFO] [server.go:1502] ["start to watch pd leader"] [pd-leader="name:\"tc-pd-2\" member_id:13297815749536988573 peer_urls:\"http://tc-pd-2.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2380\" client_urls:\"http://tc-pd-2.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379\" "]

pd-2: follower->leader

[2023/08/02 20:13:21.161 +08:00] [INFO] [node.go:327] ["raft.node: b88b54f827bf3d9d elected leader b88b54f827bf3d9d at term 7"]
[2023/08/02 20:13:27.681 +08:00] [INFO] [etcdutil.go:298] ["[etcd client] update endpoints"] [num-change=3->2] [last-endpoints="[http://tc-pd-1.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379,http://tc-pd-2.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379,http://tc-pd-0.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379]"] [endpoints="[http://tc-pd-2.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379,http://tc-pd-0.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379]"]
[2023/08/02 20:13:27.682 +08:00] [INFO] [etcdutil.go:298] ["[etcd client] update endpoints"] [num-change=3->2] [last-endpoints="[http://tc-pd-2.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379,http://tc-pd-1.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379,http://tc-pd-0.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379]"] [endpoints="[http://tc-pd-2.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379,http://tc-pd-0.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379]"]
[2023/08/02 20:13:32.169 +08:00] [INFO] [server.go:1506] ["pd leader has changed, try to re-campaign a pd leader"]
[2023/08/02 20:13:32.169 +08:00] [INFO] [server.go:1543] ["start to campaign PD leader"] [campaign-leader-name=tc-pd-2]
[2023/08/02 20:13:32.171 +08:00] [INFO] [server.go:1569] ["campaign PD leader ok"] [campaign-leader-name=tc-pd-2]
[2023/08/02 20:13:32.171 +08:00] [INFO] [server.go:1577] ["initializing the global TSO allocator"]
[2023/08/02 20:13:32.186 +08:00] [INFO] [server.go:1635] ["PD leader is ready to serve"] [leader-name=tc-pd-2]
  1. but both two tidb meet 9001 error
    image

@lhy1024
Copy link
Contributor

lhy1024 commented Aug 15, 2023

Another strange phenomenon, tikv also always chooses pd-1 instead of pd-2, which has become the leader, until the end of the injection delay

[2023/08/02 20:13:32.775 +08:00] [INFO] [util.rs:637] ["connecting to PD endpoint"] [endpoints=http://tc-pd-0.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379]
[2023/08/02 20:13:32.777 +08:00] [INFO] [util.rs:637] ["connecting to PD endpoint"] [endpoints=http://tc-pd-1.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379]
[2023/08/02 20:13:32.779 +08:00] [INFO] [util.rs:806] ["connected to PD member"] [endpoints=http://tc-pd-1.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379]
[2023/08/02 20:13:32.779 +08:00] [INFO] [util.rs:246] ["heartbeat sender and receiver are stale, refreshing ..."]
[2023/08/02 20:13:32.779 +08:00] [INFO] [util.rs:259] ["buckets sender and receiver are stale, refreshing ..."]
[2023/08/02 20:13:32.779 +08:00] [INFO] [util.rs:278] ["acquire_token_buckets sender and receiver are stale, refreshing ..."]
[2023/08/02 20:13:32.780 +08:00] [INFO] [tso.rs:162] ["TSO worker terminated"] [receiver_cause=None] [sender_cause=None]
[2023/08/02 20:13:32.780 +08:00] [INFO] [client.rs:153] ["TSO stream is closed, reconnect to PD"]
[2023/08/02 20:13:32.781 +08:00] [INFO] [util.rs:301] ["update pd client"] [via=] [leader=http://tc-pd-1.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379] [prev_via=] [prev_leader=http://tc-pd-1.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379]
[2023/08/02 20:13:32.781 +08:00] [INFO] [util.rs:433] ["trying to update PD client done"] [spend=6.096379ms]

tikv retry

[2023/08/02 20:18:17.547 +08:00] [INFO] [util.rs:637] ["connecting to PD endpoint"] [endpoints=http://tc-pd-0.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379]
[2023/08/02 20:18:17.549 +08:00] [INFO] [util.rs:637] ["connecting to PD endpoint"] [endpoints=http://tc-pd-2.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379]
[2023/08/02 20:18:17.551 +08:00] [INFO] [util.rs:637] ["connecting to PD endpoint"] [endpoints=http://tc-pd-2.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379]
[2023/08/02 20:18:17.553 +08:00] [INFO] [util.rs:806] ["connected to PD member"] [endpoints=http://tc-pd-2.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379]
[2023/08/02 20:18:17.553 +08:00] [INFO] [util.rs:246] ["heartbeat sender and receiver are stale, refreshing ..."]
[2023/08/02 20:18:17.553 +08:00] [INFO] [util.rs:259] ["buckets sender and receiver are stale, refreshing ..."]
[2023/08/02 20:18:17.553 +08:00] [ERROR] [client.rs:743] ["failed to send heartbeat"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFinished(Some(RpcStatus { code: 0-OK, message: \"\", details: [] })))"]
[2023/08/02 20:18:17.553 +08:00] [INFO] [util.rs:278] ["acquire_token_buckets sender and receiver are stale, refreshing ..."]
[2023/08/02 20:18:17.555 +08:00] [INFO] [util.rs:301] ["update pd client"] [via=] [leader=http://tc-pd-2.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379] [prev_via=] [prev_leader=http://tc-pd-1.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379]
[2023/08/02 20:18:17.555 +08:00] [INFO] [util.rs:433] ["trying to update PD client done"] [spend=7.558725ms]

@lhy1024
Copy link
Contributor

lhy1024 commented Aug 15, 2023

but pd-0 didn't exit watch loop in time, although etcd client has updated.

pd-0: follower->follower

[2023/08/02 20:13:21.140 +08:00] [INFO] [raft.go:706] ["b66471c7c53de25b became follower at term 7"]
[2023/08/02 20:13:21.140 +08:00] [INFO] [raft.go:966] ["b66471c7c53de25b [logterm: 6, index: 10562, vote: 0] cast MsgVote for b88b54f827bf3d9d [logterm: 6, index: 10562] at term 7"]
[2023/08/02 20:13:21.141 +08:00] [INFO] [node.go:327] ["raft.node: b66471c7c53de25b elected leader b88b54f827bf3d9d at term 7"]
[2023/08/02 20:13:30.628 +08:00] [INFO] [etcdutil.go:298] ["[etcd client] update endpoints"] [num-change=3->2] [last-endpoints="[http://tc-pd-1.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379,http://tc-pd-2.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379,http://tc-pd-0.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379]"] [endpoints="[http://tc-pd-2.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379,http://tc-pd-0.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379]"]
[2023/08/02 20:13:30.631 +08:00] [INFO] [etcdutil.go:298] ["[etcd client] update endpoints"] [num-change=3->2] [last-endpoints="[http://tc-pd-1.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379,http://tc-pd-2.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379,http://tc-pd-0.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379]"] [endpoints="[http://tc-pd-2.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379,http://tc-pd-0.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379]"]
....
[2023/08/02 20:18:17.519 +08:00] [INFO] [etcdutil.go:298] ["[etcd client] update endpoints"] [num-change=2->3] [last-endpoints="[http://tc-pd-2.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379,http://tc-pd-0.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379]"] [endpoints="[http://tc-pd-2.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379,http://tc-pd-0.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379,http://tc-pd-1.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379]"]
[2023/08/02 20:18:17.519 +08:00] [INFO] [etcdutil.go:298] ["[etcd client] update endpoints"] [num-change=2->3] [last-endpoints="[http://tc-pd-2.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379,http://tc-pd-0.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379]"] [endpoints="[http://tc-pd-0.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379,http://tc-pd-2.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379,http://tc-pd-1.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379]"]
...
[2023/08/02 20:18:18.458 +08:00] [INFO] [server.go:1506] ["pd leader has changed, try to re-campaign a pd leader"]
[2023/08/02 20:18:18.458 +08:00] [INFO] [server.go:1531] ["skip campaigning of pd leader and check later"] [server-name=tc-pd-0] [etcd-leader-id=13297815749536988573] [member-id=13142754715395547739]
[2023/08/02 20:18:18.661 +08:00] [INFO] [server.go:1706] ["server enable region storage"]
[2023/08/02 20:18:18.661 +08:00] [INFO] [server.go:1502] ["start to watch pd leader"] [pd-leader="name:\"tc-pd-2\" member_id:13297815749536988573 peer_urls:\"http://tc-pd-2.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2380\" client_urls:\"http://tc-pd-2.tc-pd-peer.endless-ha-test-tps-1831623-1-560.svc:2379\" "]

pd client will get old leader and directly return when connected pd-0, so pd client cannot update member normally.

func (c *pdServiceDiscovery) updateMember() error {
	for i, url := range c.GetServiceURLs() {
		failpoint.Inject("skipFirstUpdateMember", func() {
			if i == 0 {
				failpoint.Continue()
			}
		})

		members, err := c.getMembers(c.ctx, url, updateMemberTimeout)
		// Check the cluster ID.
		if err == nil && members.GetHeader().GetClusterId() != c.clusterID {
			err = errs.ErrClientUpdateMember.FastGenByArgs("cluster id does not match")
		}
		// Check the TSO Allocator Leader.
		var errTSO error
		if err == nil {
			if members.GetLeader() == nil || len(members.GetLeader().GetClientUrls()) == 0 {
				err = errs.ErrClientGetLeader.FastGenByArgs("leader address doesn't exist")
			}
			// Still need to update TsoAllocatorLeaders, even if there is no PD leader
			errTSO = c.switchTSOAllocatorLeaders(members.GetTsoAllocatorLeaders())
		}

		// Failed to get members
		if err != nil {
			log.Info("[pd] cannot update member from this address",
				zap.String("address", url),
				errs.ZapError(err))
			select {
			case <-c.ctx.Done():
				return errors.WithStack(err)
			default:
				continue
			}
		}

		c.updateURLs(members.GetMembers())
		c.updateFollowers(members.GetMembers(), members.GetLeader())
		if err := c.switchLeader(members.GetLeader().GetClientUrls()); err != nil {
			return err
		}

		// If `switchLeader` succeeds but `switchTSOAllocatorLeader` has an error,
		// the error of `switchTSOAllocatorLeader` will be returned.
		return errTSO
	}
	return errs.ErrClientGetMember.FastGenByArgs()
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
severity/critical type/bug The issue is confirmed as a bug.
Projects
Development

Successfully merging a pull request may close this issue.

2 participants