pd-tso-bench: client updateMember can't recover after deleting all PD/API pods w/o graceful period. #6681

binshi-bing · 2023-06-26T19:34:57Z

Enhancement Task

What did I do?

In dev, run:
./pd-tso-bench -v -duration 250000s -pd "http://serverless-cluster-pd-0.serverless-cluster-pd-peer.tidb-serverless.svc:2379" -client 1 -c 1 -interval 10s

Force-kill all PD/API pods at 11:59:18 PDT:
~  kubectl delete pod serverless-cluster-pd-0 serverless-cluster-pd-1 serverless-cluster-pd-2 -n tidb-serverless --force --grace-period=0
Warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "serverless-cluster-pd-0" force deleted
pod "serverless-cluster-pd-1" force deleted
pod "serverless-cluster-pd-2" force deleted
~  date
Mon Jun 26 11:59:18 PDT 2023

Pod 0 started at 11:59:25 PDT and was ready to serve at 12:00:03 PDT:
starting pd-server ...
/pd-server --data-dir=/var/lib/pd --name=serverless-cluster-pd-0 --peer-urls=http://0.0.0.0:2380 --advertise-peer-urls=http://serverless-cluster-pd-0.serverless-cluster-pd-peer.tidb-serverless.svc:2380 --client-urls=http://0.0.0.0:2379 --advertise-client-urls=http://serverless-cluster-pd-0.serverless-cluster-pd-peer.tidb-serverless.svc:2379 --config=/etc/pd/pd.toml --join=http://serverless-cluster-pd-0.serverless-cluster-pd-peer.tidb-serverless.svc:2380,http://serverless-cluster-pd-1.serverless-cluster-pd-peer.tidb-serverless.svc:2380,http://serverless-cluster-pd-2.serverless-cluster-pd-peer.tidb-serverless.svc:2380
[2023/06/26 18:59:25.773 +00:00] [INFO] [versioninfo.go:89] ["Welcome to Placement Driver (API SERVICE)"]
...
[2023/06/26 19:00:03.123 +00:00] [INFO] [manager.go:74] ["Key visual service is started"]

The PD client's updateMember can't recover after the pods restart.
See the log here: https://gist.githubusercontent.com/binshi-bing/d669ed80e48073f4923c51b29ce95642/raw/7b339f6c319333453e9e17dc136393f4a551a5ec/gistfile1.txt

lhy1024 · 2023-06-27T05:03:27Z

Maybe we need to add gRPC keepalive params in pd-tso-bench.

lhy1024 · 2023-06-27T07:36:23Z

It seems the bug comes from gRPC: grpc/grpc-go#4785.

When the API server is restarted, the channel connectivity goes into TRANSIENT_FAILURE.

rleungx · 2023-06-28T02:13:01Z

> It seems the bug comes from gRPC: grpc/grpc-go#4785.
>
> When the API server is restarted, the channel connectivity goes into TRANSIENT_FAILURE.

Does it happen on the client side or the server side?

lhy1024 · 2023-06-28T03:19:30Z

> > It seems the bug comes from gRPC: grpc/grpc-go#4785.
> > When the API server is restarted, the channel connectivity goes into TRANSIENT_FAILURE.
>
> Does it happen on the client side or the server side?

That guess was wrong: it assumed the client used a higher version of gRPC. We only need to add keepalive params.

binshi-bing added the type/enhancement label Jun 26, 2023

lhy1024 mentioned this issue Jun 27, 2023

tools: add keepalive for pd-tso-bench #6699

Merged

ti-chi-bot bot closed this as completed in #6699 Jun 28, 2023

ti-chi-bot bot added a commit that referenced this issue Jun 28, 2023

tools: add keepalive for pd-tso-bench (#6699)

01015a6

close #6681 Signed-off-by: lhy1024 <[email protected]> Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>

binshi-bing changed the title ~~pd client: updateMember can't recover after deleting all PD/API pods w/o graceful period.~~ pd-tso-bench: client updateMember can't recover after deleting all PD/API pods w/o graceful period. Jun 29, 2023

rleungx pushed a commit to rleungx/pd that referenced this issue Aug 2, 2023

tools: add keepalive for pd-tso-bench (tikv#6699)

0935c73

close tikv#6681 Signed-off-by: lhy1024 <[email protected]> Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>

rleungx pushed a commit to rleungx/pd that referenced this issue Aug 2, 2023

tools: add keepalive for pd-tso-bench (tikv#6699)

8e775ae

close tikv#6681 Signed-off-by: lhy1024 <[email protected]> Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
