-
Notifications
You must be signed in to change notification settings - Fork 726
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
mcs, tso: PD/API forward streaming is expensive #6659
Comments
[2023/06/23 16:38:34.508 +00:00] [WARN] [peer.go:267] ["dropped internal Raft message since sending buffer is full (overloaded network)"] [message-type=MsgHeartbeat] [local-member-id=1594bb0eb9d011f6] [from=1594bb0eb9d011f6] [remote-peer-id=5d7de14a2e2790a0] [remote-peer-active=false] |
[2023/06/23 17:04:54.626 +00:00] [INFO] [coordinator.go:320] ["coordinator starts to collect cluster information"] |
…hanism. (#6664) ref #6659 Fix expensive async forwardTSORequest() and its timeout mechanism. In order to handle the timeout case for forwardStream send/recv, the existing logic is to create context.withTimeout(forwardCtx,...) for every request, then start a new goroutine "forwardTSORequest", which is very expensive as shown by the profiling in #6659. This change create a watchDeadline routine per forward stream and reuse it for all the forward requests in which forwardTSORequest is called synchronously. Compared to the existing logic, the new change is much cheaper and the latency is much stable. Signed-off-by: Bin Shi <[email protected]>
…hanism. (tikv#6664) ref tikv#6659 Fix expensive async forwardTSORequest() and its timeout mechanism. In order to handle the timeout case for forwardStream send/recv, the existing logic is to create context.withTimeout(forwardCtx,...) for every request, then start a new goroutine "forwardTSORequest", which is very expensive as shown by the profiling in tikv#6659. This change create a watchDeadline routine per forward stream and reuse it for all the forward requests in which forwardTSORequest is called synchronously. Compared to the existing logic, the new change is much cheaper and the latency is much stable. Signed-off-by: Bin Shi <[email protected]>
…hanism. (tikv#6664) ref tikv#6659 Fix expensive async forwardTSORequest() and its timeout mechanism. In order to handle the timeout case for forwardStream send/recv, the existing logic is to create context.withTimeout(forwardCtx,...) for every request, then start a new goroutine "forwardTSORequest", which is very expensive as shown by the profiling in tikv#6659. This change create a watchDeadline routine per forward stream and reuse it for all the forward requests in which forwardTSORequest is called synchronously. Compared to the existing logic, the new change is much cheaper and the latency is much stable. Signed-off-by: Bin Shi <[email protected]>
Enhancement Task
profile004.pdf
[2023/06/23 16:23:04.959 +00:00] [INFO] [trace.go:152] ["trace[529495468] linearizableReadLoop"] [detail="{readStateIndex:52252203; appliedIndex:52252203; }"] [duration=565.10434ms] [start=2023/06/23 16:23:04.394 +00:00] [end=2023/06/23 16:23:04.959 +00:00] [steps="["trace[529495468] 'read index received' (duration: 565.085558ms)","trace[529495468] 'applied index is now lower than readState.Index' (duration: 3.208µs)"]"]
[2023/06/23 16:23:04.962 +00:00] [WARN] [util.go:163] ["apply request took too long"] [took=567.79578ms] [expected-duration=100ms] [prefix="read-only range "] [request="key:"/pd/7187976276065784319/gc/safe_point" "] [response="range_response_count:1 size:79"] []
[2023/06/23 16:23:04.962 +00:00] [INFO] [trace.go:152] ["trace[1260142690] range"] [detail="{range_begin:/pd/7187976276065784319/gc/safe_point; range_end:; response_count:1; response_revision:49649625; }"] [duration=567.883329ms] [start=2023/06/23 16:23:04.394 +00:00] [end=2023/06/23 16:23:04.962 +00:00] [steps="["trace[1260142690] 'agreement among raft nodes before linearized reading' (duration: 567.767596ms)"]"]
[2023/06/23 16:23:04.966 +00:00] [WARN] [util.go:163] ["apply request took too long"] [took=572.019237ms] [expected-duration=100ms] [prefix="read-only range "] [request="key:"/pd/7187976276065784319/gc/safe_point" "] [response="range_response_count:1 size:79"] []
[2023/06/23 16:23:04.966 +00:00] [INFO] [trace.go:152] ["trace[96759071] range"] [detail="{range_begin:/pd/7187976276065784319/gc/safe_point; range_end:; response_count:1; response_revision:49649625; }"] [duration=572.104726ms] [start=2023/06/23 16:23:04.394 +00:00] [end=2023/06/23 16:23:04.966 +00:00] [steps="["trace[96759071] 'agreement among raft nodes before linearized reading' (duration: 571.971197ms)"]"]
[2023/06/23 16:23:04.968 +00:00] [WARN] [util.go:121] ["failed to apply request"] [took=24.352µs] [request="header:<ID:1294372468269364274 > lease_revoke:id:11f688e6ea3db3f8"] [response=size:30] [error="lease not found"]
[2023/06/23 16:23:06.279 +00:00] [WARN] [util.go:121] ["failed to apply request"] [took=41.321µs] [request="header:<ID:1294372468269364300 > lease_revoke:id:11f688e6ea3db390"] [response=size:30] [error="lease not found"]
[2023/06/23 16:23:06.807 +00:00] [INFO] [grpc_service.go:1742] ["update service GC safe point"] [service-id=gc_worker] [expire-at=-9223372035167238423] [safepoint=442377643433918464]
[2023/06/23 16:23:12.698 +00:00] [WARN] [grpclog.go:60] ["transport: http2Server.HandleStreams failed to read frame: read tcp 10.0.102.147:2379->10.0.121.190:42872: read: connection reset by peer"]
[2023/06/23 16:23:14.325 +00:00] [INFO] [trace.go:152] ["trace[760380547] linearizableReadLoop"] [detail="{readStateIndex:52252313; appliedIndex:52252313; }"] [duration=442.316662ms] [start=2023/06/23 16:23:13.883 +00:00] [end=2023/06/23 16:23:14.325 +00:00] [steps="["trace[760380547] 'read index received' (duration: 442.312855ms)","trace[760380547] 'applied index is now lower than readState.Index' (duration: 2.946µs)"]"]
[2023/06/23 16:23:14.326 +00:00] [WARN] [util.go:163] ["apply request took too long"] [took=442.662924ms] [expected-duration=100ms] [prefix="read-only range "] [request="key:"/pd/7187976276065784319/gc/safe_point" "] [response="range_response_count:1 size:79"] []
[2023/06/23 16:23:14.326 +00:00] [INFO] [trace.go:152] ["trace[548987313] range"] [detail="{range_begin:/pd/7187976276065784319/gc/safe_point; range_end:; response_count:1; response_revision:49649702; }"] [duration=442.759416ms] [start=2023/06/23 16:23:13.883 +00:00] [end=2023/06/23 16:23:14.326 +00:00] [steps="["trace[548987313] 'agreement among raft nodes before linearized reading' (duration: 442.467644ms)"]"]
[2023/06/23 16:23:14.978 +00:00] [WARN] [util.go:121] ["failed to apply request"] [took=43.11µs] [request="header:<ID:1294372468269364437 > lease_revoke:id:11f688e6ea3db3fe"] [response=size:30] [error="lease not found"]
[2023/06/23 16:23:25.639 +00:00] [WARN] [raft.go:440] ["leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk"] [to=bb2483bf620467b7] [heartbeat-interval=500ms] [expected-duration=1s] [exceeded-duration=295.792928ms]
[2023/06/23 16:23:25.639 +00:00] [WARN] [raft.go:440] ["leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk"] [to=1594bb0eb9d011f6] [heartbeat-interval=500ms] [expected-duration=1s] [exceeded-duration=295.878113ms]
[2023/06/23 16:23:32.128 +00:00] [WARN] [raft.go:440] ["leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk"] [to=bb2483bf620467b7] [heartbeat-interval=500ms] [expected-duration=1s] [exceeded-duration=525.556745ms]
[2023/06/23 16:23:32.129 +00:00] [WARN] [raft.go:440] ["leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk"] [to=1594bb0eb9d011f6] [heartbeat-interval=500ms] [expected-duration=1s] [exceeded-duration=525.601339ms]
[2023/06/23 16:23:36.075 +00:00] [WARN] [grpclog.go:60] ["transport: http2Server.HandleStreams failed to read frame: read tcp 10.0.102.147:2379->10.0.121.190:33884: read: connection reset by peer"]
The text was updated successfully, but these errors were encountered: