changefeedccl: 100k range catchup scan benchmark fails to complete #108157

Closed
erikgrinaker opened this issue Aug 4, 2023 · 5 comments
Labels: A-cdc (Change Data Capture), C-bug, C-performance, skipped-test, T-cdc

Comments


erikgrinaker commented Aug 4, 2023

The 100k range catchup scan benchmark in #107722 (cdc/scan/catchup/nodes=5/cpu=16/rows=1G/ranges=100K/protocol=rangefeed/format=json/sink=null) fails to complete: it either fails with DistSQL inbox errors, or the changefeed restarts the catchup scans. We should find out why and fix it, even if the answer is simply that the cluster can't handle the load.

The test has been skipped for now.
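
For reference, roachtests are typically skipped by setting the Skip field on their test spec. Below is a minimal, hypothetical sketch of what that looks like; the registration function name and the exact spec values are illustrative, not the actual cdc bench registration:

```go
package tests

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster"
	"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/registry"
	"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/spec"
	"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test"
)

// registerSkippedCatchupBench is a hypothetical helper showing how a
// roachtest is marked as skipped via the Skip field on its spec.
func registerSkippedCatchupBench(r registry.Registry) {
	r.Add(registry.TestSpec{
		Name:    "cdc/scan/catchup/nodes=5/cpu=16/rows=1G/ranges=100K/protocol=rangefeed/format=json/sink=null",
		Owner:   registry.OwnerCDC,
		Cluster: r.MakeClusterSpec(6, spec.CPU(16)), // 5 CRDB nodes + 1 workload node
		// Skip the test until the failure mode is understood.
		Skip: "fails to complete, see https://github.com/cockroachdb/cockroach/issues/108157",
		Run: func(ctx context.Context, t test.Test, c cluster.Cluster) {
			// Benchmark body elided.
		},
	})
}
```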

Jira issue: CRDB-30335

Epic CRDB-26372

erikgrinaker added the C-bug, C-performance, and T-cdc labels Aug 4, 2023

blathers-crl bot commented Aug 4, 2023

cc @cockroachdb/cdc

erikgrinaker commented:

I see frequent changefeed restarts due to RPC connection failures between nodes, likely from overload or timeouts, although I haven't been able to pin them down. I'll try a run with a dedicated rangefeed RPC connection and with admission control disabled.
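
As a rough sketch, that run configuration could be applied via cluster settings from a small Go client. admission.kv.enabled is a real cluster setting; kv.rangefeed.use_dedicated_connection_class.enabled is assumed here to be the dedicated-connection knob, and the connection string is illustrative:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // Postgres-wire driver; CockroachDB speaks pgwire.
)

func main() {
	// Connect to any node in the benchmark cluster (address is illustrative).
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Disable KV admission control cluster-wide. Whether this is the right
	// lever for catch-up scans is exactly what this run is meant to test.
	if _, err := db.Exec(`SET CLUSTER SETTING admission.kv.enabled = false`); err != nil {
		log.Fatal(err)
	}

	// Route rangefeed traffic over a dedicated RPC connection class. The
	// setting name below is an assumption for illustration.
	if _, err := db.Exec(`SET CLUSTER SETTING kv.rangefeed.use_dedicated_connection_class.enabled = true`); err != nil {
		log.Fatal(err)
	}
}
```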

erikgrinaker commented:

Still fails with a dedicated rangefeed RPC class and catchup-scan admission control disabled.

In one example, we see a single node n2 lose network connectivity to all other nodes, failing with EOF:

W230804 09:41:12.906047 3328 kv/kvserver/closedts/sidetransport/sender.go:838 ⋮ [T1,n2,ctstream=4] 80040 failed to send closed timestamp message 15017 to n4: send msg error: ‹EOF›
W230804 09:41:12.906057 3327 kv/kvserver/closedts/sidetransport/sender.go:838 ⋮ [T1,n2,ctstream=3] 80041 failed to send closed timestamp message 15017 to n3: send msg error: ‹EOF›
W230804 09:41:12.906051 3329 kv/kvserver/closedts/sidetransport/sender.go:838 ⋮ [T1,n2,ctstream=5] 80042 failed to send closed timestamp message 15017 to n5: send msg error: ‹EOF›
W230804 09:41:12.906051 3326 kv/kvserver/closedts/sidetransport/sender.go:838 ⋮ [T1,n2,ctstream=1] 80043 failed to send closed timestamp message 15017 to n1: send msg error: ‹EOF›

This strikes me as suspect. I could have accepted it as a random infra fluke, but this keeps happening in every single benchmark run. I wonder if we're hitting gRPC limits or something.
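
To illustrate the kind of limit I mean: with the per-range (non-mux) rangefeed protocol, a 100k-range catch-up scan opens roughly one gRPC stream per range, so per-connection caps such as grpc-go's MaxConcurrentStreams are one plausible candidate. The snippet below is purely illustrative, not CockroachDB's actual server configuration:

```go
package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
)

func main() {
	// Listen on an ephemeral port; the address is illustrative only.
	lis, err := net.Listen("tcp", "localhost:0")
	if err != nil {
		log.Fatal(err)
	}
	defer lis.Close()

	// MaxConcurrentStreams caps concurrent HTTP/2 streams per connection.
	// With one rangefeed stream per range, 100k ranges spread over a handful
	// of node-to-node connections could brush up against a cap like this.
	// The value below is made up for illustration.
	srv := grpc.NewServer(
		grpc.MaxConcurrentStreams(2048),
	)
	defer srv.Stop()
	// Service registration and srv.Serve(lis) elided.
}
```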

erikgrinaker commented:

We see this across all connection classes, and the connections immediately recover afterwards:

E230804 09:41:12.834957 3137 2@rpc/peer.go:590 ⋮ [T1,n2,rnode=5,raddr=‹10.142.1.2:26257›,class=default,rpc] 855 disconnected (was healthy for 58m27.089s): grpc: ‹connection error: desc = "transport: error while dialing: connection interrupted (did the remote node shut down or are there networking issues?)"› [code 14/Unavailable]
E230804 09:41:12.838494 546 2@rpc/peer.go:590 ⋮ [T1,n2,rnode=1,raddr=‹10.142.0.2:26257›,class=default,rpc] 856 disconnected (was healthy for 58m44.327s): grpc: ‹connection error: desc = "transport: error while dialing: connection interrupted (did the remote node shut down or are there networking issues?)"› [code 14/Unavailable]
E230804 09:41:12.840960 6497915 2@rpc/peer.go:590 ⋮ [T1,n2,rnode=6,raddr=‹10.142.0.220:26257›,class=default,rpc] 857 disconnected (was healthy for 10m39.984s): grpc: ‹error reading from server: EOF› [code 14/Unavailable]
E230804 09:41:12.841478 306 2@rpc/peer.go:590 ⋮ [T1,n2,rnode=?,raddr=‹34.148.94.148:26257›,class=system,rpc] 858 disconnected (was healthy for 58m44.346s): grpc: ‹error reading from server: EOF› [code 14/Unavailable]
E230804 09:41:12.841620 2878 2@rpc/peer.go:590 ⋮ [T1,n2,rnode=4,raddr=‹10.142.0.227:26257›,class=system,rpc] 859 disconnected (was healthy for 58m32.721s): grpc: ‹error reading from server: EOF› [code 14/Unavailable]
E230804 09:41:12.841826 3081 2@rpc/peer.go:590 ⋮ [T1,n2,rnode=5,raddr=‹10.142.1.2:26257›,class=system,rpc] 860 disconnected (was healthy for 58m27.098s): grpc: ‹error reading from server: EOF› [code 14/Unavailable]
E230804 09:41:12.842237 2888 2@rpc/peer.go:590 ⋮ [T1,n2,rnode=4,raddr=‹10.142.0.227:26257›,class=default,rpc] 861 disconnected (was healthy for 58m32.721s): grpc: ‹error reading from server: EOF› [code 14/Unavailable]
E230804 09:41:12.843048 6424353 2@rpc/peer.go:590 ⋮ [T1,n2,rnode=6,raddr=‹10.142.0.220:26257›,class=system,rpc] 862 disconnected (was healthy for 10m48.009s): grpc: ‹error reading from server: EOF› [code 14/Unavailable]
E230804 09:41:12.845610 2904 2@rpc/peer.go:590 ⋮ [T1,n2,rnode=3,raddr=‹10.142.0.225:26257›,class=default,rpc] 863 disconnected (was healthy for 58m26.965s): grpc: ‹connection error: desc = "transport: error while dialing: connection interrupted (did the remote node shut down or are there networking issues?)"› [code 14/Unavailable]
E230804 09:41:12.849446 3505 2@rpc/peer.go:590 ⋮ [T1,n2,rnode=2,raddr=‹10.142.0.233:26257›,class=default,rpc] 864 disconnected (was healthy for 58m26.591s): grpc: ‹error reading from server: EOF› [code 14/Unavailable]
W230804 09:41:12.858687 257 2@gossip/gossip.go:1407 ⋮ [T1,n2] 865 no incoming or outgoing connections
E230804 09:41:12.863144 3087 2@rpc/peer.go:590 ⋮ [T1,n2,rnode=3,raddr=‹10.142.0.225:26257›,class=system,rpc] 866 disconnected (was healthy for 58m27.01s): grpc: ‹connection error: desc = "transport: error while dialing: connection interrupted (did the remote node shut down or are there networking issues?)"› [code 14/Unavailable]
E230804 09:41:12.865462 310 2@rpc/peer.go:590 ⋮ [T1,n2,rnode=1,raddr=‹10.142.0.2:26257›,class=system,rpc] 867 disconnected (was healthy for 58m44.369s): grpc: ‹connection error: desc = "transport: error while dialing: connection interrupted (did the remote node shut down or are there networking issues?)"› [code 14/Unavailable]
I230804 09:41:12.866803 3505 2@rpc/peer.go:537 ⋮ [T1,n2,rnode=2,raddr=‹10.142.0.233:26257›,class=default,rpc] 868 ‹connection is now healthy (after 0s)›
I230804 09:41:12.871237 310 2@rpc/peer.go:537 ⋮ [T1,n2,rnode=1,raddr=‹10.142.0.2:26257›,class=system,rpc] 869 ‹connection is now healthy (after 0s)›
I230804 09:41:12.876902 546 2@rpc/peer.go:537 ⋮ [T1,n2,rnode=1,raddr=‹10.142.0.2:26257›,class=default,rpc] 870 ‹connection is now healthy (after 0s)›
I230804 09:41:12.877221 2878 2@rpc/peer.go:537 ⋮ [T1,n2,rnode=4,raddr=‹10.142.0.227:26257›,class=system,rpc] 872 ‹connection is now healthy (after 0s)›
I230804 09:41:12.877217 2888 2@rpc/peer.go:537 ⋮ [T1,n2,rnode=4,raddr=‹10.142.0.227:26257›,class=default,rpc] 871 ‹connection is now healthy (after 0s)›
I230804 09:41:12.877235 3081 2@rpc/peer.go:537 ⋮ [T1,n2,rnode=5,raddr=‹10.142.1.2:26257›,class=system,rpc] 873 ‹connection is now healthy (after 0s)›
I230804 09:41:12.877246 6424353 2@rpc/peer.go:537 ⋮ [T1,n2,rnode=6,raddr=‹10.142.0.220:26257›,class=system,rpc] 874 ‹connection is now healthy (after 0s)›
I230804 09:41:12.896674 3137 2@rpc/peer.go:537 ⋮ [T1,n2,rnode=5,raddr=‹10.142.1.2:26257›,class=default,rpc] 875 ‹connection is now healthy (after 0s)›
I230804 09:41:12.897654 3087 2@rpc/peer.go:537 ⋮ [T1,n2,rnode=3,raddr=‹10.142.0.225:26257›,class=system,rpc] 876 ‹connection is now healthy (after 0s)›
I230804 09:41:12.898020 7309479 2@rpc/peer.go:537 ⋮ [T1,n2,rnode=?,raddr=‹10.142.0.225:26257›,class=system,rpc] 877 ‹connection is now healthy›
I230804 09:41:12.898189 2904 2@rpc/peer.go:537 ⋮ [T1,n2,rnode=3,raddr=‹10.142.0.225:26257›,class=default,rpc] 878 ‹connection is now healthy (after 0s)›
I230804 09:41:12.903133 6497915 2@rpc/peer.go:537 ⋮ [T1,n2,rnode=6,raddr=‹10.142.0.220:26257›,class=default,rpc] 879 ‹connection is now healthy (after 0s)›

miretskiy self-assigned this Aug 7, 2023
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Aug 24, 2023
Re-enable regular rangefeed catchup benchmark over 100k ranges.
Adjust cdc bench configuration to ensure the benchmark completes
in reasonable time.

Fixes cockroachdb#108157

Release note: None
erikgrinaker commented:

FWIW, the connection closures are an instance of #109317, caused by node overload (it's not yet clear why it affects the kernel to this extent).

Even with the connection closures addressed via a higher gRPC server timeout, the benchmark still fails to complete due to node overload.
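
For reference, "a higher gRPC server timeout" is assumed here to refer to grpc-go's server-side keepalive enforcement (keepalive.ServerParameters.Timeout). A minimal sketch with illustrative values, not the values actually used:

```go
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// Raising the keepalive Timeout means an overloaded peer that is slow to
	// ack pings is not torn down as aggressively. Values are illustrative.
	srv := grpc.NewServer(
		grpc.KeepaliveParams(keepalive.ServerParameters{
			Time:    10 * time.Second, // ping a connection after this much idle time
			Timeout: 60 * time.Second, // wait this long for a ping ack before closing
		}),
	)
	defer srv.Stop()
	// Service registration and Serve() elided.
}
```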

miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Sep 7, 2023
Re-enable regular rangefeed catchup benchmark over 100k ranges.
Adjust cdc bench configuration to ensure the benchmark completes
in reasonable time.

Fixes cockroachdb#108157

Release note: None
craig bot closed this as completed in 526100a Sep 11, 2023