changefeedccl: 100k range catchup scan benchmark fails to complete #108157

Closed
erikgrinaker opened this issue Aug 4, 2023 · 5 comments
Labels: A-cdc (Change Data Capture), C-bug, C-performance, skipped-test, T-cdc

Comments


erikgrinaker commented Aug 4, 2023

The 100k range catchup scan benchmark in #107722 (cdc/scan/catchup/nodes=5/cpu=16/rows=1G/ranges=100K/protocol=rangefeed/format=json/sink=null) fails to complete: it either fails with DistSQL inbox errors, or the changefeed restarts the catchup scans. We should find out why and fix it, even if the answer is simply that the cluster can't handle the load.

The test has been skipped for now.
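
For reference, roachtests are typically skipped by setting the Skip field on their test spec. Below is a minimal, hypothetical sketch of what that looks like; the registration function name and the exact spec values are illustrative, not the actual cdc bench registration:

```go
package tests

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster"
	"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/registry"
	"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/spec"
	"github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test"
)

// registerSkippedCatchupBench is a hypothetical helper showing how a
// roachtest is marked as skipped via the Skip field on its spec.
func registerSkippedCatchupBench(r registry.Registry) {
	r.Add(registry.TestSpec{
		Name:    "cdc/scan/catchup/nodes=5/cpu=16/rows=1G/ranges=100K/protocol=rangefeed/format=json/sink=null",
		Owner:   registry.OwnerCDC,
		Cluster: r.MakeClusterSpec(6, spec.CPU(16)), // 5 CRDB nodes + 1 workload node
		// Skip the test until the failure mode is understood.
		Skip: "fails to complete, see https://github.com/cockroachdb/cockroach/issues/108157",
		Run: func(ctx context.Context, t test.Test, c cluster.Cluster) {
			// Benchmark body elided.
		},
	})
}
```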

Jira issue: CRDB-30335

Epic CRDB-26372

erikgrinaker added the C-bug, C-performance, and T-cdc labels Aug 4, 2023

blathers-crl bot commented Aug 4, 2023

cc @cockroachdb/cdc

erikgrinaker commented:

I see frequent changefeed restarts due to RPC connection failures between nodes, likely from overload or timeouts, although I haven't been able to pin them down. I'll try a run with a dedicated rangefeed RPC connection and with admission control disabled.
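
As a rough sketch, that run configuration could be applied via cluster settings from a small Go client. admission.kv.enabled is a real cluster setting; kv.rangefeed.use_dedicated_connection_class.enabled is assumed here to be the dedicated-connection knob, and the connection string is illustrative:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // Postgres-wire driver; CockroachDB speaks pgwire.
)

func main() {
	// Connect to any node in the benchmark cluster (address is illustrative).
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Disable KV admission control cluster-wide. Whether this is the right
	// lever for catch-up scans is exactly what this run is meant to test.
	if _, err := db.Exec(`SET CLUSTER SETTING admission.kv.enabled = false`); err != nil {
		log.Fatal(err)
	}

	// Route rangefeed traffic over a dedicated RPC connection class. The
	// setting name below is an assumption for illustration.
	if _, err := db.Exec(`SET CLUSTER SETTING kv.rangefeed.use_dedicated_connection_class.enabled = true`); err != nil {
		log.Fatal(err)
	}
}
```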

erikgrinaker commented:

Still fails with a dedicated rangefeed RPC class and catchup-scan admission control disabled.

In one example, we see a single node n2 lose network connectivity to all other nodes, failing with EOF:

W230804 09:41:12.906047 3328 kv/kvserver/closedts/sidetransport/sender.go:838 ⋮ [T1,n2,ctstream=4] 80040 failed to send closed timestamp message 15017 to n4: send msg error: ‹EOF›
W230804 09:41:12.906057 3327 kv/kvserver/closedts/sidetransport/sender.go:838 ⋮ [T1,n2,ctstream=3] 80041 failed to send closed timestamp message 15017 to n3: send msg error: ‹EOF›
W230804 09:41:12.906051 3329 kv/kvserver/closedts/sidetransport/sender.go:838 ⋮ [T1,n2,ctstream=5] 80042 failed to send closed timestamp message 15017 to n5: send msg error: ‹EOF›
W230804 09:41:12.906051 3326 kv/kvserver/closedts/sidetransport/sender.go:838 ⋮ [T1,n2,ctstream=1] 80043 failed to send closed timestamp message 15017 to n1: send msg error: ‹EOF›

This strikes me as suspect. I could have accepted it as a random infra fluke, but this keeps happening in every single benchmark run. I wonder if we're hitting gRPC limits or something.
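
To illustrate the kind of limit I mean: with the per-range (non-mux) rangefeed protocol, a 100k-range catch-up scan opens roughly one gRPC stream per range, so per-connection caps such as grpc-go's MaxConcurrentStreams are one plausible candidate. The snippet below is purely illustrative, not CockroachDB's actual server configuration:

```go
package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
)

func main() {
	// Listen on an ephemeral port; the address is illustrative only.
	lis, err := net.Listen("tcp", "localhost:0")
	if err != nil {
		log.Fatal(err)
	}
	defer lis.Close()

	// MaxConcurrentStreams caps concurrent HTTP/2 streams per connection.
	// With one rangefeed stream per range, 100k ranges spread over a handful
	// of node-to-node connections could brush up against a cap like this.
	// The value below is made up for illustration.
	srv := grpc.NewServer(
		grpc.MaxConcurrentStreams(2048),
	)
	defer srv.Stop()
	// Service registration and srv.Serve(lis) elided.
}
```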

erikgrinaker commented:

We see this across all connection classes, and the connections immediately recover afterwards:

E230804 09:41:12.834957 3137 2@rpc/peer.go:590 ⋮ [T1,n2,rnode=5,raddr=‹10.142.1.2:26257›,class=default,rpc] 855 disconnected (was healthy for 58m27.089s): grpc: ‹connection error: desc = "transport: error while dialing: connection interrupted (did the remote node shut down or are there networking issues?)"› [code 14/Unavailable]
E230804 09:41:12.838494 546 2@rpc/peer.go:590 ⋮ [T1,n2,rnode=1,raddr=‹10.142.0.2:26257›,class=default,rpc] 856 disconnected (was healthy for 58m44.327s): grpc: ‹connection error: desc = "transport: error while dialing: connection interrupted (did the remote node shut down or are there networking issues?)"› [code 14/Unavailable]
E230804 09:41:12.840960 6497915 2@rpc/peer.go:590 ⋮ [T1,n2,rnode=6,raddr=‹10.142.0.220:26257›,class=default,rpc] 857 disconnected (was healthy for 10m39.984s): grpc: ‹error reading from server: EOF› [code 14/Unavailable]
E230804 09:41:12.841478 306 2@rpc/peer.go:590 ⋮ [T1,n2,rnode=?,raddr=‹34.148.94.148:26257›,class=system,rpc] 858 disconnected (was healthy for 58m44.346s): grpc: ‹error reading from server: EOF› [code 14/Unavailable]
E230804 09:41:12.841620 2878 2@rpc/peer.go:590 ⋮ [T1,n2,rnode=4,raddr=‹10.142.0.227:26257›,class=system,rpc] 859 disconnected (was healthy for 58m32.721s): grpc: ‹error reading from server: EOF› [code 14/Unavailable]
E230804 09:41:12.841826 3081 2@rpc/peer.go:590 ⋮ [T1,n2,rnode=5,raddr=‹10.142.1.2:26257›,class=system,rpc] 860 disconnected (was healthy for 58m27.098s): grpc: ‹error reading from server: EOF› [code 14/Unavailable]
E230804 09:41:12.842237 2888 2@rpc/peer.go:590 ⋮ [T1,n2,rnode=4,raddr=‹10.142.0.227:26257›,class=default,rpc] 861 disconnected (was healthy for 58m32.721s): grpc: ‹error reading from server: EOF› [code 14/Unavailable]
E230804 09:41:12.843048 6424353 2@rpc/peer.go:590 ⋮ [T1,n2,rnode=6,raddr=‹10.142.0.220:26257›,class=system,rpc] 862 disconnected (was healthy for 10m48.009s): grpc: ‹error reading from server: EOF› [code 14/Unavailable]
E230804 09:41:12.845610 2904 2@rpc/peer.go:590 ⋮ [T1,n2,rnode=3,raddr=‹10.142.0.225:26257›,class=default,rpc] 863 disconnected (was healthy for 58m26.965s): grpc: ‹connection error: desc = "transport: error while dialing: connection interrupted (did the remote node shut down or are there networking issues?)"› [code 14/Unavailable]
E230804 09:41:12.849446 3505 2@rpc/peer.go:590 ⋮ [T1,n2,rnode=2,raddr=‹10.142.0.233:26257›,class=default,rpc] 864 disconnected (was healthy for 58m26.591s): grpc: ‹error reading from server: EOF› [code 14/Unavailable]
W230804 09:41:12.858687 257 2@gossip/gossip.go:1407 ⋮ [T1,n2] 865 no incoming or outgoing connections
E230804 09:41:12.863144 3087 2@rpc/peer.go:590 ⋮ [T1,n2,rnode=3,raddr=‹10.142.0.225:26257›,class=system,rpc] 866 disconnected (was healthy for 58m27.01s): grpc: ‹connection error: desc = "transport: error while dialing: connection interrupted (did the remote node shut down or are there networking issues?)"› [code 14/Unavailable]
E230804 09:41:12.865462 310 2@rpc/peer.go:590 ⋮ [T1,n2,rnode=1,raddr=‹10.142.0.2:26257›,class=system,rpc] 867 disconnected (was healthy for 58m44.369s): grpc: ‹connection error: desc = "transport: error while dialing: connection interrupted (did the remote node shut down or are there networking issues?)"› [code 14/Unavailable]
I230804 09:41:12.866803 3505 2@rpc/peer.go:537 ⋮ [T1,n2,rnode=2,raddr=‹10.142.0.233:26257›,class=default,rpc] 868 ‹connection is now healthy (after 0s)›
I230804 09:41:12.871237 310 2@rpc/peer.go:537 ⋮ [T1,n2,rnode=1,raddr=‹10.142.0.2:26257›,class=system,rpc] 869 ‹connection is now healthy (after 0s)›
I230804 09:41:12.876902 546 2@rpc/peer.go:537 ⋮ [T1,n2,rnode=1,raddr=‹10.142.0.2:26257›,class=default,rpc] 870 ‹connection is now healthy (after 0s)›
I230804 09:41:12.877221 2878 2@rpc/peer.go:537 ⋮ [T1,n2,rnode=4,raddr=‹10.142.0.227:26257›,class=system,rpc] 872 ‹connection is now healthy (after 0s)›
I230804 09:41:12.877217 2888 2@rpc/peer.go:537 ⋮ [T1,n2,rnode=4,raddr=‹10.142.0.227:26257›,class=default,rpc] 871 ‹connection is now healthy (after 0s)›
I230804 09:41:12.877235 3081 2@rpc/peer.go:537 ⋮ [T1,n2,rnode=5,raddr=‹10.142.1.2:26257›,class=system,rpc] 873 ‹connection is now healthy (after 0s)›
I230804 09:41:12.877246 6424353 2@rpc/peer.go:537 ⋮ [T1,n2,rnode=6,raddr=‹10.142.0.220:26257›,class=system,rpc] 874 ‹connection is now healthy (after 0s)›
I230804 09:41:12.896674 3137 2@rpc/peer.go:537 ⋮ [T1,n2,rnode=5,raddr=‹10.142.1.2:26257›,class=default,rpc] 875 ‹connection is now healthy (after 0s)›
I230804 09:41:12.897654 3087 2@rpc/peer.go:537 ⋮ [T1,n2,rnode=3,raddr=‹10.142.0.225:26257›,class=system,rpc] 876 ‹connection is now healthy (after 0s)›
I230804 09:41:12.898020 7309479 2@rpc/peer.go:537 ⋮ [T1,n2,rnode=?,raddr=‹10.142.0.225:26257›,class=system,rpc] 877 ‹connection is now healthy›
I230804 09:41:12.898189 2904 2@rpc/peer.go:537 ⋮ [T1,n2,rnode=3,raddr=‹10.142.0.225:26257›,class=default,rpc] 878 ‹connection is now healthy (after 0s)›
I230804 09:41:12.903133 6497915 2@rpc/peer.go:537 ⋮ [T1,n2,rnode=6,raddr=‹10.142.0.220:26257›,class=default,rpc] 879 ‹connection is now healthy (after 0s)›

miretskiy self-assigned this Aug 7, 2023
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Aug 24, 2023
Re-enable regular rangefeed catchup benchmark over 100k ranges.
Adjust cdc bench configuration to ensure the benchmark completes
in reasonable time.

Fixes cockroachdb#108157

Release note: None
erikgrinaker commented:

FWIW, the connection closures are an instance of #109317, caused by node overload (it's not yet clear why it affects the kernel to this extent).

Even with the connection closures addressed via a higher gRPC server timeout, the benchmark still fails to complete due to node overload.
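
For reference, "a higher gRPC server timeout" is assumed here to refer to grpc-go's server-side keepalive enforcement (keepalive.ServerParameters.Timeout). A minimal sketch with illustrative values, not the values actually used:

```go
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// Raising the keepalive Timeout means an overloaded peer that is slow to
	// ack pings is not torn down as aggressively. Values are illustrative.
	srv := grpc.NewServer(
		grpc.KeepaliveParams(keepalive.ServerParameters{
			Time:    10 * time.Second, // ping a connection after this much idle time
			Timeout: 60 * time.Second, // wait this long for a ping ack before closing
		}),
	)
	defer srv.Stop()
	// Service registration and Serve() elided.
}
```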

miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Sep 7, 2023
Re-enable regular rangefeed catchup benchmark over 100k ranges.
Adjust cdc bench configuration to ensure the benchmark completes
in reasonable time.

Fixes cockroachdb#108157

Release note: None
craig bot closed this as completed in 526100a Sep 11, 2023