changefeedccl: 100k range catchup scan benchmark fails to complete #108157
Comments
cc @cockroachdb/cdc
I see frequent changefeed restarts because of RPC connection failures between nodes, likely due to overload or timeouts, although I haven't been able to pin them down. Will try a run with a dedicated rangefeed RPC connection and with admission control disabled.
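For reference, the two mitigations above correspond roughly to CockroachDB knobs along these lines (a hedged sketch: the exact setting names are assumptions and vary by version — verify against `SHOW CLUSTER SETTINGS` on the build under test):

```shell
# Hedged sketch -- setting names are assumptions and differ across versions.
# Route rangefeed traffic over a dedicated RPC connection class:
cockroach sql -e "SET CLUSTER SETTING kv.rangefeed.use_dedicated_connection_class.enabled = true;"
# Disable admission control pacing of rangefeed catchup scans:
cockroach sql -e "SET CLUSTER SETTING kvadmission.rangefeed_catchup_scan_elastic_control.enabled = false;"
```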
Still fails with the dedicated rangefeed RPC class and no catchup admission control. In one example, we see a single node n2 lose network connectivity to all other nodes, failing with EOF.
This strikes me as suspect. I could have accepted it as a random infra fluke, but this keeps happening in every single benchmark run. I wonder if we're hitting gRPC limits or something.
We see this across all connection classes, and the connections immediately recover afterwards.
Re-enable regular rangefeed catchup benchmark over 100k ranges. Adjust cdc bench configuration to ensure the benchmark completes in reasonable time. Fixes cockroachdb#108157 Release note: None
FWIW, the connection closures are an instance of #109317, caused by node overload (not yet clear why it affects the kernel to this extent). Even once the connection closures are addressed with a higher gRPC server timeout, the benchmark still fails to complete due to node overload.
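As an illustration of the "higher gRPC server timeout" workaround: gRPC enforces keepalive deadlines on each connection, and stretching the server-side timeout gives an overloaded node more slack before its peers declare the connection dead. A hedged sketch — the environment variable name here is an assumption, not taken from this thread:

```shell
# Hedged sketch -- variable name is an assumption; consult the rpc package
# for the knobs actually exposed by the build under test.
export COCKROACH_NETWORK_TIMEOUT=4s   # widen the RPC heartbeat/keepalive deadline
cockroach start --join=...            # remaining flags elided
```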
The 100k range catchup scan benchmark added in #107722 (`cdc/scan/catchup/nodes=5/cpu=16/rows=1G/ranges=100K/protocol=rangefeed/format=json/sink=null`) fails to complete. It either fails with DistSQL inbox errors, or the changefeed restarts the catchup scans. We should find out why and fix it, even if it's only because the cluster can't handle the load. The test has been skipped for now.
Jira issue: CRDB-30335
Epic: CRDB-26372