changefeedccl: investigate and address changefeed failure due to unavailable replica #89663

amruss · 2022-10-10T15:57:46Z

Our telemetry changefeeds failed with the message:

replica unavailable: (n1,s1):3 unable to serve request to r170780:/Table/104/2/"\t\x{a5\xbc\xb6g\xeeKI\xa5F\x8eb\xfc9}\xab"/1/4/1918-04-19T12:43:26.240716999Z/"sql.misc.started.count"-b0Q\xbcG\x17Ng\xb0\xc8\x16 \xa8\xa7\xca`"/1/3/1918-09-14T16:35:54.879832999Z/"sql.plan.ops.cast.int::string"} [(n14,s14):4, (n3,s3):2, (n1,s1):3, next=5, gen=197]: lost quorum (down: (n14,s14):4,(n3,s3):2); closed timestamp: 1665220705.166109291,0 (2022-10-08 09:18:25); raft status: {"id":"3","term":160,"vote":"2","commit":6282066,"lead":"0","raftState":"StatePreCandidate","applied":6282066,"progress":{},"leadtransferee":"0"}: have been waiting 60.20s for slow proposal RequestLease [/Table/104/2/"\t\xa5\xbc\xb6g\xeeKI\xa5F\x8eb\xfc9}\xab"/1/4/1918-04-19T12:43:26.240716999Z/"sql.misc.started.count",/Min) | 9

We should investigate and address these failures.

See ticket for more info: https://cockroachdb.zendesk.com/agent/tickets/14297

Jira issue: CRDB-20362

Epic CRDB-11732

blathers-crl · 2022-10-10T15:57:50Z

cc @cockroachdb/cdc

amruss · 2022-10-12T16:03:41Z

Closing; the conversation will continue on the Zendesk ticket. The larger issue is that changefeeds should treat more errors as retryable by default.

88492: Roachtest redirect SSH flakes to test-eng r=tbg a=smg260

*See second commit note at the bottom*

This PR inspects the failure output of a roachtest, and if it sees an SSH_PROBLEM, overrides the owning team to test-eng when reporting the github issue.

Currently errors are classified as an `SSH` error by roachprod if the exit code is `255` with an accompanying message prefixed with `SSH_PROBLEM` [[1]](https://github.com/cockroachdb/cockroach/blob/ad3bd1355463cefdc07e995765fa82adfe391d05/pkg/roachprod/errors/errors.go#L112). The errors are stringified and saved into `t.mu.output|failureMsg`. Thus in the test_runner at the call site of issue posting, we can check `t.mu.output` for `SSH_PROBLEM` and override the team and issue name accordingly.

Resolves: #82398

Release justification: test-only change

Release note: none

89913: changefeedccl: job-level retry when error message is about draining r=[miretskiy] a=HonoreDB

See #https://github.com/cockroachlabs/support/issues/1839. The flow retryable error marker doesn't survive every path by which it can bubble up, so just look for the single word "draining" as false positives are much better than false negatives.

Fixes #89663

Release note (enterprise change): Fixed a bug that could cause changefeeds to fail during a rolling restart.

Co-authored-by: Miral Gadani

Co-authored-by: Aaron Zinger
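For readers who want a concrete picture of the approach described in 89913, here is a minimal sketch of a string-based retry check. It is not the code from the PR: the names `messageMentionsDraining` and `shouldRetryJob` are hypothetical, and the only behavior taken from the description is "look for the single word draining" in the error text, used as a fallback when a structured retryable marker may have been lost along the way.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// messageMentionsDraining reports whether msg contains the word "draining".
// Hypothetical helper; the substring match is deliberately loose, so it
// accepts false positives in exchange for fewer false negatives.
func messageMentionsDraining(msg string) bool {
	return strings.Contains(msg, "draining")
}

// shouldRetryJob is a hypothetical job-level check: retry when the error
// text mentions draining, regardless of how the error was wrapped.
func shouldRetryJob(err error) bool {
	if err == nil {
		return false
	}
	return messageMentionsDraining(err.Error())
}

func main() {
	// The wrapped error still carries "draining" in its message text.
	wrapped := fmt.Errorf("flow failed: %w", errors.New("node is draining"))
	fmt.Println(shouldRetryJob(wrapped)) // true

	// An unrelated error does not match and would not be retried by this check.
	fmt.Println(shouldRetryJob(errors.New("replica unavailable: lost quorum"))) // false
}
```

Matching on message text is brittle in general, but it survives paths where an error gets stringified and re-created, which is the situation the PR description calls out.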

amruss added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Oct 10, 2022

blathers-crl bot added the T-cdc label Oct 10, 2022

amruss closed this as completed Oct 12, 2022

HonoreDB reopened this Oct 13, 2022

HonoreDB mentioned this issue Oct 13, 2022

changefeedccl: job-level retry when error message is about draining #89913

Merged

amruss assigned HonoreDB Oct 19, 2022

craig bot closed this as completed in #89913 Oct 25, 2022

blathers-crl bot mentioned this issue Oct 25, 2022

release-22.2: changefeedccl: job-level retry when error message is about draining #90661

Merged

This was referenced Oct 26, 2022

release-22.2: changefeedccl: job-level retry when error message is about draining #90717

Closed

release-22.2.0: changefeedccl: job-level retry when error message is about draining #90718

Closed
