changefeedccl: transient schemafeed errors are not retried #63317
Comments
I honestly think that for sink-based changefeeds, we should consider switching the error handling model from "errors are fatal, unless marked as retryable" to "errors are retryable, unless marked fatal". We should also augment retries by addressing #62556 (do not retry forever without making progress).
I could get behind this, especially if we do a good job on #62556
Retryable errors are already reliably marked as such, see how the pgerror package detects them.
My feeling for defaulting to retrying for jobs on errors is that #44594 should be a prerequisite. Having a job spin on failure seems very bad.
Assuming it correctly marks all retriable errors, there is still work here on the consumer side to make sure the changefeed machinery looks for and responds to the correct retriable types.
Seems reasonable.
Prior to this PR, changefeeds would rely on a white list approach in order to determine which errors were retryable. All other errors would be deemed terminal, causing the changefeed to fail.

The above approach is brittle, and causes unwanted changefeed termination.

This PR changes this approach to treat all errors as retryable, unless otherwise indicated. Errors that are known by changefeed to be fatal are handled explicitly, by marking such errors as terminal. For example, changefeeds would exit if the targeted table is dropped. On the other hand, inability to read this table for any reason would not be treated as terminal.

Fixes cockroachdb#90320
Fixes cockroachdb#77549
Fixes cockroachdb#63317
Fixes cockroachdb#71341
Fixes cockroachdb#73016
Informs CRDB-6788

Release note (enterprise change): Changefeed will now treat all errors, unless otherwise indicated, as retryable errors.
90810: changefeedccl: Rework error handling r=miretskiy a=miretskiy

Prior to this PR, changefeeds would rely on a white list approach in order to determine which errors were retryable. All other errors would be deemed terminal, causing changefeed to fail.

The above approach is brittle, and causes unwanted changefeed termination.

This PR changes this approach to treat all errors as retryable, unless otherwise indicated. Errors that are known by changefeed to be fatal are handled explicitly, by marking such errors as terminal. For example, changefeeds would exit if the targeted table is dropped. On the other hand, inability to read this table for any reason would not be treated as terminal.

Fixes #90320
Fixes #77549
Fixes #63317
Fixes #71341
Fixes #73016
Informs CRDB-6788
Informs CRDB-7581

Release Note (enterprise change): Changefeed will now treat all errors, unless otherwise indicated, as retryable errors.

Co-authored-by: Yevgeniy Miretskiy
Describe the problem
In #63282 we saw that a clock drift issue led to one node being terminated. This then resulted in a changefeed failure because the node termination led to a failure to fetch a table descriptor during schema polling.
It appears that all errors from the schemafeed are treated as fatal, even though they might be caused by transient communication issues.
We may want to identify which of the many error returns in fetchDescriptorVersions may be retriable. It seems likely that at least this ought to be:
Note that pErr.GoError() will return an UnhandledRetriableError in some cases, but the rest of the changefeed code doesn't look for that particular type. And it isn't clear to me without further verification that that will cover all of the transient errors the function could return.

Jira issue: CRDB-6524