changefeedccl: add on_error option to pause changefeeds on failure #68176

Merged
merged 1 commit into from
Jul 30, 2021

Conversation

spiffyy99

Previously, changefeeds always failed when encountering a
non-retryable error. This option allows the user to pause the
changefeed on failure and resume it later; failing remains the
default behavior.

Release note (enterprise change): new 'on_error' option to pause
on non-retryable errors instead of failing.

@cockroach-teamcity

This change is Reviewable

@spiffyy99 spiffyy99 force-pushed the pause-job-on-fail branch 3 times, most recently from 6a5022c to 66bc601 Compare July 28, 2021 16:40
@spiffyy99 spiffyy99 requested a review from a team July 28, 2021 18:32
@spiffyy99 spiffyy99 added A-cdc Change Data Capture T-cdc labels Jul 28, 2021
@@ -665,8 +709,6 @@ func (b *changefeedResumer) Resume(ctx context.Context, execCtx interface{}) err
progress = reloadedJob.Progress()
}
}
// We only hit this if `r.Next()` returns false, which right now only happens
// on context cancellation.

outdated comment: we can hit this if the retry count is exhausted as well

@miretskiy miretskiy left a comment

Very nice!
Just a few minor nits/questions.

Reviewed 6 of 6 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @spiffyyeng)


pkg/ccl/changefeedccl/changefeed_stmt.go, line 555 at r1 (raw file):

		default:
			return jobspb.ChangefeedDetails{}, errors.Errorf(
				`unknown %s: %s`, opt, v)

I would maybe add a hint (errors.WithHint), or just expand the error message with "valid values are....."
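
A rough sketch of the second variant (expanding the message), using only the standard library. The helper name `validateOnError`, the `validOnError` slice, and the value strings `fail` and `pause` are assumptions made for this example; the PR's code uses the `changefeedbase` constants and the `errors` package, where `errors.WithHint` would be the other way to do this.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical constants mirroring the two on_error policies discussed
// in this review (assumption: the real values live in changefeedbase).
const (
	onErrorFail  = "fail"
	onErrorPause = "pause"
)

// validOnError lists the accepted values so the error message and the
// validation logic cannot drift apart.
var validOnError = []string{onErrorFail, onErrorPause}

// validateOnError returns nil for a known value, otherwise an error
// that names the option, the rejected value, and the valid values.
func validateOnError(opt, v string) error {
	for _, valid := range validOnError {
		if v == valid {
			return nil
		}
	}
	return fmt.Errorf("unknown %s: %s, valid values are: %s",
		opt, v, strings.Join(validOnError, ", "))
}

func main() {
	fmt.Println(validateOnError("on_error", "pause"))
	fmt.Println(validateOnError("on_error", "retry"))
}
```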


pkg/ccl/changefeedccl/changefeed_stmt.go, line 629 at r1 (raw file):

				return pauseErr
			}
			// TODO (ryan min): Populate pause reason with error once column is added (#67928)

We already have that. See running status (see changefeed_processors).
To update the run status and indicate that we're pausing because of this policy, as opposed to a user-initiated action, we will probably need to export the pauseRequested method (as well as onPauseRequestFunc) so that we can call b.job.PauseRequested directly and update the run status in the passed-in function.


pkg/ccl/changefeedccl/changefeed_stmt.go, line 633 at r1 (raw file):

		switch onError := changefeedbase.OnErrorType(details.Opts[changefeedbase.OptOnError]); onError {
		// default behavior
		case changefeedbase.OptOnErrorFail:
			return err
		// pause instead of failing
		case changefeedbase.OptOnErrorPause:
			// note: we only want the job to pause here if a failure happens, not a
			// user-initiated cancellation. if the job has been canceled, the ctx
			// will handle it and the pause will return an error.
			pauseErr := execCfg.JobRegistry.PauseRequested(ctx, jobExec.ExtendedEvalContext().Txn, jobID)
			if pauseErr != nil {
				return pauseErr
			}
			// TODO (ryan min): Populate pause reason with error once column is added (#67928)
		default:
			return errors.Errorf("unrecognized option value: %s=%s",
				changefeedbase.OptOnError, details.Opts[changefeedbase.OptOnError])
		}

I would move this into a helper, handleChangefeedError(), which returns an error (or nil).
So, this code becomes:

if err != nil {
   err = handleChangefeedError(....)
}
return err

pkg/ccl/changefeedccl/changefeed_test.go, line 4265 at r1 (raw file):

			// check for paused status on failure
			err := feedJob.WaitForStatus(func(s jobs.Status) bool { return s == jobs.StatusPaused })

require.NoError(t, feedJob.WaitForStatus...)?


pkg/ccl/changefeedccl/changefeed_test.go, line 4345 at r1 (raw file):

	t.Run(`cloudstorage`, cloudStorageTest(testFn))
	t.Run(`kafka`, kafkaTest(testFn))
	t.Run(`webhook`, webhookTest(testFn))

nice test.

@spiffyy99 spiffyy99 force-pushed the pause-job-on-fail branch from 66bc601 to 4c803b9 Compare July 29, 2021 21:11
@spiffyy99 spiffyy99 requested a review from a team July 29, 2021 21:11
@spiffyy99 spiffyy99 force-pushed the pause-job-on-fail branch from 4c803b9 to 43a342e Compare July 30, 2021 17:42
@spiffyy99 spiffyy99 requested a review from miretskiy July 30, 2021 17:44
@spiffyy99 spiffyy99 left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy and @spiffyyeng)


pkg/ccl/changefeedccl/changefeed_stmt.go, line 555 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

I would maybe add a hint (errors.WithHint), or just expand the error message with "valid values are....."

Done.


pkg/ccl/changefeedccl/changefeed_stmt.go, line 629 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

We already have that. See running status (see changefeed_processors).
To update the run status and indicate that we're pausing because of this policy, as opposed to a user-initiated action, we will probably need to export the pauseRequested method (as well as onPauseRequestFunc) so that we can call b.job.PauseRequested directly and update the run status in the passed-in function.

Discussed offline: the run status is updated internally in the progress column, but it is not visible in SHOW JOBS while the job is paused.


pkg/ccl/changefeedccl/changefeed_stmt.go, line 633 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…
		switch onError := changefeedbase.OnErrorType(details.Opts[changefeedbase.OptOnError]); onError {
		// default behavior
		case changefeedbase.OptOnErrorFail:
			return err
		// pause instead of failing
		case changefeedbase.OptOnErrorPause:
			// note: we only want the job to pause here if a failure happens, not a
			// user-initiated cancellation. if the job has been canceled, the ctx
			// will handle it and the pause will return an error.
			pauseErr := execCfg.JobRegistry.PauseRequested(ctx, jobExec.ExtendedEvalContext().Txn, jobID)
			if pauseErr != nil {
				return pauseErr
			}
			// TODO (ryan min): Populate pause reason with error once column is added (#67928)
		default:
			return errors.Errorf("unrecognized option value: %s=%s",
				changefeedbase.OptOnError, details.Opts[changefeedbase.OptOnError])
		}

I would move this into a helper, handleChangefeedError(), which returns an error (or nil).
So, this code becomes:

if err != nil {
   err = handleChangefeedError(....)
}
return err

Done.


pkg/ccl/changefeedccl/changefeed_test.go, line 4265 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

require.NoError(t, feedJob.WaitForStatus...)?

Done.

@miretskiy miretskiy left a comment

:lgtm_strong:

Reviewed 1 of 4 files at r3.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @miretskiy and @spiffyyeng)


pkg/ccl/changefeedccl/changefeed_stmt.go, line 646 at r3 (raw file):

			// directly update running status to avoid the running/reverted job status check
			progress.RunningStatus = fmt.Sprintf("paused on error: %v", changefeedErr)
			log.Warningf(ctx, progress.RunningStatus)

nit: Let's be more verbose here

log.Warningf(ctx, "job failed (%v) but is being paused because of on_error=pause", err)

@miretskiy miretskiy self-requested a review July 30, 2021 18:09
@spiffyy99 spiffyy99 force-pushed the pause-job-on-fail branch from 43a342e to 274c8ef Compare July 30, 2021 18:29
@spiffyy99 spiffyy99 left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @miretskiy and @spiffyyeng)


pkg/ccl/changefeedccl/changefeed_stmt.go, line 646 at r3 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

nit: Let's be more verbose here

log.Warningf(ctx, "job failed (%v) but is being paused because of on_error=pause", err)

Done.

spiffyy99 commented Jul 30, 2021

TFTR!

Previously, changefeeds always failed when encountering a
non-retryable error. This option allows the user to pause the
changefeed on failure and resume it later; failing remains the
default behavior.

Resolves cockroachdb#67965

Release note (enterprise change): new 'on_error' option to pause
on non-retryable errors instead of failing.
@spiffyy99 spiffyy99 force-pushed the pause-job-on-fail branch from 274c8ef to 602b1a7 Compare July 30, 2021 19:51
@spiffyy99

bors r+

craig bot commented Jul 30, 2021

Build succeeded:
