jobs: clear job claim after execution #91563
Conversation
I like this.
pkg/jobs/adopt.go
Outdated
// caller since the caller's context may have been canceled.
r.withSession(r.serverCtx, func(ctx context.Context, s sqlliveness.Session) {
	err := r.db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
		if err := txn.SetUserPriority(roachpb.MinUserPriority); err != nil {
Why this?
I was mostly following what we do in other lifecycle queries in this package (claimJobs and servePauseAndCancelRequests). However, those happen constantly on all nodes and so are more likely to cause repeated contention, whereas one hopes this is called relatively infrequently, so perhaps we can do without it.
Yeah, the other point queries in this package don't do this. The idea to use low priority for the scans was to not starve user queries or point operations.
pkg/jobs/adopt.go
Outdated
// We use the serverCtx here rather than the context from the
// caller since the caller's context may have been canceled.
r.withSession(r.serverCtx, func(ctx context.Context, s sqlliveness.Session) {
	err := r.db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
I think we'd be better off not using a txn loop if we can avoid it. This is a txn that could use parallel commits. I'd rather it did that, by passing a nil txn to the internal executor. If you want low priority, I think you may be able to do that through session data.
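For reference, a rough sketch of what that suggestion might look like, assuming the internal executor's Exec(ctx, opName, txn, stmt, args...) style of interface; the receiver fields, query text, and error handling below are illustrative, not the actual PR code:

// Passing a nil *kv.Txn lets the statement run in its own implicit
// transaction, which is eligible for the 1PC / parallel-commit fast path,
// instead of opening an explicit transaction via r.db.Txn(...).
r.withSession(r.serverCtx, func(ctx context.Context, s sqlliveness.Session) {
	_, err := r.ex.Exec(ctx, "clear-job-claim", nil /* txn */,
		`UPDATE system.jobs
		    SET claim_session_id = NULL, claim_instance_id = NULL
		  WHERE id = $1 AND claim_session_id = $2`,
		jobID, s.ID().UnsafeBytes(),
	)
	if err != nil {
		log.Warningf(ctx, "could not clear claim for job %d: %v", jobID, err)
	}
})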
I've updated this and decided to go forward without a low-priority query for now.
Do you happen to know if there is a good way to verify that the query is in fact using a parallel commit?
A KV trace would show you. Not exactly sure how to get that in this context. I think maybe you could install something into sessiondata and then the context?
Like, if you did:
root@localhost:26257/defaultdb> create table foo (i int primary key, j int, index(j));
CREATE TABLE
Time: 897ms total (execution 897ms / network 0ms)
root@localhost:26257/defaultdb> set tracing = kv;
SET TRACING
Time: 0ms total (execution 0ms / network 0ms)
root@localhost:26257/defaultdb> insert into foo values (5, 1);
INSERT 0 1
Time: 45ms total (execution 44ms / network 0ms)
root@localhost:26257/defaultdb> show kv trace for session;
timestamp | age | message | tag | location | operation | span
--------------------------------+-----------------+--------------------------------------------------+----------------------------------------------------+-----------------------------------------+--------------------------------------+-------
2022-11-10 15:35:54.82937+00 | 00:00:00.003108 | rows affected: 1 | [n1,client=127.0.0.1:54700,user=root] | sql/conn_executor_exec.go:723 | sql query | 2
2022-11-10 15:35:59.56418+00 | 00:00:04.737918 | CPut /Table/105/1/5/0 -> /TUPLE/2:2:Int/1 | [n1,client=127.0.0.1:54700,user=root] | sql/row/writer.go:215 | count | 18
2022-11-10 15:35:59.564228+00 | 00:00:04.737966 | InitPut /Table/105/2/1/5/0 -> /BYTES/ | [n1,client=127.0.0.1:54700,user=root] | sql/row/inserter.go:183 | count | 18
2022-11-10 15:35:59.564313+00 | 00:00:04.738051 | querying next range at /Table/105/1/5/0 | [n1,client=127.0.0.1:54700,user=root,txn=5ad5ea61] | kv/kvclient/kvcoord/range_iter.go:182 | dist sender send | 20
2022-11-10 15:35:59.564371+00 | 00:00:04.738109 | querying next range at /Table/105/2/1/5/0 | [n1,client=127.0.0.1:54700,user=root,txn=5ad5ea61] | kv/kvclient/kvcoord/range_iter.go:182 | dist sender send | 20
2022-11-10 15:35:59.564409+00 | 00:00:04.738147 | r58: sending batch 1 InitPut to (n1,s1):1 | [n1,client=127.0.0.1:54700,user=root,txn=5ad5ea61] | kv/kvclient/kvcoord/dist_sender.go:2048 | dist sender send | 20
2022-11-10 15:35:59.564476+00 | 00:00:04.738214 | r57: sending batch 1 CPut, 1 EndTxn to (n1,s1):1 | [n1,client=127.0.0.1:54700,user=root,txn=5ad5ea61] | kv/kvclient/kvcoord/dist_sender.go:2048 | kv.DistSender: sending partial batch | 21
2022-11-10 15:35:59.606589+00 | 00:00:04.780327 | fast path completed | [n1,client=127.0.0.1:54700,user=root] | sql/plan_node_to_row_source.go:176 | count | 18
2022-11-10 15:35:59.607296+00 | 00:00:04.781034 | rows affected: 1 | [n1,client=127.0.0.1:54700,user=root] | sql/conn_executor_exec.go:723 | sql query | 12
2022-11-10 15:35:59.610933+00 | 00:00:04.784671 | rows affected: 1 | [n1,client=127.0.0.1:54700,user=root] | sql/conn_executor_exec.go:723 | sql query | 27
(10 rows)
That's definitely a parallel commit: the partial batch carrying the EndTxn (1 CPut, 1 EndTxn) is sent in parallel with the 1 InitPut batch for the other range, rather than only after all of the writes have completed.
Force-pushed from b5149f1 to ebf9b97.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @stevendanna)
Previously, if the ALTER CHANGEFEED txn retried, we would increment these counters again. In cockroachdb#91563 we are making a change that makes it more likely that this transaction will retry during one of the tests, revealing the issue. Release note: None
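To illustrate the double-counting pattern that commit message describes, here is a minimal sketch; the counter and helper names (telemetryAlterChangefeed, alterChangefeedInTxn) are hypothetical, not the changefeed code:

// Before: incrementing a counter inside the transaction closure means every
// retry of the closure increments it again.
err := db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
	telemetry.Inc(telemetryAlterChangefeed) // runs once per attempt
	return alterChangefeedInTxn(ctx, txn)
})

// After: record the event only once the transaction has committed.
err = db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
	return alterChangefeedInTxn(ctx, txn)
})
if err == nil {
	telemetry.Inc(telemetryAlterChangefeed)
}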
The test failure is #91812
Force-pushed from ebf9b97 to deb0896.
bors r=ajwerner
Build succeeded.
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.

error creating merge commit from 2ba983d to blathers/backport-release-22.1-91563: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

You may need to manually resolve merge conflicts with the backport tool.
Backport to branch 22.1.x failed. See errors above.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
92005: jobs: deflake TestRegistryLifecycle r=ajwerner a=stevendanna

The explicit transactions in this test can hit transaction retry errors despite the test conditions all passing. Here, we wrap the transactions we intend to commit in a retry loop using client-side retries. It seems likely that #91563 made transaction retries more likely.

Fixes #92001

Release note: None

Co-authored-by: Steven Danna <[email protected]>
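As a sketch of the client-side retry pattern that commit describes, one option is the crdb.ExecuteTx helper from cockroach-go; the statement inside the closure is a placeholder, and the actual test may use a different retry mechanism:

import (
	"context"
	"database/sql"

	"github.com/cockroachdb/cockroach-go/v2/crdb"
)

// updateWithRetry runs the closure in a transaction and transparently retries
// it when the server returns a retryable error (SQLSTATE 40001), so assertions
// in the test only ever observe committed results.
func updateWithRetry(ctx context.Context, db *sql.DB) error {
	return crdb.ExecuteTx(ctx, db, nil /* txopts */, func(tx *sql.Tx) error {
		_, err := tx.ExecContext(ctx, `UPDATE t SET v = v + 1 WHERE k = $1`, 1)
		return err
	})
}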
Since #89014, the job system has reset a job's claim when transitioning it from pause-requested to paused and from cancel-requested to reverting. The job system signals these transitions to the running Resumer by cancelling the job's context, and it does not wait for the resumer to exit. Once the claim is cleared, another node can adopt the job and start running its OnFailOrCancel callback. As a result, clearing the claim at that point makes it more likely that OnFailOrCancel executions will overlap with Resume executions.
In general, jobs need to assume that Resume may still be running when OnFailOrCancel is called, but making that more likely isn't in our interest.
Here, we clear the lease only when we exit the job state machine. This makes it much more likely that OnFailOrCancel doesn't start until Resume has returned.
Epic: None
Release note: None
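A rough sketch of the ordering described above; the function names (runJob, stepThroughStateMachine, clearClaim) are illustrative rather than the actual registry code:

// runJob drives a single adopted job on this node.
func (r *Registry) runJob(ctx context.Context, job *Job, s sqlliveness.Session) error {
	// Run the job state machine (Resume, and OnFailOrCancel on
	// cancellation or failure) until it exits on this node.
	err := r.stepThroughStateMachine(ctx, job)

	// Only now, after the state machine has exited, clear the claim so that
	// another node may adopt the job. Before this change the claim was cleared
	// while transitioning to paused/reverting, so another node could begin
	// OnFailOrCancel while this node's Resume was still unwinding.
	r.clearClaim(ctx, job.ID(), s)
	return err
}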