jobs: clear job claim after execution #93556

@stevendanna stevendanna commented Dec 13, 2022

Since #89014 the job system has reset a job's claim when transitioning
it from pause-requested to paused and from cancel-requested to
reverting. The job system signals these transitions to the running
Resumer by cancelling the job's context and does not wait for the
resumer to exit. Once the claim is clear, another node can adopt the
job and start running its OnFailOrCancel callback. As a result,
clearing the claim at that point makes it more likely that
OnFailOrCancel executions will overlap with Resume executions.

In general, jobs need to assume that Resume may still be running while
OnFailOrCancel is called. But making that overlap more likely isn't in
our interest.

Here, we only clear the lease when we exit the job state machine.
This makes it much more likely that OnFailOrCancel doesn't start until
Resume has returned.
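
The ordering described above can be illustrated with a minimal Go sketch. It is not the jobs package's code: the `claim` type, `runStateMachine`, and `simulate` are hypothetical names introduced only for this example. It assumes the claim is cleared in a deferred call that runs once the resume function returns, so the claim is still held while a cancelled Resume is winding down.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// claim is a hypothetical stand-in for a job's claim record; it is not
// the jobs package's real type.
type claim struct {
	mu    sync.Mutex
	owner string // empty means unclaimed, so another node may adopt the job
}

func (c *claim) clear() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.owner = ""
}

func (c *claim) held() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.owner != ""
}

// runStateMachine clears the claim only after resume has returned,
// rather than at the moment the pause or cancel is requested.
func runStateMachine(ctx context.Context, c *claim, resume func(context.Context)) {
	defer c.clear()
	resume(ctx)
}

// simulate reports whether the claim was held while the cancelled resume
// function was still running, and whether it was held after exit.
func simulate() (heldWhileRunning, heldAfterExit bool) {
	c := &claim{owner: "node-1"}
	ctx, cancel := context.WithCancel(context.Background())
	release := make(chan struct{})
	done := make(chan struct{})
	go func() {
		defer close(done)
		runStateMachine(ctx, c, func(ctx context.Context) {
			<-ctx.Done() // cancellation observed
			<-release    // still cleaning up after cancellation
		})
	}()
	cancel() // models the pause or cancel request
	heldWhileRunning = c.held()
	close(release)
	<-done
	heldAfterExit = c.held()
	return heldWhileRunning, heldAfterExit
}

func main() {
	during, after := simulate()
	fmt.Println(during, after) // prints "true false"
}
```

Because the claim stays held until the resume function returns, a second node in this model would have nothing to adopt until that point.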

Release note: None

Epic: none

Backport:

Please see individual PRs for details.

/cc @cockroachdb/release

@stevendanna stevendanna requested a review from a team as a code owner December 13, 2022 23:30
@stevendanna stevendanna force-pushed the backport22.1-91563-91884-92005-92121 branch from 11baedf to 769fd70 Compare December 21, 2022 11:44
The job system clears the lease asynchronously after cancellation. This
lease-clearing transaction can cause a restart in the alter changefeed
transaction, which changes the feature counter counts the test
observes. Thus, we want to wait for the lease clear. However, the lease
clear isn't guaranteed to happen, so we only wait a few seconds for
it.
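
A bounded wait of this kind might look like the following Go sketch. The helper names `waitFor` and `demoWait`, and the timeouts used, are assumptions made for illustration; the actual test may use a different helper. The point is that the wait gives up after a deadline instead of failing when the condition never becomes true.

```go
package main

import (
	"fmt"
	"time"
)

// waitFor polls cond until it returns true or the timeout elapses.
// It reports whether cond became true; a false result is not an error,
// because the awaited event is not guaranteed to happen.
func waitFor(cond func() bool, timeout, interval time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for {
		if cond() {
			return true
		}
		if !time.Now().Before(deadline) {
			return false
		}
		time.Sleep(interval)
	}
}

// demoWait runs waitFor once with a condition that becomes true on the
// third poll and once with a condition that never becomes true.
func demoWait() (cleared, neverCleared bool) {
	polls := 0
	leaseCleared := func() bool { // hypothetical check for a cleared lease
		polls++
		return polls >= 3
	}
	cleared = waitFor(leaseCleared, 2*time.Second, 5*time.Millisecond)
	neverCleared = waitFor(func() bool { return false }, 30*time.Millisecond, 5*time.Millisecond)
	return cleared, neverCleared
}

func main() {
	fmt.Println(demoWait()) // prints "true false"
}
```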

Release note: None

Epic: none
The explicit transactions in this test can hit transaction retry
errors despite the test conditions all passing.

Here, we wrap the transactions we intend to commit in a retry loop
using client-side retries.
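
As a rough illustration of a client-side retry loop, here is a self-contained Go sketch. It does not use a real database: `errRetry` is a hypothetical stand-in for a retryable transaction error (CockroachDB reports these with SQLSTATE 40001), and `withRetry` and `demoRetry` are names invented for this example rather than the helper the test uses.

```go
package main

import (
	"errors"
	"fmt"
)

// errRetry is a hypothetical stand-in for a retryable transaction error.
var errRetry = errors.New("restart transaction")

// withRetry runs txn until it returns something other than errRetry or
// maxAttempts is reached. It returns the number of attempts made and the
// last error.
func withRetry(maxAttempts int, txn func() error) (int, error) {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = txn()
		if !errors.Is(err, errRetry) {
			return attempt, err // success or a non-retryable error
		}
	}
	return maxAttempts, err
}

// demoRetry simulates a transaction that hits two retry errors before
// it commits.
func demoRetry() (int, error) {
	failures := 2
	return withRetry(5, func() error {
		if failures > 0 {
			failures--
			return errRetry
		}
		return nil
	})
}

func main() {
	attempts, err := demoRetry()
	fmt.Println(attempts, err) // prints "3 <nil>"
}
```

Only the transactions that are expected to commit need this wrapper; a transaction the test expects to fail should not be retried.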

It seems likely that cockroachdb#91563 made transaction retries more likely.

Fixes cockroachdb#92001

Release note: None
Previously we only cleared the claim after the state machine returned
and only if the status wasn't pause-requested or
cancel-requested. This filter on status, however, was unnecessary.

The job may still be in the cancel-requested or pause-requested state
when we go to clear the claim because the transaction that resulted in
the canceled context may not have completed. But it is still fine to
clear the claim. There are two cases:

1) The transaction that cancelled us fails and we are thus
   still in the state cancel-requested or pause-requested with no
   claim. This is fine. The adoption loop will adopt the job and move
   the state to paused or reverting, just with no context to cancel.

2) The transaction succeeds and we are in paused or reverting without
   a claim set. Just as we wanted.

Here we remove the where clause to always clear the claim when we
return from the state machine.

In the case of (1), when processing the cancel-requested or
pause-requested state the second time, we may still want the claim
cleared. Here, we make sure it gets cleared even in the case where
there is no running job that actually needs to be canceled.
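
The difference between the filtered and unfiltered claim clearing can be modelled in a few lines of Go. This is a toy in-memory model, not the SQL the job system runs: the `job` struct and the functions `clearClaimFiltered`, `clearClaimAlways`, and `demoClaims` are hypothetical names used only to show the effect of dropping the status filter.

```go
package main

import "fmt"

// job is a toy model of a job row: its status and whether a claim is set.
type job struct {
	status  string
	claimed bool
}

// clearClaimFiltered models the previous behavior: the claim is left in
// place if the job is still pause-requested or cancel-requested.
func clearClaimFiltered(j *job) {
	if j.status == "pause-requested" || j.status == "cancel-requested" {
		return
	}
	j.claimed = false
}

// clearClaimAlways models the behavior described above: the claim is
// cleared whenever the state machine returns, whatever the status.
func clearClaimAlways(j *job) {
	j.claimed = false
}

// demoClaims applies both variants to a job that is still
// cancel-requested and reports whether each job remains claimed.
func demoClaims() (filteredStillClaimed, alwaysStillClaimed bool) {
	a := &job{status: "cancel-requested", claimed: true}
	b := &job{status: "cancel-requested", claimed: true}
	clearClaimFiltered(a)
	clearClaimAlways(b)
	return a.claimed, b.claimed
}

func main() {
	fmt.Println(demoClaims()) // prints "true false"
}
```

In this model the filtered variant leaves a cancel-requested job claimed after its state machine has exited, which is the situation the commit removes.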

Fixes cockroachdb#92112

Release note: None
@stevendanna stevendanna force-pushed the backport22.1-91563-91884-92005-92121 branch from 769fd70 to aa4f388 Compare January 3, 2023 09:53
@stevendanna stevendanna merged commit 8dfa17e into cockroachdb:release-22.1 Jan 3, 2023
@stevendanna stevendanna deleted the backport22.1-91563-91884-92005-92121 branch January 3, 2023 15:24