-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
jobs: clear job claim after execution #93556
Merged
stevendanna
merged 6 commits into
cockroachdb:release-22.1
from
stevendanna:backport22.1-91563-91884-92005-92121
Jan 3, 2023
Merged
jobs: clear job claim after execution #93556
stevendanna
merged 6 commits into
cockroachdb:release-22.1
from
stevendanna:backport22.1-91563-91884-92005-92121
Jan 3, 2023
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
miretskiy
approved these changes
Dec 14, 2022
stevendanna
force-pushed
the
backport22.1-91563-91884-92005-92121
branch
from
December 21, 2022 11:44
11baedf
to
769fd70
Compare
Since cockroachdb#89014 the job system reset a job's claim when transitioning it from pause-requested to paused and from cancel-requested to reverting. The job system signals these transitions to the running Resumer by cancelling the job's context and does not wait for the resumer to exit. Once the claim is clear, another node can adopt the job and start running it's OnFailOrCancel callback. As a result, clearing the context makes it more likely that OnFailOrCancel executions will overlap with Resume executions. In general, Jobs need to assume that Resume may still be running while OnFailOrCancel is called. But, making it more likely isn't in our interest. Here, we only clear the lease when we exit the job state machine. This makes it much more likely that OnFailOrCancel doesn't start until Resume has returned. Release note: None Epic: none
Release note: None
Release note: None
The job system clears the lease asyncronously after cancelation. This lease clearing transaction can cause a restart in the alter changefeed transaction, which will lead to different feature counter counts. Thus, we want to wait for the lease clear. However, the lease clear isn't guaranteed to happen, so we only wait a few seconds for it. Release note: None Epic: none
The explicit transactions in this test can hit transaction retry errors despite the test conditions all passing. Here, we wrap the transactions we intend to commit in a retry loop using client-side retries. It seems likely that cockroachdb#91563 made transaction retries more likely. Fixes cockroachdb#92001 Release note: None
Previously we only cleared the claim after the state machine returned and only if the status wasn't pause-requested or cancel-requested. This filter on status, however, was unnecessary. The job may still be in the cancel-requested or pause-requested state when we go to clear the claim because the transaction that resulted in the canceled context may not have completed. But, it is still fine to clear the claim. There are 1 of two cases: 1) Either the transaction that cancelled us fails and we are thus still in the state cancel-requested or paused-requested with no claim. This is fine. The adoption loop will adopt the job and move the state to paused or reverting, just with no context to cancel. 2) The transaction succeeds and we are in paused or reverting without a claim set. Just as we wanted. Here we remove the where clause to always clear the claim when we return from the state machine. In the case of (1), when processing the cancel-requested or paused-requested state the second time, we may still want the claim cleared. Here, we make sure it gets cleared even in the case where there is no running job that actually needs to be canceled. Fixes cockroachdb#92112 Release note: None
stevendanna
force-pushed
the
backport22.1-91563-91884-92005-92121
branch
from
January 3, 2023 09:53
769fd70
to
aa4f388
Compare
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Since #89014 the job system reset a job's claim when transitioning it
from pause-requested to paused and from cancel-requested to
reverting. The job system signals these transitions to the running
Resumer by cancelling the job's context and does not wait for the
resumer to exit. Once the claim is clear, another node can adopt the
job and start running it's OnFailOrCancel callback. As a result,
clearing the context makes it more likely that OnFailOrCancel
executions will overlap with Resume executions.
In general, Jobs need to assume that Resume may still be running while
OnFailOrCancel is called. But, making it more likely isn't in our
interest.
Here, we only clear the lease when we exit the job state machine.
This makes it much more likely that OnFailOrCancel doesn't start until
Resume has returned.
Release note: None
Epic: none
Backport:
Please see individual PRs for details.
/cc @cockroachdb/release