Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

jobs: Clear out claim info when pausing #89014

Merged
merged 1 commit into from
Sep 30, 2022
Merged

Conversation

miretskiy
Copy link
Contributor

Clear out job claim information when job is paused. Clearing out claim information is beneficial since it allows operator to pause/resume job if they want to try to move job coordinator to another node.

Addresses #82698

Release note: none

@miretskiy miretskiy requested a review from a team as a code owner September 29, 2022 17:53
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @miretskiy)

@miretskiy
Copy link
Contributor Author

Clear out job claim information when job is paused.
Clearing out claim information is beneficial since it allows
operator to pause/resume job if they want to try to move
job coordinator to another node.

Release note: none
@miretskiy
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Sep 30, 2022

Build succeeded:

@craig craig bot merged commit 3e03176 into cockroachdb:master Sep 30, 2022
stevendanna added a commit to stevendanna/cockroach that referenced this pull request Nov 9, 2022
Since cockroachdb#89014 the job system reset a job's claim when transitioning it
from pause-requested to paused and from cancel-requested to
reverting. The job system signals these transitions to the running
Resumer by cancelling the job's context and does not wait for the
resumer to exit. Once the claim is clear, another node can adopt the
job and start running it's OnFailOrCancel callback. As a result,
clearing the context makes it more likely that OnFailOrCancel
executions will overlap with Resume executions.

In general, Jobs need to assume that Resume may still be running while
OnFailOrCancel is called. But, making it more likely isn't in our
interest.

Here, we only clear the lease when we exit the job state machine.
This makes it much more likely that OnFailOrCancel doesn't start until
Resume has returned.

Release note: None
stevendanna added a commit to stevendanna/cockroach that referenced this pull request Nov 9, 2022
Since cockroachdb#89014 the job system reset a job's claim when transitioning it
from pause-requested to paused and from cancel-requested to
reverting. The job system signals these transitions to the running
Resumer by cancelling the job's context and does not wait for the
resumer to exit. Once the claim is clear, another node can adopt the
job and start running it's OnFailOrCancel callback. As a result,
clearing the context makes it more likely that OnFailOrCancel
executions will overlap with Resume executions.

In general, Jobs need to assume that Resume may still be running while
OnFailOrCancel is called. But, making it more likely isn't in our
interest.

Here, we only clear the lease when we exit the job state machine.
This makes it much more likely that OnFailOrCancel doesn't start until
Resume has returned.

Release note: None
stevendanna added a commit to stevendanna/cockroach that referenced this pull request Nov 10, 2022
Since cockroachdb#89014 the job system reset a job's claim when transitioning it
from pause-requested to paused and from cancel-requested to
reverting. The job system signals these transitions to the running
Resumer by cancelling the job's context and does not wait for the
resumer to exit. Once the claim is clear, another node can adopt the
job and start running it's OnFailOrCancel callback. As a result,
clearing the context makes it more likely that OnFailOrCancel
executions will overlap with Resume executions.

In general, Jobs need to assume that Resume may still be running while
OnFailOrCancel is called. But, making it more likely isn't in our
interest.

Here, we only clear the lease when we exit the job state machine.
This makes it much more likely that OnFailOrCancel doesn't start until
Resume has returned.

Release note: None
stevendanna added a commit to stevendanna/cockroach that referenced this pull request Nov 14, 2022
Since cockroachdb#89014 the job system reset a job's claim when transitioning it
from pause-requested to paused and from cancel-requested to
reverting. The job system signals these transitions to the running
Resumer by cancelling the job's context and does not wait for the
resumer to exit. Once the claim is clear, another node can adopt the
job and start running it's OnFailOrCancel callback. As a result,
clearing the context makes it more likely that OnFailOrCancel
executions will overlap with Resume executions.

In general, Jobs need to assume that Resume may still be running while
OnFailOrCancel is called. But, making it more likely isn't in our
interest.

Here, we only clear the lease when we exit the job state machine.
This makes it much more likely that OnFailOrCancel doesn't start until
Resume has returned.

Release note: None
craig bot pushed a commit that referenced this pull request Nov 14, 2022
91160: sql: do not print stack trace when logging if txn is not open r=yuzefovich a=yuzefovich

After executing each statement, that statement might be logged. If there
were any audit events, then we attempt to resolve the table names for
which the audit events have occurred. To do the resolution we're using
the current txn. Previously, if that txn has been aborted or committed,
it would result in a scary-looking stack trace added to the log, and
this commit fixes it.

Epic: None

Release note: None

91563: jobs: clear job claim after execution r=ajwerner a=stevendanna

Since #89014 the job system reset a job's claim when transitioning it from pause-requested to paused and from cancel-requested to reverting. The job system signals these transitions to the running Resumer by cancelling the job's context and does not wait for the resumer to exit. Once the claim is clear, another node can adopt the job and start running it's OnFailOrCancel callback. As a result, clearing the context makes it more likely that OnFailOrCancel executions will overlap with Resume executions.

In general, Jobs need to assume that Resume may still be running while OnFailOrCancel is called. But, making it more likely isn't in our interest.

Here, we only clear the lease when we exit the job state machine. This makes it much more likely that OnFailOrCancel doesn't start until Resume has returned.

Epic: None

Release note: None

91874: colflow: temporarily disable test assertion about closers r=yuzefovich a=yuzefovich

Informs: #91845.

Epic: None

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Steven Danna <[email protected]>
blathers-crl bot pushed a commit that referenced this pull request Nov 14, 2022
Since #89014 the job system reset a job's claim when transitioning it
from pause-requested to paused and from cancel-requested to
reverting. The job system signals these transitions to the running
Resumer by cancelling the job's context and does not wait for the
resumer to exit. Once the claim is clear, another node can adopt the
job and start running it's OnFailOrCancel callback. As a result,
clearing the context makes it more likely that OnFailOrCancel
executions will overlap with Resume executions.

In general, Jobs need to assume that Resume may still be running while
OnFailOrCancel is called. But, making it more likely isn't in our
interest.

Here, we only clear the lease when we exit the job state machine.
This makes it much more likely that OnFailOrCancel doesn't start until
Resume has returned.

Release note: None
stevendanna added a commit to stevendanna/cockroach that referenced this pull request Dec 12, 2022
Since cockroachdb#89014 the job system reset a job's claim when transitioning it
from pause-requested to paused and from cancel-requested to
reverting. The job system signals these transitions to the running
Resumer by cancelling the job's context and does not wait for the
resumer to exit. Once the claim is clear, another node can adopt the
job and start running it's OnFailOrCancel callback. As a result,
clearing the context makes it more likely that OnFailOrCancel
executions will overlap with Resume executions.

In general, Jobs need to assume that Resume may still be running while
OnFailOrCancel is called. But, making it more likely isn't in our
interest.

Here, we only clear the lease when we exit the job state machine.
This makes it much more likely that OnFailOrCancel doesn't start until
Resume has returned.

Release note: None
stevendanna added a commit to stevendanna/cockroach that referenced this pull request Jan 3, 2023
Since cockroachdb#89014 the job system reset a job's claim when transitioning it
from pause-requested to paused and from cancel-requested to
reverting. The job system signals these transitions to the running
Resumer by cancelling the job's context and does not wait for the
resumer to exit. Once the claim is clear, another node can adopt the
job and start running it's OnFailOrCancel callback. As a result,
clearing the context makes it more likely that OnFailOrCancel
executions will overlap with Resume executions.

In general, Jobs need to assume that Resume may still be running while
OnFailOrCancel is called. But, making it more likely isn't in our
interest.

Here, we only clear the lease when we exit the job state machine.
This makes it much more likely that OnFailOrCancel doesn't start until
Resume has returned.

Release note: None

Epic: none
stevendanna added a commit to stevendanna/cockroach that referenced this pull request Mar 30, 2023
Since cockroachdb#89014 the job system reset a job's claim when transitioning it
from pause-requested to paused and from cancel-requested to
reverting. The job system signals these transitions to the running
Resumer by cancelling the job's context and does not wait for the
resumer to exit. Once the claim is clear, another node can adopt the
job and start running it's OnFailOrCancel callback. As a result,
clearing the context makes it more likely that OnFailOrCancel
executions will overlap with Resume executions.

In general, Jobs need to assume that Resume may still be running while
OnFailOrCancel is called. But, making it more likely isn't in our
interest.

Here, we only clear the lease when we exit the job state machine.
This makes it much more likely that OnFailOrCancel doesn't start until
Resume has returned.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants