
jobs: clear job claim after execution #91563

Merged
merged 2 commits into cockroachdb:master from clear-job-claim-after-execution
Nov 14, 2022

Conversation

stevendanna
Collaborator

@stevendanna stevendanna commented Nov 9, 2022

Since #89014, the job system has reset a job's claim when transitioning it from pause-requested to paused and from cancel-requested to reverting. The job system signals these transitions to the running Resumer by cancelling the job's context and does not wait for the resumer to exit. Once the claim is clear, another node can adopt the job and start running its OnFailOrCancel callback. As a result, clearing the claim at this point makes it more likely that OnFailOrCancel executions will overlap with Resume executions.

In general, Jobs need to assume that Resume may still be running while OnFailOrCancel is called. But making it more likely isn't in our interest.

Here, we only clear the lease when we exit the job state machine. This makes it much more likely that OnFailOrCancel doesn't start until Resume has returned.

Epic: None

Release note: None
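
For illustration only, here is a minimal, self-contained Go sketch of the idea; the names below (job, runJob, clearClaim) are hypothetical stand-ins, not the actual jobs registry code. The claim is released in a defer, i.e. only once the whole state machine has exited, rather than inside the pause-requested or cancel-requested transition itself.

package main

import "fmt"

type status string

const (
	running        status = "running"
	pauseRequested status = "pause-requested"
	paused         status = "paused"
)

// job is a stand-in for a claimed job record.
type job struct {
	id      int
	claimed bool
	status  status
}

// clearClaim releases the claim so another node may adopt the job and run
// OnFailOrCancel.
func (j *job) clearClaim() {
	j.claimed = false
	fmt.Printf("job %d: claim cleared\n", j.id)
}

// runJob drives the job's state machine. The deferred clearClaim is the only
// place this node releases the claim, so adoption by another node cannot
// begin until the state machine (and hence Resume) has fully unwound here.
func runJob(j *job, transitions []status) {
	defer j.clearClaim()
	for _, next := range transitions {
		j.status = next
		fmt.Printf("job %d: now %s\n", j.id, j.status)
	}
}

func main() {
	j := &job{id: 1, claimed: true, status: running}
	runJob(j, []status{pauseRequested, paused})
}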

@stevendanna stevendanna requested a review from a team as a code owner November 9, 2022 00:38
@cockroach-teamcity
Member

This change is Reviewable

Contributor

@ajwerner ajwerner left a comment

I like this.

// We use the serverCtx here rather than the context from the
// caller since the caller's context may have been canceled.
r.withSession(r.serverCtx, func(ctx context.Context, s sqlliveness.Session) {
	err := r.db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
		if err := txn.SetUserPriority(roachpb.MinUserPriority); err != nil {
Contributor

Why this?

Collaborator Author

I was mostly following what we do in other lifecycle queries in this package (claimJobs and servePauseAndCancelRequests). However, those happen constantly on all nodes, so they are more likely to cause repeated contention, whereas one hopes this is called relatively infrequently, so perhaps we can do without it.

Contributor

Yeah, the other point queries in this package don't do this. The idea to use low priority for the scans was to not starve user queries or point operations.

// We use the serverCtx here rather than the context from the
// caller since the caller's context may have been canceled.
r.withSession(r.serverCtx, func(ctx context.Context, s sqlliveness.Session) {
	err := r.db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
Contributor

I think we'd be better off not using a txn loop if we can avoid it. This is a txn that can use parallel commits, and I'd rather it did that by passing a nil txn to the internal executor. If you want low priority, I think you may be able to do that through session data.
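
To make the trade-off concrete, here is a hedged, self-contained sketch using database/sql rather than the internal executor the comment refers to; the table and column names are illustrative. A single statement issued without an explicit transaction runs as an implicit transaction, so the commit can ride in the same batch as the write (the 1PC / parallel-commit fast path) and retryable errors can be handled server-side, whereas an explicit transaction pays a separate commit round trip and pushes retry handling onto the caller.

package claimsketch

import (
	"context"
	"database/sql"
)

// clearClaimExplicitTxn wraps the UPDATE in an explicit transaction: the
// COMMIT is a separate round trip, and retry errors surface to the caller.
func clearClaimExplicitTxn(ctx context.Context, db *sql.DB, id int64) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	if _, err := tx.ExecContext(ctx,
		`UPDATE jobs SET claim_session_id = NULL WHERE id = $1`, id); err != nil {
		_ = tx.Rollback()
		return err
	}
	return tx.Commit()
}

// clearClaimImplicitTxn issues the UPDATE as a single implicit-transaction
// statement, which is eligible for the 1PC / parallel-commit fast path and
// for automatic server-side retries.
func clearClaimImplicitTxn(ctx context.Context, db *sql.DB, id int64) error {
	_, err := db.ExecContext(ctx,
		`UPDATE jobs SET claim_session_id = NULL WHERE id = $1`, id)
	return err
}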

Collaborator Author

I've updated this and decided to go forward without a low-priority query for now.

Do you happen to know if there is a good way to verify that the query is in fact using parallel commits?

Contributor

A KV trace would show you. Not exactly sure how to get that in this context. I think maybe you could install something into sessiondata and then the context?

Like, if you did:

root@localhost:26257/defaultdb> create table foo (i int primary key, j int, index(j));
CREATE TABLE


Time: 897ms total (execution 897ms / network 0ms)

root@localhost:26257/defaultdb> set tracing = kv;
SET TRACING


Time: 0ms total (execution 0ms / network 0ms)

root@localhost:26257/defaultdb> insert into foo values (5, 1);
INSERT 0 1


Time: 45ms total (execution 44ms / network 0ms)

root@localhost:26257/defaultdb> show kv  trace for session;
            timestamp           |       age       |                     message                      |                        tag                         |                location                 |              operation               | span
--------------------------------+-----------------+--------------------------------------------------+----------------------------------------------------+-----------------------------------------+--------------------------------------+-------
  2022-11-10 15:35:54.82937+00  | 00:00:00.003108 | rows affected: 1                                 | [n1,client=127.0.0.1:54700,user=root]              | sql/conn_executor_exec.go:723           | sql query                            |    2
  2022-11-10 15:35:59.56418+00  | 00:00:04.737918 | CPut /Table/105/1/5/0 -> /TUPLE/2:2:Int/1        | [n1,client=127.0.0.1:54700,user=root]              | sql/row/writer.go:215                   | count                                |   18
  2022-11-10 15:35:59.564228+00 | 00:00:04.737966 | InitPut /Table/105/2/1/5/0 -> /BYTES/            | [n1,client=127.0.0.1:54700,user=root]              | sql/row/inserter.go:183                 | count                                |   18
  2022-11-10 15:35:59.564313+00 | 00:00:04.738051 | querying next range at /Table/105/1/5/0          | [n1,client=127.0.0.1:54700,user=root,txn=5ad5ea61] | kv/kvclient/kvcoord/range_iter.go:182   | dist sender send                     |   20
  2022-11-10 15:35:59.564371+00 | 00:00:04.738109 | querying next range at /Table/105/2/1/5/0        | [n1,client=127.0.0.1:54700,user=root,txn=5ad5ea61] | kv/kvclient/kvcoord/range_iter.go:182   | dist sender send                     |   20
  2022-11-10 15:35:59.564409+00 | 00:00:04.738147 | r58: sending batch 1 InitPut to (n1,s1):1        | [n1,client=127.0.0.1:54700,user=root,txn=5ad5ea61] | kv/kvclient/kvcoord/dist_sender.go:2048 | dist sender send                     |   20
  2022-11-10 15:35:59.564476+00 | 00:00:04.738214 | r57: sending batch 1 CPut, 1 EndTxn to (n1,s1):1 | [n1,client=127.0.0.1:54700,user=root,txn=5ad5ea61] | kv/kvclient/kvcoord/dist_sender.go:2048 | kv.DistSender: sending partial batch |   21
  2022-11-10 15:35:59.606589+00 | 00:00:04.780327 | fast path completed                              | [n1,client=127.0.0.1:54700,user=root]              | sql/plan_node_to_row_source.go:176      | count                                |   18
  2022-11-10 15:35:59.607296+00 | 00:00:04.781034 | rows affected: 1                                 | [n1,client=127.0.0.1:54700,user=root]              | sql/conn_executor_exec.go:723           | sql query                            |   12
  2022-11-10 15:35:59.610933+00 | 00:00:04.784671 | rows affected: 1                                 | [n1,client=127.0.0.1:54700,user=root]              | sql/conn_executor_exec.go:723           | sql query                            |   27
(10 rows)

That's definitely a parallel commit: the EndTxn is sent in the same partial batch as the CPut, in parallel with the InitPut to the other range, rather than in a separate round trip after all the writes have succeeded.

Contributor

@ajwerner ajwerner left a comment

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @stevendanna)

stevendanna added a commit to stevendanna/cockroach that referenced this pull request Nov 13, 2022
Previously, if the ALTER CHANGEFEED txn retried, we would increment
these counters again.

In cockroachdb#91563 we are making a change that makes it more likely that this
transaction will retry during one of the tests, revealing the issue.

Release note: None
@stevendanna
Collaborator Author

The test failure is #91812

stevendanna added a commit to stevendanna/cockroach that referenced this pull request Nov 13, 2022
Previously, if the ALTER CHANGEFEED txn retried, we would increment
these counters again.

In cockroachdb#91563 we are making a change that makes it more likely that this
transaction will retry during one of the tests, revealing the issue.

Release note: None
Since cockroachdb#89014 the job system has reset a job's claim when
transitioning it from pause-requested to paused and from
cancel-requested to reverting. The job system signals these
transitions to the running Resumer by cancelling the job's context and
does not wait for the resumer to exit. Once the claim is clear,
another node can adopt the job and start running its OnFailOrCancel
callback. As a result, clearing the claim at this point makes it more
likely that OnFailOrCancel executions will overlap with Resume
executions.

In general, Jobs need to assume that Resume may still be running while
OnFailOrCancel is called. But making it more likely isn't in our
interest.

Here, we only clear the lease when we exit the job state machine.
This makes it much more likely that OnFailOrCancel doesn't start until
Resume has returned.

Release note: None
@stevendanna stevendanna force-pushed the clear-job-claim-after-execution branch from ebf9b97 to deb0896 on November 14, 2022 20:40
@stevendanna stevendanna requested a review from a team as a code owner November 14, 2022 20:40
@stevendanna stevendanna requested review from shermanCRL and removed request for a team and shermanCRL November 14, 2022 20:40
@stevendanna
Collaborator Author

bors r=ajwerner

@craig
Contributor

craig bot commented Nov 14, 2022

Build succeeded:

@craig craig bot merged commit a8b0cd9 into cockroachdb:master Nov 14, 2022
@blathers-crl

blathers-crl bot commented Nov 14, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 2ba983d to blathers/backport-release-22.1-91563: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

stevendanna added a commit to stevendanna/cockroach that referenced this pull request Nov 16, 2022
The explicit transactions in this test can hit transaction retry
errors despite the test conditions all passing.

Here, we wrap the transactions we intend to commit in a retry loop
using client-side retries.

It seems likely that cockroachdb#91563 made transaction retries more likely.

Fixes cockroachdb#92001

Release note: None
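
For reference, a minimal sketch of the client-side retry pattern described above, using the cockroach-go crdb helper; the actual test may use a different wrapper, and the statement inside the closure is purely illustrative.

package retrysketch

import (
	"context"
	"database/sql"

	"github.com/cockroachdb/cockroach-go/v2/crdb"
)

// commitWithRetry wraps an explicit transaction in crdb.ExecuteTx, which
// re-runs the closure whenever CockroachDB returns a retryable error
// (SQLSTATE 40001), so an occasional retry doesn't fail the caller.
func commitWithRetry(ctx context.Context, db *sql.DB) error {
	return crdb.ExecuteTx(ctx, db, nil /* txOpts */, func(tx *sql.Tx) error {
		// Each attempt runs in a fresh transaction, so the body may execute
		// more than once before the final commit succeeds.
		_, err := tx.ExecContext(ctx, `UPDATE t SET v = v + 1 WHERE k = 1`)
		return err
	})
}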
craig bot pushed a commit that referenced this pull request Nov 16, 2022
92005: jobs: deflake TestRegistryLifecycle r=ajwerner a=stevendanna

The explicit transactions in this test can hit transaction retry errors despite the test conditions all passing.

Here, we wrap the transactions we intend to commit in a retry loop using client-side retries.

It seems likely that #91563 made transaction retries more likely.

Fixes #92001

Release note: None

Co-authored-by: Steven Danna <[email protected]>
stevendanna added a commit to stevendanna/cockroach that referenced this pull request Dec 12, 2022
The explicit transactions in this test can hit transaction retry
errors despite the test conditions all passing.

Here, we wrap the transactions we intend to commit in a retry loop
using client-side retries.

It seems likely that cockroachdb#91563 made transaction retries more likely.

Fixes cockroachdb#92001

Release note: None
stevendanna added a commit to stevendanna/cockroach that referenced this pull request Jan 3, 2023
The explicit transactions in this test can hit transaction retry
errors despite the test conditions all passing.

Here, we wrap the transactions we intend to commit in a retry loop
using client-side retries.

It seems likely that cockroachdb#91563 made transaction retries more likely.

Fixes cockroachdb#92001

Release note: None