jobs: clear job claim after execution #91563
Conversation
I like this.
pkg/jobs/adopt.go
Outdated
// caller since the caller's context may have been canceled.
r.withSession(r.serverCtx, func(ctx context.Context, s sqlliveness.Session) {
	err := r.db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
		if err := txn.SetUserPriority(roachpb.MinUserPriority); err != nil {
Why this?
I was mostly following what we do in other lifecycle queries in this package (claimJobs and servePauseAndCancelRequests). However, those happen constantly on all nodes and so are more likely to cause repeated contention, whereas one hopes this is called relatively infrequently, so perhaps we can do without it.
Yeah, the other point queries in this package don't do this. The idea to use low priority for the scans was to not starve user queries or point operations.
pkg/jobs/adopt.go
Outdated
// We use the serverCtx here rather than the context from the
// caller since the caller's context may have been canceled.
r.withSession(r.serverCtx, func(ctx context.Context, s sqlliveness.Session) {
	err := r.db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
I think we'd be better off not using a txn loop if we can avoid it. This is a txn that could use parallel commits. I'd rather it did that, by passing a nil txn to the internal executor. If you want low priority, I think you may be able to do that through session data.
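For reference, a rough sketch of what that suggestion might look like, assuming the internal executor's Exec(ctx, opName, txn, stmt, args...) style of interface; the receiver fields, query text, and error handling below are illustrative, not the actual PR code:

// Passing a nil *kv.Txn lets the statement run in its own implicit
// transaction, which is eligible for the 1PC / parallel-commit fast path,
// instead of opening an explicit transaction via r.db.Txn(...).
r.withSession(r.serverCtx, func(ctx context.Context, s sqlliveness.Session) {
	_, err := r.ex.Exec(ctx, "clear-job-claim", nil /* txn */,
		`UPDATE system.jobs
		    SET claim_session_id = NULL, claim_instance_id = NULL
		  WHERE id = $1 AND claim_session_id = $2`,
		jobID, s.ID().UnsafeBytes(),
	)
	if err != nil {
		log.Warningf(ctx, "could not clear claim for job %d: %v", jobID, err)
	}
})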
I've updated this and decided to go forward without a low-priority query for now.
Do you happen to know if there is a good way to verify that the query is in fact using a parallel commit?
A KV trace would show you. Not exactly sure how to get that in this context. I think maybe you could install something into sessiondata and then the context?
Like, if you did:
root@localhost:26257/defaultdb> create table foo (i int primary key, j int, index(j));
CREATE TABLE
Time: 897ms total (execution 897ms / network 0ms)
root@localhost:26257/defaultdb> set tracing = kv;
SET TRACING
Time: 0ms total (execution 0ms / network 0ms)
root@localhost:26257/defaultdb> insert into foo values (5, 1);
INSERT 0 1
Time: 45ms total (execution 44ms / network 0ms)
root@localhost:26257/defaultdb> show kv trace for session;
timestamp | age | message | tag | location | operation | span
--------------------------------+-----------------+--------------------------------------------------+----------------------------------------------------+-----------------------------------------+--------------------------------------+-------
2022-11-10 15:35:54.82937+00 | 00:00:00.003108 | rows affected: 1 | [n1,client=127.0.0.1:54700,user=root] | sql/conn_executor_exec.go:723 | sql query | 2
2022-11-10 15:35:59.56418+00 | 00:00:04.737918 | CPut /Table/105/1/5/0 -> /TUPLE/2:2:Int/1 | [n1,client=127.0.0.1:54700,user=root] | sql/row/writer.go:215 | count | 18
2022-11-10 15:35:59.564228+00 | 00:00:04.737966 | InitPut /Table/105/2/1/5/0 -> /BYTES/ | [n1,client=127.0.0.1:54700,user=root] | sql/row/inserter.go:183 | count | 18
2022-11-10 15:35:59.564313+00 | 00:00:04.738051 | querying next range at /Table/105/1/5/0 | [n1,client=127.0.0.1:54700,user=root,txn=5ad5ea61] | kv/kvclient/kvcoord/range_iter.go:182 | dist sender send | 20
2022-11-10 15:35:59.564371+00 | 00:00:04.738109 | querying next range at /Table/105/2/1/5/0 | [n1,client=127.0.0.1:54700,user=root,txn=5ad5ea61] | kv/kvclient/kvcoord/range_iter.go:182 | dist sender send | 20
2022-11-10 15:35:59.564409+00 | 00:00:04.738147 | r58: sending batch 1 InitPut to (n1,s1):1 | [n1,client=127.0.0.1:54700,user=root,txn=5ad5ea61] | kv/kvclient/kvcoord/dist_sender.go:2048 | dist sender send | 20
2022-11-10 15:35:59.564476+00 | 00:00:04.738214 | r57: sending batch 1 CPut, 1 EndTxn to (n1,s1):1 | [n1,client=127.0.0.1:54700,user=root,txn=5ad5ea61] | kv/kvclient/kvcoord/dist_sender.go:2048 | kv.DistSender: sending partial batch | 21
2022-11-10 15:35:59.606589+00 | 00:00:04.780327 | fast path completed | [n1,client=127.0.0.1:54700,user=root] | sql/plan_node_to_row_source.go:176 | count | 18
2022-11-10 15:35:59.607296+00 | 00:00:04.781034 | rows affected: 1 | [n1,client=127.0.0.1:54700,user=root] | sql/conn_executor_exec.go:723 | sql query | 12
2022-11-10 15:35:59.610933+00 | 00:00:04.784671 | rows affected: 1 | [n1,client=127.0.0.1:54700,user=root] | sql/conn_executor_exec.go:723 | sql query | 27
(10 rows)
That's definitely a parallel commit: the partial batch carrying the EndTxn (1 CPut, 1 EndTxn) is sent in parallel with the 1 InitPut batch for the other range, rather than only after all of the writes have completed.
Force-pushed from b5149f1 to ebf9b97.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @stevendanna)
Previously, if the ALTER CHANGEFEED txn retried, we would increment these counters again. In cockroachdb#91563 we are making a change that makes it more likely that this transaction will retry during one of the tests, revealing the issue. Release note: None
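To illustrate the double-counting pattern that commit message describes, here is a minimal sketch; the counter and helper names (telemetryAlterChangefeed, alterChangefeedInTxn) are hypothetical, not the changefeed code:

// Before: incrementing a counter inside the transaction closure means every
// retry of the closure increments it again.
err := db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
	telemetry.Inc(telemetryAlterChangefeed) // runs once per attempt
	return alterChangefeedInTxn(ctx, txn)
})

// After: record the event only once the transaction has committed.
err = db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
	return alterChangefeedInTxn(ctx, txn)
})
if err == nil {
	telemetry.Inc(telemetryAlterChangefeed)
}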
The test failure is #91812
Force-pushed from ebf9b97 to deb0896.
bors r=ajwerner
Build succeeded.
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.

error creating merge commit from 2ba983d to blathers/backport-release-22.1-91563: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

You may need to manually resolve merge conflicts with the backport tool.
Backport to branch 22.1.x failed. See errors above.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
92005: jobs: deflake TestRegistryLifecycle r=ajwerner a=stevendanna

The explicit transactions in this test can hit transaction retry errors despite the test conditions all passing. Here, we wrap the transactions we intend to commit in a retry loop using client-side retries. It seems likely that #91563 made transaction retries more likely.

Fixes #92001

Release note: None

Co-authored-by: Steven Danna <[email protected]>
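As a sketch of the client-side retry pattern that commit describes, one option is the crdb.ExecuteTx helper from cockroach-go; the statement inside the closure is a placeholder, and the actual test may use a different retry mechanism:

import (
	"context"
	"database/sql"

	"github.com/cockroachdb/cockroach-go/v2/crdb"
)

// updateWithRetry runs the closure in a transaction and transparently retries
// it when the server returns a retryable error (SQLSTATE 40001), so assertions
// in the test only ever observe committed results.
func updateWithRetry(ctx context.Context, db *sql.DB) error {
	return crdb.ExecuteTx(ctx, db, nil /* txopts */, func(tx *sql.Tx) error {
		_, err := tx.ExecContext(ctx, `UPDATE t SET v = v + 1 WHERE k = $1`, 1)
		return err
	})
}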
Since #89014, the job system has reset a job's claim when transitioning it from pause-requested to paused and from cancel-requested to reverting. The job system signals these transitions to the running Resumer by cancelling the job's context, and it does not wait for the resumer to exit. Once the claim is cleared, another node can adopt the job and start running its OnFailOrCancel callback. As a result, clearing the claim at that point makes it more likely that OnFailOrCancel executions will overlap with Resume executions.
In general, jobs need to assume that Resume may still be running when OnFailOrCancel is called, but making that more likely isn't in our interest.
Here, we clear the lease only when we exit the job state machine. This makes it much more likely that OnFailOrCancel doesn't start until Resume has returned.
Epic: None
Release note: None
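A rough sketch of the ordering described above; the function names (runJob, stepThroughStateMachine, clearClaim) are illustrative rather than the actual registry code:

// runJob drives a single adopted job on this node.
func (r *Registry) runJob(ctx context.Context, job *Job, s sqlliveness.Session) error {
	// Run the job state machine (Resume, and OnFailOrCancel on
	// cancellation or failure) until it exits on this node.
	err := r.stepThroughStateMachine(ctx, job)

	// Only now, after the state machine has exited, clear the claim so that
	// another node may adopt the job. Before this change the claim was cleared
	// while transitioning to paused/reverting, so another node could begin
	// OnFailOrCancel while this node's Resume was still unwinding.
	r.clearClaim(ctx, job.ID(), s)
	return err
}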