
release-21.1: jobs: remove FOR UPDATE clause when updating job #68244

Merged
merged 1 commit into release-21.1 from blathers/backport-release-21.1-67660 on Aug 10, 2021

Conversation


@blathers-crl blathers-crl bot commented Jul 29, 2021

Backport 1/1 commits from #67660 on behalf of @ajwerner.

/cc @cockroachdb/release


In cockroachdb currently, the FOR UPDATE lock is an exclusive lock. That
means that both clients trying to inspect jobs and the job adoption loops will
try to scan the table and encounter these locks. For the most part, we
don't really update the job from the leaves of a distsql flow. There is an
exception, which is IMPORT incrementing a sequence. Nevertheless, the retry
behavior there seems sound. The other exception is pausing or canceling jobs.
I think that in that case we prefer to invalidate the work of the transaction,
as our intention is to cancel it.

If cockroach implemented UPGRADE locks (#49684), then this FOR UPDATE would
not be a problem.

Release note (performance improvement): Jobs no longer hold exclusive locks
for the duration of their checkpointing transactions, which could result in
long wait times when running SHOW JOBS.


Release justification:

In cockroachdb currently, the `FOR UPDATE` lock is an exclusive lock. That
means that both clients trying to inspect jobs and the job adoption loops will
try to scan the table and encounter these locks. For the most part, we
don't really update the job from the leaves of a distsql flow.

There is an exception, which is IMPORT incrementing a sequence. In that case,
which motivated the initial locking addition, we'll leave the locking in place.

The other exception is pausing or canceling jobs. I think that in that case
we prefer to invalidate the work of the transaction as our intention is to
cancel it.

If cockroach implemented UPGRADE locks (#49684), then this FOR UPDATE would
not be a problem.

Release note (performance improvement): Jobs no longer hold exclusive locks
for the duration of their checkpointing transactions, which could result in
long wait times when running SHOW JOBS.
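
To illustrate the shape of the change, here is a minimal sketch of the kind of
read-modify-write a job update performs inside a transaction. This is not the
actual registry code; the query text, column set, and helper function are
simplified assumptions. The backport drops the `FOR UPDATE` clause from the
read, so conflicting writers fall back to normal transaction retries instead
of an exclusive row lock.

```go
package jobsupdate

import (
	"context"
	"database/sql"
)

// updateJobProgress sketches the read-modify-write that a job update performs
// inside a transaction. Before this change the read ended in FOR UPDATE,
// taking an exclusive lock that SHOW JOBS and the adoption loops could block
// on; with the clause dropped it is a plain read.
func updateJobProgress(ctx context.Context, db *sql.DB, jobID int64, progress []byte) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback()

	// Previously: "SELECT status FROM system.jobs WHERE id = $1 FOR UPDATE".
	var status string
	if err := tx.QueryRowContext(ctx,
		"SELECT status FROM system.jobs WHERE id = $1", jobID,
	).Scan(&status); err != nil {
		return err
	}
	_ = status // a real implementation would validate the status before writing

	if _, err := tx.ExecContext(ctx,
		"UPDATE system.jobs SET progress = $1 WHERE id = $2", progress, jobID,
	); err != nil {
		return err
	}
	return tx.Commit()
}
```
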
@blathers-crl blathers-crl bot requested a review from a team July 29, 2021 17:37
@blathers-crl blathers-crl bot force-pushed the blathers/backport-release-21.1-67660 branch from ad61bb5 to 84654db on July 29, 2021 17:37
@blathers-crl
Author

blathers-crl bot commented Jul 29, 2021

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria below are satisfied.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user who doesn’t know or care about this backport has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?



@nvanbenschoten nvanbenschoten left a comment


LGTM, but this seems like the kind of change to let bake for 2 weeks on master before merging the backport. What do you think?

@ajwerner
Contributor

seems like the kind of change to let bake for 2 weeks on master before merging the backport.

I fully agree.

@shermanCRL
Contributor

Would be great to have this in 21.1.8.

@ajwerner
Contributor

Would be great to have this in 21.1.8.

Okay, pressing the button.

@ajwerner ajwerner merged commit 4143139 into release-21.1 Aug 10, 2021
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Aug 17, 2021
The root cause here is that we acquired the mutex inside the transaction which
also laid down intents. This was not a problem in earlier iterations of this
code because of the FOR UPDATE logic which would, generally, in theory, order
the transactions such that the first one to acquire the mutex would be the
first to lay down an intent, thus avoiding the deadlock by ordering the
acquisitions. That was changed in cockroachdb#68244, which removed the FOR UPDATE.

What we see now is that you have a transaction doing the progress update which
hits a restart but has laid down an intent. Then we have a transaction which
is doing a details update that starts and acquires the mutex but blocks on the
intent of the other transaction. That other transaction now is blocked on the
mutex and we have a deadlock.

The solution here is to not acquire the mutex inside these transactions.
Instead, the code copies out the relevant state prior to issuing the
transaction. The cost here should be pretty minimal, and the staleness in
the face of retries is the least of my concerns.

No release note because the code in cockroachdb#68244 has never been released.

Release note: None
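
The fix described in that commit message can be sketched roughly as follows.
This is an illustrative pattern only, using hypothetical stand-in types rather
than the actual backfill code: `runTxn` stands in for the retryable KV
transaction, and `progressTracker` for the mutex-protected progress state.

```go
package progresssketch

import (
	"context"
	"sync"
)

// runTxn stands in for kv.DB.Txn: the closure may be retried, and any writes
// it has issued become intents that other transactions can block on. A single
// invocation suffices for this sketch.
func runTxn(ctx context.Context, fn func(ctx context.Context) error) error {
	return fn(ctx)
}

type progressTracker struct {
	mu struct {
		sync.Mutex
		fractionCompleted float32
	}
}

// updateHoldingMutex shows the deadlock-prone shape: the mutex is acquired
// inside the retryable transaction, so a transaction that has laid down an
// intent and then restarts can block on the mutex, while another transaction
// holding the mutex blocks on that intent.
func (p *progressTracker) updateHoldingMutex(ctx context.Context, write func(float32) error) error {
	return runTxn(ctx, func(ctx context.Context) error {
		p.mu.Lock()
		defer p.mu.Unlock()
		return write(p.mu.fractionCompleted)
	})
}

// updateWithCopy shows the fix: copy the relevant state out under the mutex
// first, then run the transaction without holding it. A retry may persist
// slightly stale progress, which is acceptable here.
func (p *progressTracker) updateWithCopy(ctx context.Context, write func(float32) error) error {
	p.mu.Lock()
	fraction := p.mu.fractionCompleted
	p.mu.Unlock()
	return runTxn(ctx, func(ctx context.Context) error {
		return write(fraction)
	})
}
```
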
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Aug 17, 2021
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Aug 18, 2021
craig bot pushed a commit that referenced this pull request Aug 19, 2021
69040: sql: fix deadlock when updating backfill progress r=ajwerner a=ajwerner

Touches #68951, #68958.

Release note: None

Co-authored-by: Andrew Werner <[email protected]>
blathers-crl bot pushed a commit that referenced this pull request Aug 19, 2021
@rafiss rafiss deleted the blathers/backport-release-21.1-67660 branch November 22, 2021 06:45