release-21.1: sql: fix deadlock when updating backfill progress #69130

blathers-crl · 2021-08-19T03:55:41Z

Backport 1/1 commits from #69040 on behalf of @ajwerner.

/cc @cockroachdb/release

The root cause is that we acquired the mutex inside a transaction which
also laid down intents. This was not a problem in earlier iterations of this
code because the FOR UPDATE logic would generally order the transactions such
that the first one to acquire the mutex would also be the first to lay down an
intent, avoiding the deadlock by ordering the acquisitions. That changed in
#68244, which removed the FOR UPDATE.

What we see now is a transaction doing the progress update which hits a
restart after it has already laid down an intent. Then a second transaction,
doing a details update, starts and acquires the mutex but blocks on the intent
of the first transaction. The first transaction is now blocked on the mutex,
and we have a deadlock.
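
The cycle described above can be sketched as a tiny wait-for graph. This is only an illustration under assumptions: the `waitsFor` type, the `hasCycle` helper, and the transaction labels are hypothetical and do not come from the CockroachDB code.

```go
package main

import "fmt"

// waitsFor maps each transaction to the transaction it is blocked on.
// The keys are hypothetical labels for the two transactions in the description.
type waitsFor map[string]string

// hasCycle reports whether following wait edges from start leads back to start.
func hasCycle(g waitsFor, start string) bool {
	seen := map[string]bool{}
	for cur, ok := g[start]; ok; cur, ok = g[cur] {
		if cur == start {
			return true
		}
		if seen[cur] {
			// A loop that does not include start.
			return false
		}
		seen[cur] = true
	}
	return false
}

func main() {
	g := waitsFor{
		// Left an intent behind, now needs the mutex held by detailsUpdate.
		"progressUpdate": "detailsUpdate",
		// Holds the mutex, blocked on the intent of progressUpdate.
		"detailsUpdate": "progressUpdate",
	}
	fmt.Println(hasCycle(g, "progressUpdate"))
}
```

Neither side can make progress because each resource is released only after the other transaction finishes, and nothing orders the two acquisitions once FOR UPDATE is gone.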

The solution here is to not acquire the mutex inside these transactions.
Instead, the code copies out the relevant state prior to issuing the
transaction. The cost here should be pretty minimal and the staleness in
the face of retries is the least of my concerns.

No release note because the code in #68244 has never been released.

Touches #68951, #68958.

Release note: None

Release justification:


blathers-crl · 2021-08-19T03:55:44Z

Thanks for opening a backport.

Please check the backport criteria before merging:

Patches should only be created for serious issues.
Patches should not break backwards-compatibility.
Patches should change as little code as possible.
Patches should not change on-disk formats or node communication protocols.
Patches should not add new functionality.

If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.

There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

What did we do to ensure that a user that doesn’t know & care about this backport, has no idea that it happened?
Will this work in a cluster of mixed patch versions? Did we test that?
If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

cockroach-teamcity · 2021-08-19T03:55:51Z

This change is

blathers-crl bot force-pushed the blathers/backport-release-21.1-69040 branch from b51fd2c to ec7ed11 August 19, 2021 03:55

blathers-crl bot requested review from adityamaru, dt and fqazi August 19, 2021 03:55

blathers-crl bot assigned ajwerner Aug 19, 2021

dt approved these changes Aug 19, 2021


ajwerner merged commit 4b016de into release-21.1 Aug 22, 2021

This was referenced Aug 23, 2021

roachtest: schemachange/index/tpcc/w=1000 failed #68958

Closed

roachtest: schemachange/invertedindex failed #69109

Closed

roachtest: schemachange/bulkingest failed #68951

Closed

rafiss deleted the blathers/backport-release-21.1-69040 branch November 22, 2021 06:41
