
release-21.1: jobs: remove FOR UPDATE clause when updating job #68244

Merged
merged 1 commit into release-21.1 from blathers/backport-release-21.1-67660 on Aug 10, 2021

Conversation


@blathers-crl blathers-crl bot commented Jul 29, 2021

Backport 1/1 commits from #67660 on behalf of @ajwerner.

/cc @cockroachdb/release


In cockroachdb currently, the FOR UPDATE lock is an exclusive lock. That
means that both clients trying to inspect jobs and the job adoption loops will
try to scan the table and encounter these locks. For the most part, we
don't really update the job from the leaves of a distsql flow. There is an
exception, which is IMPORT incrementing a sequence. Nevertheless, the retry
behavior there seems sound. The other exception is pausing or canceling jobs.
I think that in that case we prefer to invalidate the work of the transaction,
as our intention is to cancel it.

If cockroach implemented UPGRADE locks (#49684), then this FOR UPDATE would
not be a problem.

Release note (performance improvement): Jobs no longer hold exclusive locks
for the duration of their checkpointing transactions, which could result in
long wait times when running SHOW JOBS.


Release justification:

In cockroachdb currently, the `FOR UPDATE` lock is an exclusive lock. That
means that both clients trying to inspect jobs and the job adoption loops will
try to scan the table and encounter these locks. For the most part, we
don't really update the job from the leaves of a distsql flow.

There is an exception, which is IMPORT incrementing a sequence. In that case,
which motivated the initial locking addition, we'll leave the locking in place.

The other exception is pausing or canceling jobs. I think that in that case
we prefer to invalidate the work of the transaction as our intention is to
cancel it.

If cockroach implemented UPGRADE locks (#49684), then this FOR UPDATE would
not be a problem.

Release note (performance improvement): Jobs no longer hold exclusive locks
for the duration of their checkpointing transactions, which could result in
long wait times when running SHOW JOBS.
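
To illustrate the shape of the change, here is a minimal sketch of the kind of
read-modify-write a job update performs inside a transaction. This is not the
actual registry code; the query text, column set, and helper function are
simplified assumptions. The backport drops the `FOR UPDATE` clause from the
read, so conflicting writers fall back to normal transaction retries instead
of an exclusive row lock.

```go
package jobsupdate

import (
	"context"
	"database/sql"
)

// updateJobProgress sketches the read-modify-write that a job update performs
// inside a transaction. Before this change the read ended in FOR UPDATE,
// taking an exclusive lock that SHOW JOBS and the adoption loops could block
// on; with the clause dropped it is a plain read.
func updateJobProgress(ctx context.Context, db *sql.DB, jobID int64, progress []byte) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback()

	// Previously: "SELECT status FROM system.jobs WHERE id = $1 FOR UPDATE".
	var status string
	if err := tx.QueryRowContext(ctx,
		"SELECT status FROM system.jobs WHERE id = $1", jobID,
	).Scan(&status); err != nil {
		return err
	}
	_ = status // a real implementation would validate the status before writing

	if _, err := tx.ExecContext(ctx,
		"UPDATE system.jobs SET progress = $1 WHERE id = $2", progress, jobID,
	); err != nil {
		return err
	}
	return tx.Commit()
}
```
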
@blathers-crl blathers-crl bot requested a review from a team July 29, 2021 17:37
@blathers-crl blathers-crl bot force-pushed the blathers/backport-release-21.1-67660 branch from ad61bb5 to 84654db on July 29, 2021 17:37
@blathers-crl
Author

blathers-crl bot commented Jul 29, 2021

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria below are satisfied.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user who doesn’t know or care about this backport has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?



@nvanbenschoten nvanbenschoten left a comment


LGTM, but this seems like the kind of change to let bake for 2 weeks on master before merging the backport. What do you think?

@ajwerner
Contributor

seems like the kind of change to let bake for 2 weeks on master before merging the backport.

I fully agree.

@shermanCRL
Contributor

Would be great to have this in 21.1.8.

@ajwerner
Contributor

Would be great to have this in 21.1.8.

Okay, pressing the button.

@ajwerner ajwerner merged commit 4143139 into release-21.1 Aug 10, 2021
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Aug 17, 2021
The root cause here is that we acquired the mutex inside the transaction which
also laid down intents. This was not a problem in earlier iterations of this
code because of the FOR UPDATE logic which would, generally, in theory, order
the transactions such that the first one to acquire the mutex would be the
first to lay down an intent, thus avoiding the deadlock by ordering the
acquisitions. That was changed in cockroachdb#68244, which removed the FOR UPDATE.

What we see now is that you have a transaction doing the progress update which
hits a restart but has laid down an intent. Then we have a transaction which
is doing a details update that starts and acquires the mutex but blocks on the
intent of the other transaction. That other transaction now is blocked on the
mutex and we have a deadlock.

The solution here is to not acquire the mutex inside these transactions.
Instead, the code copies out the relevant state prior to issuing the
transaction. The cost here should be pretty minimal, and the staleness in
the face of retries is the least of my concerns.

No release note because the code in cockroachdb#68244 has never been released.

Release note: None
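
The fix described in that commit message can be sketched roughly as follows.
This is an illustrative pattern only, using hypothetical stand-in types rather
than the actual backfill code: `runTxn` stands in for the retryable KV
transaction, and `progressTracker` for the mutex-protected progress state.

```go
package progresssketch

import (
	"context"
	"sync"
)

// runTxn stands in for kv.DB.Txn: the closure may be retried, and any writes
// it has issued become intents that other transactions can block on. A single
// invocation suffices for this sketch.
func runTxn(ctx context.Context, fn func(ctx context.Context) error) error {
	return fn(ctx)
}

type progressTracker struct {
	mu struct {
		sync.Mutex
		fractionCompleted float32
	}
}

// updateHoldingMutex shows the deadlock-prone shape: the mutex is acquired
// inside the retryable transaction, so a transaction that has laid down an
// intent and then restarts can block on the mutex, while another transaction
// holding the mutex blocks on that intent.
func (p *progressTracker) updateHoldingMutex(ctx context.Context, write func(float32) error) error {
	return runTxn(ctx, func(ctx context.Context) error {
		p.mu.Lock()
		defer p.mu.Unlock()
		return write(p.mu.fractionCompleted)
	})
}

// updateWithCopy shows the fix: copy the relevant state out under the mutex
// first, then run the transaction without holding it. A retry may persist
// slightly stale progress, which is acceptable here.
func (p *progressTracker) updateWithCopy(ctx context.Context, write func(float32) error) error {
	p.mu.Lock()
	fraction := p.mu.fractionCompleted
	p.mu.Unlock()
	return runTxn(ctx, func(ctx context.Context) error {
		return write(fraction)
	})
}
```
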
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Aug 17, 2021
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Aug 18, 2021
craig bot pushed a commit that referenced this pull request Aug 19, 2021
69040: sql: fix deadlock when updating backfill progress r=ajwerner a=ajwerner

Touches #68951, #68958.

Release note: None

Co-authored-by: Andrew Werner <[email protected]>
blathers-crl bot pushed a commit that referenced this pull request Aug 19, 2021
@rafiss rafiss deleted the blathers/backport-release-21.1-67660 branch November 22, 2021 06:45