Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql/catalog/lease: high priority reads in lease acquisition of OFFLINE descs may starve bulk operations #61798

Closed
ajwerner opened this issue Mar 10, 2021 · 2 comments · Fixed by #62611 or #62959
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

@ajwerner
Copy link
Contributor

Describe the problem

In (#46170, #46384) we made changes to use high priority read transaction in lease acquisitions in order to break deadlocks. This solution isn't wonderful: if users use PRIORITY HIGH on their DDL transactions, they can still hit these deadlocks. Given people tend not to do that, this proved to be a pretty good and pragmatic solution to a very serious problem.

When code attempts to resolve descriptors it goes through the leasing mechanism. This is true even if the descriptor does not exist. This is made much worse in a case where you've got a multi-region cluster.

The scenario that causes specific pain is when you've got an application running and trying to hit some tables but you're in the process of restoring those tables. As part of the restore, we need to update the descriptors. If the restore job is far from the descriptor table leaseholder, then there's a long period of time for interceding queries to push the intent. After getting pushed, the query will need to refresh. If the pushing is happening more frequently than the refresh's duration, we'll get stuck in livelock.

For what it's worth, all of this is modestly speculative. We saw the pushing but I don't know that we've captured enough data.

To Reproduce

In theory, start a restore in a global cluster and get the leaseholder on the system config span in a different region than the restore coordinator. Then spam the database being restored and see if you can hold off the restore.

Some ideas

  • We could lease and cache the offline descriptor. This is probably the most compelling choice. It's not too hard.
  • We could make the lease acquisition low priority and fall back to reading from the store (probably not a good idea).
  • We could get rid of the high priority hack altogether with some work involving child transactions (see kv: child transactions #54633).

Expected behavior

The restore finishes.

@ajwerner ajwerner added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Mar 10, 2021
@pbardea
Copy link
Contributor

pbardea commented Mar 12, 2021

As an update, a reproduction was confirmed following the steps above. The command create table foo (a int primary key references bar(a)) was also able to get stuck on

return sc.txn(ctx, func(ctx context.Context, txn *kv.Txn, descsCol *descs.Collection) error {
while a distant node was being spammed with selects on foo.

@ajwerner
Copy link
Contributor Author

The short term fix which we'll backport will be to lease descriptors in the Adding state. This isn't a panacea but I think it will fix the immediate problem.

fqazi added a commit to fqazi/cockroach that referenced this issue Mar 25, 2021
Fixes: cockroachdb#61798

Previously, offline descriptors would never have their leases
cached and they would be released once the reference count hit zero.
This was inadequate because when attempting to online these tables
again the lease acquisition could be pushed back by other operations,
leading to starvation / live locks. To address this, this patch will
allow the leases of offline descriptors to be cached.

Release note (bug fix): Lease acquisitions of descriptor in a
offline state may starve out bulk operations (backup / restore)
fqazi added a commit to fqazi/cockroach that referenced this issue Mar 29, 2021
Fixes: cockroachdb#61798

Previously, offline descriptors would never have their leases
cached and they would be released once the reference count hit zero.
This was inadequate because when attempting to online these tables
again the lease acquisition could be pushed back by other operations,
leading to starvation / live locks. To address this, this patch will
allow the leases of offline descriptors to be cached.

Release note (bug fix): Lease acquisitions of descriptor in a
offline state may starve out bulk operations (backup / restore)
craig bot pushed a commit that referenced this issue Mar 30, 2021
62611: sql: lease acquisition of OFFLINE descs may starve bulk operations r=ajwerner a=fqazi

Fixes: #61798

Previously, offline descriptors would never have their leases
cached and they would be released once the reference count hit zero.
This was inadequate because when attempting to online these tables
again the lease acquisition could be pushed back by other operations,
leading to starvation / live locks. To address this, this patch will
allow the leases of offline descriptors to be cached.

Release note (bug fix): Lease acquisitions of descriptor in a
offline state may starve out bulk operations (backup / restore)

62764: sql: gate global_reads zone attribute on cluster version and enterprise license r=nvanbenschoten a=nvanbenschoten

Fixes #61039.

This commit gates the use of the global_reads zone configuration
attribute on the cluster version and on the use of an enterprise
license.

The first half of this is necessary for correctness. The second half is
not, because follower reads won't be served from global read followers
without an enterprise license, but a check when the zone configs are set
improves UX.

62778: geo/geomfn: refactor logic for point in polygon optimization  r=otan a=andyyang890

Release note: None

62815: build: upgrade go to 1.15.10 r=otan a=rickystewart

Pick up a patch to a compiler bug (see #62521).

* [ ] Adjust the Pebble tests to run in new version.
* [x] Adjust version in Docker image ([source](./builder/Dockerfile)).
* [x] Rebuild and push the Docker image (following [Basic Process](#basic-process))
* [x] Bump the version in `WORKSPACE` under `go_register_toolchains`. You may need to bump [rules_go](https://github.com/bazelbuild/rules_go/releases).
* [x] Bump the version in `builder.sh` accordingly ([source](./builder.sh#L6)).
* [ ] Bump the version in `go-version-check.sh` ([source](./go-version-check.sh)), unless bumping to a new patch release.
* [ ] Bump the go version in `go.mod`. You may also need to rerun `make vendor_rebuild` if vendoring has changed.
* [x] Bump the default installed version of Go in `bootstrap-debian.sh` ([source](./bootstrap/bootstrap-debian.sh#L40-42)).
* [x] Replace other mentions of the older version of go (grep for `golang:<old_version>` and `go<old_version>`).
* [ ] Update the `builder.dockerImage` parameter in the TeamCity [`Cockroach`](https://teamcity.cockroachdb.com/admin/editProject.html?projectId=Cockroach&tab=projectParams) and [`Internal`](https://teamcity.cockroachdb.com/admin/editProject.html?projectId=Internal&tab=projectParams) projects.
* [ ] Adjust `GO_VERSION` in the TeamCity agent image ([setup script](./packer/teamcity-agent.sh))
  and ask the Developer Infrastructure team to deploy new images.

Resolves #62809.

Release note (general change): Upgrade the CRDB binary to Go 1.15.10.

62823: schemaexpr: remove inapplicable TODO r=mgartner a=mgartner

Release note: None

Co-authored-by: Faizan Qazi <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Andy Yang <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
@craig craig bot closed this as completed in 198fa1a Mar 30, 2021
@fqazi fqazi reopened this Apr 1, 2021
fqazi added a commit to fqazi/cockroach that referenced this issue Apr 1, 2021
Fixes: cockroachdb#61798

Previously, offline descriptors would never have their leases
cached and they would be released once the reference count hit zero.
This was inadequate because when attempting to online these tables
again the lease acquisition could be pushed back by other operations,
leading to starvation / live locks. To address this, this patch will
allow the leases of offline descriptors to be cached.

Release note (bug fix): Lease acquisitions of descriptor in a
offline state may starve out bulk operations (backup / restore)
craig bot pushed a commit that referenced this issue Apr 1, 2021
62954: sql: Re-enable multi_region_backup test r=arulajmani a=ajstorm

With #60835 merged, this test no longer flakes. I've stressed it on my
GCE worker now for a while an it's all good.

Resolves #60773.

Release note: None

62959: sql: lease acquisition of OFFLINE descs may starve bulk operations r=ajwerner a=fqazi

Fixes: #61798

Previously, offline descriptors would never have their leases
cached and they would be released once the reference count hit zero.
This was inadequate because when attempting to online these tables
again the lease acquisition could be pushed back by other operations,
leading to starvation / live locks. To address this, this patch will
allow the leases of offline descriptors to be cached.

Release note (bug fix): Lease acquisitions of descriptor in a
offline state may starve out bulk operations (backup / restore)

Co-authored-by: Adam Storm <[email protected]>
Co-authored-by: Faizan Qazi <[email protected]>
@craig craig bot closed this as completed in 4b6518c Apr 1, 2021
fqazi added a commit to fqazi/cockroach that referenced this issue Apr 7, 2021
Fixes: cockroachdb#61798

Previously, offline descriptors would never have their leases
cached and they would be released once the reference count hit zero.
This was inadequate because when attempting to online these tables
again the lease acquisition could be pushed back by other operations,
leading to starvation / live locks. To address this, this patch will
allow the leases of offline descriptors to be cached.

Release note (bug fix): Lease acquisitions of descriptor in a
offline state may starve out bulk operations (backup / restore)
fqazi added a commit to fqazi/cockroach that referenced this issue Apr 8, 2021
Fixes: cockroachdb#61798

Previously, offline descriptors would never have their leases
cached and they would be released once the reference count hit zero.
This was inadequate because when attempting to online these tables
again the lease acquisition could be pushed back by other operations,
leading to starvation / live locks. To address this, this patch will
allow the leases of offline descriptors to be cached.

Release note (bug fix): Lease acquisitions of descriptor in a
offline state may starve out bulk operations (backup / restore)
fqazi added a commit to fqazi/cockroach that referenced this issue Apr 13, 2021
Fixes: cockroachdb#61798

Previously, offline descriptors would never have their leases
cached and they would be released once the reference count hit zero.
This was inadequate because when attempting to online these tables
again the lease acquisition could be pushed back by other operations,
leading to starvation / live locks. To address this, this patch will
allow the leases of offline descriptors to be cached.

Release note (bug fix): Lease acquisitions of descriptor in a
offline state may starve out bulk operations (backup / restore)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
Projects
None yet
3 participants