spanconfig: ensure safety without mutual exclusion in the reconciliation job #73789
Comments
Thanks for writing this up. My suggestion would be to give up any hope of true mutual exclusion. It doesn't exist in a fault-tolerant distributed system without making synchrony assumptions. That's what FLP is all about. So instead of asking "how do I get mutual exclusion?", ask "how do I ensure liveness and safety when I have mutual exclusion, and how do I ensure safety when I do not?". Fencing is a common technique to accomplish this. An example of this is the use of lease indexes in KV: each leaseholder attaches increasing lease indexes to the commands it proposes.
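To make the fencing idea concrete, here's a minimal sketch of the pattern (the types and names below are hypothetical, not CockroachDB APIs): the protected resource remembers the highest fencing token it has seen and rejects writes carrying anything lower, so a deposed writer can't clobber state written by its successor.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// fencedStore is a toy resource that only accepts writes carrying a
// fencing token at least as high as the highest token it has seen.
// Stale writers (those fenced off by a newer writer holding a higher
// token) are rejected instead of silently overwriting newer state.
type fencedStore struct {
	mu      sync.Mutex
	highest uint64
	value   string
}

var errStaleToken = errors.New("write rejected: fencing token is stale")

func (s *fencedStore) Write(token uint64, value string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if token < s.highest {
		return errStaleToken
	}
	s.highest = token
	s.value = value
	return nil
}

func main() {
	s := &fencedStore{}
	_ = s.Write(1, "from writer A") // accepted
	_ = s.Write(2, "from writer B") // accepted: B has fenced A off
	err := s.Write(1, "late write from A")
	fmt.Println(err) // write rejected: fencing token is stale
}
```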
A straw-man proposal would be to use an incrementing integer that the job increments on the SQL side and then passes along with its requests to KV. KV could then reject any write carrying an integer lower than the highest it has seen. The transaction that increments the integer on the SQL side can prove that it still holds a lease on the job. This would require a little bit of extra state.
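A toy model of that straw-man, with the jobs-table interaction reduced to an in-memory ledger (again, a hedged sketch -- nothing here is real CockroachDB code): only the instance that currently holds the job lease can bump the counter, and the value it claims becomes the fencing token attached to its KV-bound writes, which KV would check as in the previous sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// jobLedger stands in for the jobs table: the counter can only be
// bumped by the instance that currently owns the job, so a deposed
// reconciler cannot claim a newer fencing value.
type jobLedger struct {
	leaseHolder string // instance that currently owns the job
	counter     uint64 // monotonically increasing fencing value
}

var errNotLeaseHolder = errors.New("instance no longer holds the job lease")

// bump models the SQL-side transaction: prove ownership, then increment.
func (l *jobLedger) bump(instance string) (uint64, error) {
	if instance != l.leaseHolder {
		return 0, errNotLeaseHolder
	}
	l.counter++
	return l.counter, nil
}

func main() {
	ledger := &jobLedger{leaseHolder: "node-1"}

	tok1, _ := ledger.bump("node-1") // node-1 starts reconciling with token 1
	fmt.Println("node-1 token:", tok1)

	ledger.leaseHolder = "node-2"    // the job is adopted by node-2
	tok2, _ := ledger.bump("node-2") // node-2 claims a higher token
	fmt.Println("node-2 token:", tok2)

	// node-1, now deposed, cannot claim a newer token; any writes it still
	// has in flight carry token 1, which KV would reject as stale.
	if _, err := ledger.bump("node-1"); err != nil {
		fmt.Println("node-1:", err)
	}
}
```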
Re-opening to track the bug fix + additional testing in #80196.
Describe the problem
We introduced a singleton AUTO SPAN CONFIG RECONCILIATION job in #68522. We use this per-tenant job to drive the reconciliation process between a tenant's SQL config state (zone configs) and the cluster's KV state (span configs) -- more details can be found in the accompanying RFC. The RFC relies on strict mutual exclusion guarantees: at any given point in time, only a single reconciliation instance may be up and running. We're currently failing to provide this guarantee, as described in the thread here: #71994 (review). This issue tracks fixing that bug.

Jira issue: CRDB-11751