Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

spanconfig: ensure safety without mutual exclusion in the reconciliation job #73789

Closed
irfansharif opened this issue Dec 14, 2021 · 3 comments · Fixed by #79171 or #80196
Closed

spanconfig: ensure safety without mutual exclusion in the reconciliation job #73789

irfansharif opened this issue Dec 14, 2021 · 3 comments · Fixed by #79171 or #80196
Assignees
Labels
A-zone-configs branch-release-22.1 Used to mark GA and release blockers, technical advisories, and bugs for 22.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. GA-blocker

Comments

@irfansharif
Copy link
Contributor

irfansharif commented Dec 14, 2021

Describe the problem

We introduced a singleton AUTO SPAN CONFIG RECONCILATION job in #68522. We use this per-tenant job to drive the reconciliation process between a tenant's SQL config state (zone configs) with the cluster's KV state (span configs) -- more details found in the accompanying RFC. The RFC relies on strict mutual exclusion guarantees, needing to ensure that at a given point in time, only a single reconciliation instance is up and running. We're currently failing to provide this guarantee as described in the thread here: #71994 (review). This issue tracks fixing this bug.

Jira issue: CRDB-11751

@irfansharif irfansharif added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-zone-configs labels Dec 14, 2021
@nvanbenschoten
Copy link
Member

Thanks for writing this up. My suggestion would be to give up any hope of true mutual exclusion. It doesn't exist in a fault-tolerant distributed system without making synchrony assumptions. That's what FLP is all about. So instead of asking "how do I get mutual exclusion?", ask "how do I ensure liveness and safety when I have mutual exclusion, and how do I ensure safety when I do not?".

Fencing is a common technique to accomplish this. An example of this is the use of lease indexes in KV. Each leaseholder attaches increasing MaxLeaseIndex values to Raft proposals. Below Raft, this MaxLeaseIndex is checked against the current LeaseAppliedIndex in a compare-and-swap-like validation step. If the CAS fails, the "committed" Raft entry is prevented from applying its state machine update. So even if a lease does not provide true mutual exclusion, an old leaseholder can't disrupt a new one.

@irfansharif irfansharif changed the title spanconfig: ensure mutual exclusion in the reconciliation job spanconfig: ensure safety without mutual exclusion in the reconciliation job Dec 15, 2021
@ajwerner
Copy link
Contributor

A straw-man proposal would be to use some incrementing integer which we increment in the job on the sql side and then pass into requests to KV. Then KV could reject any write that has an integer that's lower than the highest it has seen. The transaction to increment it on the sql side could prove that it has a lease on the job. This will require a little bit of extra state.

@arulajmani arulajmani added release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. branch-release-22.1 Used to mark GA and release blockers, technical advisories, and bugs for 22.1 labels Mar 23, 2022
@irfansharif irfansharif added GA-blocker and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 24, 2022
@arulajmani arulajmani self-assigned this Mar 30, 2022
@craig craig bot closed this as completed in #79171 Apr 18, 2022
@irfansharif
Copy link
Contributor Author

Re-opening to track the bug fix + additional testing in #80196.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-zone-configs branch-release-22.1 Used to mark GA and release blockers, technical advisories, and bugs for 22.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. GA-blocker
Projects
None yet
6 participants