spanconfig: ensure safety without mutual exclusion in the reconciliation job #73789
Comments
Thanks for writing this up. My suggestion would be to give up any hope of true mutual exclusion. It doesn't exist in a fault-tolerant distributed system without making synchrony assumptions. That's what FLP is all about. So instead of asking "how do I get mutual exclusion?", ask "how do I ensure liveness and safety when I have mutual exclusion, and how do I ensure safety when I do not?". Fencing is a common technique to accomplish this. An example of this is the use of lease indexes in KV: each leaseholder attaches increasing lease indexes to the commands it proposes.
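To make the fencing idea concrete, here's a minimal sketch of the pattern (the types and names below are hypothetical, not CockroachDB APIs): the protected resource remembers the highest fencing token it has seen and rejects writes carrying anything lower, so a deposed writer can't clobber state written by its successor.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// fencedStore is a toy resource that only accepts writes carrying a
// fencing token at least as high as the highest token it has seen.
// Stale writers (those fenced off by a newer writer holding a higher
// token) are rejected instead of silently overwriting newer state.
type fencedStore struct {
	mu      sync.Mutex
	highest uint64
	value   string
}

var errStaleToken = errors.New("write rejected: fencing token is stale")

func (s *fencedStore) Write(token uint64, value string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if token < s.highest {
		return errStaleToken
	}
	s.highest = token
	s.value = value
	return nil
}

func main() {
	s := &fencedStore{}
	_ = s.Write(1, "from writer A") // accepted
	_ = s.Write(2, "from writer B") // accepted: B has fenced A off
	err := s.Write(1, "late write from A")
	fmt.Println(err) // write rejected: fencing token is stale
}
```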
A straw-man proposal would be to use an incrementing integer that the job increments on the SQL side and then passes along with its requests to KV. KV could then reject any write carrying an integer lower than the highest it has seen. The transaction that increments the integer on the SQL side can prove that it still holds a lease on the job. This would require a little bit of extra state.
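A toy model of that straw-man, with the jobs-table interaction reduced to an in-memory ledger (again, a hedged sketch -- nothing here is real CockroachDB code): only the instance that currently holds the job lease can bump the counter, and the value it claims becomes the fencing token attached to its KV-bound writes, which KV would check as in the previous sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// jobLedger stands in for the jobs table: the counter can only be
// bumped by the instance that currently owns the job, so a deposed
// reconciler cannot claim a newer fencing value.
type jobLedger struct {
	leaseHolder string // instance that currently owns the job
	counter     uint64 // monotonically increasing fencing value
}

var errNotLeaseHolder = errors.New("instance no longer holds the job lease")

// bump models the SQL-side transaction: prove ownership, then increment.
func (l *jobLedger) bump(instance string) (uint64, error) {
	if instance != l.leaseHolder {
		return 0, errNotLeaseHolder
	}
	l.counter++
	return l.counter, nil
}

func main() {
	ledger := &jobLedger{leaseHolder: "node-1"}

	tok1, _ := ledger.bump("node-1") // node-1 starts reconciling with token 1
	fmt.Println("node-1 token:", tok1)

	ledger.leaseHolder = "node-2"    // the job is adopted by node-2
	tok2, _ := ledger.bump("node-2") // node-2 claims a higher token
	fmt.Println("node-2 token:", tok2)

	// node-1, now deposed, cannot claim a newer token; any writes it still
	// has in flight carry token 1, which KV would reject as stale.
	if _, err := ledger.bump("node-1"); err != nil {
		fmt.Println("node-1:", err)
	}
}
```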
Re-opening to track the bug fix + additional testing in #80196.
Describe the problem
We introduced a singleton AUTO SPAN CONFIG RECONCILIATION job in #68522. We use this per-tenant job to drive the reconciliation process between a tenant's SQL config state (zone configs) and the cluster's KV state (span configs) -- more details can be found in the accompanying RFC. The RFC relies on strict mutual exclusion guarantees: at any given point in time, only a single reconciliation instance may be up and running. We're currently failing to provide this guarantee, as described in the thread here: #71994 (review). This issue tracks fixing that bug.

Jira issue: CRDB-11751