Commit
storage: make leaving joint config idempotent
ChangeReplicas (and AdminSplit, and AdminMerge) take a RangeDescriptor that they use as the basis for a CPut, which makes sure that operations mutating the range serialize. This is great for correctness but generally unfortunate for usability: on a mismatch, the caller usually wanted to do the thing they were trying to do anyway, using the new descriptor. The fact that every split (replication change, merge) basically needs a retry loop is a constant trickle of test flakes and UX papercuts.

It became more pressing to do something about this because we routinely use joint configs when atomic replication changes are enabled. A joint configuration is transitioned out of opportunistically whenever it is encountered, but this frequently causes a race: actor A finds a joint config and begins a transaction to leave it, but actor B gets there first. The end result is that what actor A wanted to achieve has been achieved, though by someone else, and actor A sees a spurious error.

This commit fixes that behavior in the targeted case of wanting to leave a joint configuration: actor A will get a successful result.

Before this change, `make stress PKG=./pkg/sql TESTS=TestShowRangesWithLocal` would fail immediately when `kv.atomic_replication_changes.enabled` was true, because the splits this test carries out would run into the joint configurations of the initial upreplication and would race the replicate queue to transition out of them, a race that at least one split would typically lose. The race still happens, but now it's not an error any more.
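To make the idea concrete, here is a minimal, self-contained sketch of an idempotent "leave joint config" operation on top of a conditional put. It is not the code from this commit: `Descriptor`, `Store`, `CPut`, and `MaybeLeaveJoint` are hypothetical, pared-down stand-ins, and the rule "a mismatch is fine if the stored descriptor is no longer joint" is an assumption about how such a check could look.

```go
package main

import (
	"errors"
	"fmt"
)

// ReplicaType is a simplified stand-in for the replica types in a descriptor.
type ReplicaType int

const (
	VoterFull ReplicaType = iota
	VoterIncoming
	VoterOutgoing
)

// Replica is one member of the (hypothetical) range descriptor.
type Replica struct {
	NodeID int
	Type   ReplicaType
}

// Descriptor is a hypothetical, pared-down range descriptor.
type Descriptor struct {
	Generation int
	Replicas   []Replica
}

// InJointConfig reports whether any replica is still incoming or outgoing.
func (d Descriptor) InJointConfig() bool {
	for _, r := range d.Replicas {
		if r.Type != VoterFull {
			return true
		}
	}
	return false
}

// Store holds a single descriptor and supports a conditional put on it.
type Store struct{ desc Descriptor }

var errMismatch = errors.New("descriptor changed")

// CPut writes next only if the stored generation matches expected's.
func (s *Store) CPut(expected, next Descriptor) error {
	if s.desc.Generation != expected.Generation {
		return errMismatch
	}
	s.desc = next
	return nil
}

// leaveJoint computes the descriptor that results from leaving the joint
// config: outgoing voters are dropped, incoming voters become full voters.
func leaveJoint(d Descriptor) Descriptor {
	out := Descriptor{Generation: d.Generation + 1}
	for _, r := range d.Replicas {
		if r.Type == VoterOutgoing {
			continue
		}
		out.Replicas = append(out.Replicas, Replica{NodeID: r.NodeID, Type: VoterFull})
	}
	return out
}

// MaybeLeaveJoint tries to leave the joint config described by seen. If the
// CPut loses a race but the stored descriptor is no longer joint, the goal
// has been reached by someone else and the call still succeeds (assumption).
func MaybeLeaveJoint(s *Store, seen Descriptor) (Descriptor, error) {
	if !seen.InJointConfig() {
		return seen, nil
	}
	next := leaveJoint(seen)
	err := s.CPut(seen, next)
	if err == nil {
		return next, nil
	}
	if errors.Is(err, errMismatch) && !s.desc.InJointConfig() {
		return s.desc, nil
	}
	return Descriptor{}, fmt.Errorf("leaving joint config: %w", err)
}

func main() {
	joint := Descriptor{Generation: 1, Replicas: []Replica{
		{1, VoterFull}, {2, VoterIncoming}, {3, VoterOutgoing},
	}}
	s := &Store{desc: joint}
	_, errB := MaybeLeaveJoint(s, joint) // actor B gets there first
	_, errA := MaybeLeaveJoint(s, joint) // actor A holds a stale descriptor
	fmt.Println("actor B error:", errB, "actor A error:", errA)
}
```

In this sketch, actor A's stale CPut fails, but because the stored descriptor is already out of the joint config, the call reports success and hands back the current descriptor. A mismatch that leaves the range in a joint config is still surfaced as an error.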
I do think that it makes sense to use a similar strategy in general (fail replication changes only if the *replicas* changed, allow all splits except when the split key moves out of the current descriptor, etc.), but in the process of coding this up I was reminded of all of the problems relating to range merges and also found what I think is a long-standing and pretty fatal bug, cockroachdb#40367, so I don't want to do anything until the release is out the door. But I'm basically convinced that if we did it, it wouldn't cause a new "bug", because any replication change carried out in that way is one that a user could trigger just the same under the old checks.

Release note: None