-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
release-22.1: kvserver,spanconfig: add requisite version gates #98094
release-22.1: kvserver,spanconfig: add requisite version gates #98094
Conversation
Thanks for opening a backport. Please check the backport criteria before merging:
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
Add a brief release justification to the body of your PR to justify this backport. Some other things to consider:
|
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
7fad424
to
703c98f
Compare
(Added a unit test that fails before these changes.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something makes your test sad under race. Maybe skip it?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks, @arulajmani, and @nvanbenschoten)
In an internal support issue[1] we observed that a mixed-version cluster straddling 21.2 and 22.1 (where span configs were first introduced) with a 5x replication factor, the 22.1 nodes were incorrectly down replicating ranges towards 3x replication. This happened because we were missing requisite version gates in KV code when deciding what "source" to use for span configs (choosing between the gossiped system config span or the span configs infrastructure introduced in \cockroachdb#67679). The 22.1 nodes were using a fallback, statically hard coded config with 3x replication factor. Release note (bug fix): In mixed version clusters running 21.2 and 22.1 nodes, it was possible for CockroachDB to not respect zone configs. This manifested in a few ways: - If num_replicas was set to something other than 3, we would still add or remove replicas to get to 3x replication. - If num_voters was set explicitly to get a mix of voting and non-voting replicas, it would be ignored. CockroachDB could possible remove non-voting replicas. - If range_min_bytes or range_max_bytes were changed from 128 MiB and 512 MiB respectively, we would instead try to size ranges to be within [128 MiB, 512MiB]. This could appear as an excess amount of range splits or merges, as visible in the Replication Dashboard under "Range Operations". - If gc.ttlseconds was set to something other than 90000 seconds, we would still GC data only older than 90000s/25h. If the GC TTL was set to something larger than 25h, AOST queries going further back may now start failing. For GC TTLs less than the 25h default, clusters would observe increased disk usage due to more retained garbage. - If constraints, lease_preferences or voter_constraints were set, they would be ignored. Range data and leases would possibly be moved outside where prescribed. This issues only lasted during the mixed-version state, where the cluster was not finalized. Finalization of the 22.1 version happens automatically when all nodes in the cluster are running 22.1. The only exception to this is if users have set cluster.preserve_downgrade_option, where auto-finalization is allowed to proceed once they issue a 'RESET CLUSTER SETTING cluster.preserve_downgrade_option', or 'SET CLUSTER SETTING version = 22.1' explicitly. Cluster version finalization is better documented on https://www.cockroachlabs.com/docs/v22.1/upgrade-cockroach-version. [1]: https://github.com/cockroachlabs/support/issues/2104
703c98f
to
910aff4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. It just timed out since we're using too many nodes in the test (could instead write it to assert on gc.ttlseconds, but I'm being lazy for this backport-only PR).
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks, @arulajmani, and @nvanbenschoten)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assuming you can make CI happy
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @AlexTalks, @arulajmani, and @nvanbenschoten)
In an internal support issue1 we observed that a mixed-version cluster straddling 21.2 and 22.1 (where span configs were first introduced) with a 5x replication factor, the 22.1 nodes were incorrectly down replicating ranges towards 3x replication. This happened because we were missing requisite version gates in KV code when deciding what "source" to use for span configs (choosing between the gossiped system config span or the span configs infrastructure introduced in #67679). The 22.1 nodes were using a fallback, statically hard coded config with 3x replication factor.
Release note (bug fix): In mixed version clusters running 21.2 and 22.1 nodes, it was possible for CockroachDB to not respect zone configs. This manifested in a few ways:
cluster.preserve_downgrade_option, where auto-finalization is allowed to proceed once they issue a 'RESET CLUSTER SETTING cluster.preserve_downgrade_option', or 'SET CLUSTER SETTING version = 22.1' explicitly. Cluster version finalization is better documented on https://www.cockroachlabs.com/docs/v22.1/upgrade-cockroach-version.
Release justification: Bug fix.
Footnotes
https://github.com/cockroachlabs/support/issues/2104 ↩