Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-22.1: kvserver,spanconfig: add requisite version gates #98094

Conversation

irfansharif
Copy link
Contributor

@irfansharif irfansharif commented Mar 6, 2023

In an internal support issue1 we observed that a mixed-version cluster straddling 21.2 and 22.1 (where span configs were first introduced) with a 5x replication factor, the 22.1 nodes were incorrectly down replicating ranges towards 3x replication. This happened because we were missing requisite version gates in KV code when deciding what "source" to use for span configs (choosing between the gossiped system config span or the span configs infrastructure introduced in #67679). The 22.1 nodes were using a fallback, statically hard coded config with 3x replication factor.

Release note (bug fix): In mixed version clusters running 21.2 and 22.1 nodes, it was possible for CockroachDB to not respect zone configs. This manifested in a few ways:

  • If num_replicas was set to something other than 3, we would still add or remove replicas to get to 3x replication.
    • If num_voters was set explicitly to get a mix of voting and non-voting replicas, it would be ignored. CockroachDB could possible remove non-voting replicas.
  • If range_min_bytes or range_max_bytes were changed from 128 MiB and 512 MiB respectively, we would instead try to size ranges to be within [128 MiB, 512MiB]. This could appear as an excess amount of range splits or merges, as visible in the Replication Dashboard under "Range Operations".
  • If gc.ttlseconds was set to something other than 90000 seconds, we would still GC data only older than 90000s/25h. If the GC TTL was set to something larger than 25h, AOST queries going further back may now start failing. For GC TTLs less than the 25h default, clusters would observe increased disk usage due to more retained garbage.
  • If constraints, lease_preferences or voter_constraints were set, they would be ignored. Range data and leases would possibly be moved outside where prescribed. This issues only lasted during the mixed-version state, where the cluster was not finalized. Finalization of the 22.1 version happens automatically when all nodes in the cluster are running 22.1. The only exception to this is if users have set
    cluster.preserve_downgrade_option, where auto-finalization is allowed to proceed once they issue a 'RESET CLUSTER SETTING cluster.preserve_downgrade_option', or 'SET CLUSTER SETTING version = 22.1' explicitly. Cluster version finalization is better documented on https://www.cockroachlabs.com/docs/v22.1/upgrade-cockroach-version.

Release justification: Bug fix.

Footnotes

  1. https://github.com/cockroachlabs/support/issues/2104

@irfansharif irfansharif requested a review from a team as a code owner March 6, 2023 21:05
@blathers-crl
Copy link

blathers-crl bot commented Mar 6, 2023

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport, has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

@blathers-crl
Copy link

blathers-crl bot commented Mar 6, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@irfansharif irfansharif requested a review from AlexTalks March 6, 2023 21:05
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@irfansharif irfansharif force-pushed the 230306.missing-version-gates branch from 7fad424 to 703c98f Compare March 6, 2023 22:04
@irfansharif irfansharif requested a review from a team March 6, 2023 22:04
@irfansharif
Copy link
Contributor Author

(Added a unit test that fails before these changes.)

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something makes your test sad under race. Maybe skip it?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks, @arulajmani, and @nvanbenschoten)

In an internal support issue[1] we observed that a mixed-version cluster
straddling 21.2 and 22.1 (where span configs were first introduced)
with a 5x replication factor, the 22.1 nodes were incorrectly
down replicating ranges towards 3x replication. This happened because we
were missing requisite version gates in KV code when deciding what
"source" to use for span configs (choosing between the gossiped
system config span or the span configs infrastructure introduced in
\cockroachdb#67679). The 22.1 nodes were using a fallback, statically hard coded
config with 3x replication factor.

Release note (bug fix): In mixed version clusters running 21.2 and 22.1
nodes, it was possible for CockroachDB to not respect zone configs. This
manifested in a few ways:
- If num_replicas was set to something other than 3, we would still
  add or remove replicas to get to 3x replication.
  - If num_voters was set explicitly to get a mix of voting and
    non-voting replicas, it would be ignored. CockroachDB could possible
    remove non-voting replicas.
- If range_min_bytes or range_max_bytes were changed from 128 MiB and
  512 MiB respectively, we would instead try to size ranges to be within
  [128 MiB, 512MiB]. This could appear as an excess amount of range
  splits or merges, as visible in the Replication Dashboard under "Range
  Operations".
- If gc.ttlseconds was set to something other than 90000 seconds, we
  would still GC data only older than 90000s/25h. If the GC TTL was set
  to something larger than 25h, AOST queries going further back may now
  start failing. For GC TTLs less than the 25h default, clusters would
  observe increased disk usage due to more retained garbage.
- If constraints, lease_preferences or voter_constraints were set, they
  would be ignored. Range data and leases would possibly be moved
  outside where prescribed.
This issues only lasted during the mixed-version state, where the
cluster was not finalized. Finalization of the 22.1 version happens
automatically when all nodes in the cluster are running 22.1. The only
exception to this is if users have set
cluster.preserve_downgrade_option, where auto-finalization is allowed to
proceed once they issue a 'RESET CLUSTER SETTING
cluster.preserve_downgrade_option', or 'SET CLUSTER SETTING version =
22.1' explicitly. Cluster version finalization is better documented on
https://www.cockroachlabs.com/docs/v22.1/upgrade-cockroach-version.

[1]: https://github.com/cockroachlabs/support/issues/2104
@irfansharif irfansharif force-pushed the 230306.missing-version-gates branch from 703c98f to 910aff4 Compare March 7, 2023 20:39
Copy link
Contributor Author

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. It just timed out since we're using too many nodes in the test (could instead write it to assert on gc.ttlseconds, but I'm being lazy for this backport-only PR).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks, @arulajmani, and @nvanbenschoten)

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: assuming you can make CI happy

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @AlexTalks, @arulajmani, and @nvanbenschoten)

@irfansharif irfansharif deleted the 230306.missing-version-gates branch May 29, 2023 18:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants