Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-22.2: gossip: reintroduce GossipClientsKey gossip #103788

Merged

Conversation

erikgrinaker
Copy link
Contributor

@erikgrinaker erikgrinaker commented May 23, 2023

In 2114678, we stopped gossiping the connectivity graph via GossipClientsKey. However, this broke backwards compatibility with cockroach node status which relied on this key to determine is_live liveness.

This patch partially reverts that change, by reintroducing gossip of these keys in mixed 22.1/22.2 clusters, but otherwise not using them for anything. We stop gossiping them in finalized 22.2 clusters. This breaks backwards compatibility with <22.2.3, we'll just have to live with that.

Touches #89613.
Touches #51838.

Epic: none
Release note (bug fix): Fixed a bug where cockroach node status could incorrectly report nodes as is_live = false in mixed 22.1/22.2 clusters. The bug still exists between 22.2 patch versions before and after 22.2.3.

@erikgrinaker erikgrinaker requested a review from irfansharif May 23, 2023 15:17
@erikgrinaker erikgrinaker self-assigned this May 23, 2023
@erikgrinaker erikgrinaker requested a review from a team as a code owner May 23, 2023 15:17
@blathers-crl
Copy link

blathers-crl bot commented May 23, 2023

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport, has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

@blathers-crl
Copy link

blathers-crl bot commented May 23, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm reminding myself of the history in #89613. I'm not sure this is worth fixing given the CPU overhead this is introducing in turn, but perhaps we could just publish a release note calling it out instead.

  • We know exactly the 22.2.x/23.1 releases where is_live is a false positive, and those users can use newer patch releases of 22.2.x while still in their mixed version state and be fine.
  • 22.1.x is out of the maintenance window tomorrow (2023-05-24, https://www.cockroachlabs.com/docs/releases/release-support-policy.html). This is_live false positive, given it happens with 22.1 clusters present, could be spun as as a known issue in their release that's now out of support but is fixed in 22.2.3+.

What do you think?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)

@erikgrinaker
Copy link
Contributor Author

I'm inclined to agree. We could version gate this such that we stop gossiping in finalized 22.2 clusters while retaining 22.1 compatibility, and live with the <22.2.3 incompatibility.

@erikgrinaker erikgrinaker force-pushed the 22.2-revert-gossip-key branch from dc25d5c to 5e935c7 Compare May 24, 2023 07:46
@blathers-crl blathers-crl bot requested a review from irfansharif May 24, 2023 07:46
@erikgrinaker
Copy link
Contributor Author

I gated this on V22_2, such that we stop gossiping these keys in finalized 22.2 clusters.

@erikgrinaker erikgrinaker force-pushed the 22.2-revert-gossip-key branch from 5e935c7 to 158b09f Compare May 24, 2023 08:19
@irfansharif irfansharif added the O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs label May 24, 2023
In 2114678, we stopped gossiping the connectivity graph via
`GossipClientsKey`. However, this broke backwards compatibility with
`cockroach node status` which relied on this key to determine `is_live`
liveness.

This patch partially reverts that change, by reintroducing gossip of
these keys in mixed 22.1/22.2 clusters, but otherwise not using them for
anything. We stop gossiping them in finalized 22.2 clusters, since the
cost of gossiping these is significant in large clusters (which
motivated their removal in the first place). This breaks backwards
compatibility with <22.2.3, we'll just have to live with that.

Epic: none
Release note (bug fix): Fixed a bug where `cockroach node status` could
incorrectly report nodes as `is_live = false` in mixed 22.1/22.2
clusters. The bug still exists between 22.2 patch versions before and
after 22.2.3.
@erikgrinaker erikgrinaker force-pushed the 22.2-revert-gossip-key branch from 158b09f to a6ee652 Compare May 25, 2023 08:27
@erikgrinaker
Copy link
Contributor Author

Verified the fix in a mixed-version clusters. TFTR!

@erikgrinaker erikgrinaker merged commit c4c44b4 into cockroachdb:release-22.2 May 25, 2023
@erikgrinaker erikgrinaker deleted the 22.2-revert-gossip-key branch May 30, 2023 12:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants