[release-16.0] Upgrade-Downgrade Fix: Schema-initialization stuck on semi-sync ACKs while upgrading (#13411) #13441

Merged


GuptaManan100
Member

Description

This PR is a backport of #13411.

When we introduced the schema-init-db code, we failed to realize that it would start performing writes as part of the code path that changes the tablet type. As part of the PlannedReparentShard (PRS) process, we used to call PromoteReplica first, followed by calls to SetReplicationSource.

When a user upgrades from v16 to v17 (or from v15 to v16), the schema-init code, as part of the PromoteReplica call, detects that there are schema diffs to apply and ends up writing to the database. If semi-sync is enabled, all of these writes block indefinitely, because no replica has yet been reparented to the new primary to acknowledge them. Eventually, PromoteReplica fails, and this fails the entire PRS call.

This PR fixes the issue by altering the PRS flow slightly: we call SetReplicationSource on all the replicas and PromoteReplica on the new primary in parallel. This allows PromoteReplica to be unblocked as soon as any semi-sync capable replica reparents to it.

As part of this PR, the upgrade-downgrade tests for manual backups have also been augmented to use semi-sync and to follow the correct steps to upgrade the cluster, instead of just shutting down all the tablets and restarting them. The test now also calls PlannedReparentShard instead of InitShardPrimary, which has long been deprecated.

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on the CI
  • Documentation was added or is not required

Deployment Notes

…semi-sync ACKs while upgrading (vitessio#13411)

* feat: augment backup upgrade-downgrade test to not remove the entire cluster while upgrading

Signed-off-by: Manan Gupta <[email protected]>

* feat: do not set disable-active-reparents on the vttablets

Signed-off-by: Manan Gupta <[email protected]>

* test: use PRS instead of ISP

Signed-off-by: Manan Gupta <[email protected]>

* feat: fix the problem by running PromoteReplica in parallel with SetReplicationSource

Signed-off-by: Manan Gupta <[email protected]>

* feat: fix the manual next release upgrade-downgrade test too

Signed-off-by: Manan Gupta <[email protected]>

* feat: fix typing errors in comments

Signed-off-by: Manan Gupta <[email protected]>

---------

Signed-off-by: Manan Gupta <[email protected]>
@vitess-bot
Contributor

vitess-bot bot commented Jul 5, 2023

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • If this is a change that users need to know about, please apply the release notes (needs details) label so that merging is blocked unless the summary release notes document is included.
  • If a test is added or modified, there should be documentation on top of the test explaining the expected behavior and what the test does.

If a new flag is being introduced:

  • Is it really necessary to add this flag?
  • Flag names should be clear and intuitive (as far as possible)
  • Help text should be descriptive.
  • Flag names should use dashes (-) as word separators rather than underscores (_).

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow should be required, the maintainer team should be notified.

Bug fixes

  • There should be at least one unit or end-to-end test.
  • The Pull Request description should include a link to an issue that describes the bug.

Non-trivial changes

  • There should be some code comments as to why things are implemented the way they are.

New/Existing features

  • Should be documented, either by modifying the existing documentation or creating new documentation.
  • New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • vtctl command output order should be stable and awk-able.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from VTop, if used there.

@vitess-bot vitess-bot bot added the NeedsDescriptionUpdate, NeedsIssue, and NeedsWebsiteDocsUpdate labels Jul 5, 2023
@GuptaManan100 GuptaManan100 removed the NeedsDescriptionUpdate, NeedsWebsiteDocsUpdate, and NeedsIssue labels Jul 5, 2023
@github-actions github-actions bot added this to the v16.0.3 milestone Jul 5, 2023
@frouioui frouioui merged commit 99d39f9 into vitessio:release-16.0 Jul 5, 2023
@frouioui frouioui deleted the fix-sidecard-stuck-semi-sync-16 branch July 5, 2023 13:20
@hmaurer hmaurer mentioned this pull request Mar 21, 2024
3 participants