Fix: VTOrc forgetting old instances #12089
Merged
…blet Signed-off-by: Manan Gupta <[email protected]>
Signed-off-by: Manan Gupta <[email protected]>
GuptaManan100 requested review from deepthi, shlomi-noach and rsajwani as code owners January 13, 2023 09:41
vitess-bot (bot) added the NeedsDescriptionUpdate label (The description is not clear or comprehensive enough, and needs work) Jan 13, 2023
Review Checklist
Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.
General
If a new flag is being introduced:
If a workflow is added or modified:
Bug fixes
Non-trivial changes
New/Existing features
Backward compatibility
GuptaManan100 removed the NeedsDescriptionUpdate (The description is not clear or comprehensive enough, and needs work) and NeedsWebsiteDocsUpdate (What it says) labels Jan 13, 2023
GuptaManan100 commented Jan 13, 2023
Comment on lines +180 to +181
t.Run("change the port and call refreshTabletsInKeyspaceShard again", func(t *testing.T) {
	defer func() {
Prior to these changes, the newly added test would fail, reporting that the vitess_tablet table had 4 tablets instead of the expected 3.
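The 4-versus-3 count in this comment can be reproduced in miniature. The sketch below is not VTOrc code: it models the vitess_tablet table as an in-memory map and assumes that rows are keyed by hostname and port and that saving a tablet behaves like a REPLACE INTO (a conflicting row is dropped before the new one is written). The names tabletRow, table, replaceInto, and countAfterPortChange are hypothetical and exist only for illustration.

```go
package main

import "fmt"

// tabletRow is a hypothetical, simplified stand-in for one row of vitess_tablet.
type tabletRow struct {
	Alias    string
	Hostname string
	Port     int
}

// table models the stored rows, keyed by "hostname:port" (assumed primary key).
// uniqueAlias toggles an additional unique constraint on the alias column.
type table struct {
	rows        map[string]tabletRow
	uniqueAlias bool
}

func newTable(uniqueAlias bool) *table {
	return &table{rows: make(map[string]tabletRow), uniqueAlias: uniqueAlias}
}

// replaceInto mimics REPLACE INTO semantics (an assumption of this sketch):
// any row that conflicts on a unique key is removed before the new row is written.
func (t *table) replaceInto(r tabletRow) {
	if t.uniqueAlias {
		for key, existing := range t.rows {
			if existing.Alias == r.Alias {
				delete(t.rows, key) // deleting while ranging over a map is safe in Go
			}
		}
	}
	// A row with the same hostname:port is overwritten by the map assignment.
	t.rows[fmt.Sprintf("%s:%d", r.Hostname, r.Port)] = r
}

// countAfterPortChange saves three tablets, then saves the first one again
// with a different port, and returns how many rows the table holds.
func countAfterPortChange(uniqueAlias bool) int {
	t := newTable(uniqueAlias)
	t.replaceInto(tabletRow{Alias: "zone1-100", Hostname: "host1", Port: 3306})
	t.replaceInto(tabletRow{Alias: "zone1-101", Hostname: "host2", Port: 3306})
	t.replaceInto(tabletRow{Alias: "zone1-102", Hostname: "host3", Port: 3306})
	// The same tablet comes back on a new port.
	t.replaceInto(tabletRow{Alias: "zone1-100", Hostname: "host1", Port: 3307})
	return len(t.rows)
}

func main() {
	fmt.Println(countAfterPortChange(false)) // 4: the stale row survives
	fmt.Println(countAfterPortChange(true))  // 3: the stale row is replaced
}
```

Without the unique alias constraint the stale host1:3306 row stays alongside the new host1:3307 row, which matches the 4 tablets the test reported before the fix.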
deepthi approved these changes Jan 13, 2023
Signed-off-by: Manan Gupta <[email protected]>
rsajwani approved these changes Jan 17, 2023
timvaillancourt pushed a commit to slackhq/vitess that referenced this pull request Sep 11, 2023
* test: add a failing test for the case where the port changes for a tablet
* feat: fix the issue by adding alias as a unique field
* empty-commit
Signed-off-by: Manan Gupta <[email protected]>
timvaillancourt pushed a commit to slackhq/vitess that referenced this pull request Mar 15, 2024
timvaillancourt added a commit to slackhq/vitess that referenced this pull request May 16, 2024
* VTOrc running PRS when database_instance empty bug fix. (vitessio#12019)
  * feat: convert join with database_instance to a left join and prevent fixes from running if the information from database_instance is unavailable
  * test: add tests to verify the fix works
* Timeout Fixes and VTOrc Improvement (vitessio#11881)
  * refactor: move tests out of newfeaturestest so that they run on upgrade-downgrade tests too
  * feat: add failing ers test for handling multiple vttablet failures with default values of flags
  * feat: add a new lock-timeout flag and use that instead of remote-operation-timeout
  * feat: augment DownPrimary test to reproduce the issue of VTOrc not handling multiple failures
  * feat: remove LockShardTimeout configuration from VTOrc and add parallelism to refresh of tablets
  * log: add more logging lines around ers in vtorc
  * test: get the test to work
  * feat: fix usage of wait for replicas timeout
  * test: fix flags expected output
  * test: fix race in test now that the function is called in parallel multiple times
  * feat: fix default of onCloseTimeout to 1 second
  * test: add failing unit test to refreshTabletsInKeyspaceShard
  * feat: fix vtorc to not forget a tablet which has been deleted
  * feat: fix backward compatibility, add tests and release notes docs
  * test: fix flags output
  * test: use disable-replication-manager instead of disable-active-reparents to allow vttablets to setup replication when restarted
  * test: fix flaky test by not checking for an error
  * feat: handle the case of empty hostname in tablet initialization
  * feat: update onclose timeout to 10 seconds
  * test: fix unit test
  * feat: address review comments
  * docs: add comments explaining the test functions
  * feat: add summary docs for 'lock-shard-timeout' deprecation
* log: also log error in DiscoverInstance when force discovery is specified (vitessio#11936)
* VTOrc Code Cleanup - generate_base, replace cluster_name with keyspace and shard. (vitessio#12012)
  * feat: refactor generate commands of VTOrc to be in a single file
  * refactor: cleanup create table formatting
  * feat: cleanup the usage of IsSQLite and IsMySQL
  * feat: remove unused minimal instance
  * feat: remove unused table cluster_domain_name
  * feat: fix vtorc database to store keyspace and shard instead of cluster
  * feat: remove unused attributes
  * feat: remove unused cluster domain
  * feat: change GetClusterName to GetKeyspaceAndShardName
  * feat: fix insertion into database_instance
  * feat: fix SnapshotTopologies
  * feat: remove inject unseen primary and inject seed
  * feat: remove ClusterName from Instance
  * feat: fix Audit operations
  * feat: add Keyspace and Shard to cluster information to replace ClusterName
  * feat: fix attempt failure detection registeration
  * feat: fix blocked topology recoveries
  * feat: fix topology recovery
  * feat: reading recovery instances
  * feat: fix get replication and analysis
  * feat: fix bug in query
  * test: add tests to check that filtering by keyspace works for APIs
  * feat: remove remaining usages of ClusterName
  * refactor: fix comment explaining sleep in the test
  * feat: add code to prevent filtering just by shard and add tests for it
* Fix insert query of blocked_recovery table in VTOrc (vitessio#12091)
  * feat: add failing test and fix the query of insertion
  * empty-commit
* Fix: VTOrc forgetting old instances (vitessio#12089)
  * test: add a failing test for the case where the port changes for a tablet
  * feat: fix the issue by adding alias as a unique field
  * empty-commit
* Move vtorc from go-sqlite3 to modernc.org/sqlite (vitessio#12214)
  * Move vtorc from go-sqlite3 to modernc.org/sqlite
    This moves vtorc from the go-sqlite3 library that uses CGO, to use modernc.org/sqlite which is a pure Go implementation. vtorc is the only component we have to build with CGO but it's causing pain for releases since we need to build it against an old Linux for linking against glibc. Using modernc.org/sqlite allows for using Go only again and makes all Vitess components buildable without CGO.
    In https://datastation.multiprocess.io/blog/2022-05-12-sqlite-in-go-with-and-without-cgo.html someone ran some basic benchmarks. It shows that the pure Go version can be twice as slow, but the usage of vtorc is very limited and we operate on small datasets, so I think the performance impact purely of a somewhat slower sqlite implementation is negligable. None of this is in a hot query serving path or anything like that, so I have little concern performance wise.
  * Fix error handling in RowToArray
* see if CI passes on v14.0.5 as previous release
* Revert "see if CI passes on v14.0.5 as previous release"
  This reverts commit 53a0e0c.
Signed-off-by: Manan Gupta <[email protected]>
Signed-off-by: Dirkjan Bussink <[email protected]>
Signed-off-by: Tim Vaillancourt <[email protected]>
Co-authored-by: Manan Gupta <[email protected]>
Co-authored-by: Dirkjan Bussink <[email protected]>
Description
This PR fixes the bug described in #12088.
The problem was introduced in #11881, which changed the forgetting logic. Earlier, we would compare the hostname and port of the tablets already stored with the ones read from the tablet records and remove the ones not available. A peculiar situation arises when we have a tablet whose hostname and port information is cleared out due to a restart or a reschedule on a different pod. We don't want to remove the tablet record entirely, as this can change the quorum that VTOrc sees. This was fixed in #11881 by changing the forgetting logic to compare the tablet aliases instead of the hostname and port.
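The difference between the two forgetting strategies described above can be illustrated with a small sketch. This is not the VTOrc implementation: the tablet type and the keepByHostPort, keepByAlias, and scenario helpers are hypothetical names, and the scenario assumes that a restarted tablet's record is read back with an empty hostname and a zero port.

```go
package main

import "fmt"

// tablet is a hypothetical, simplified view of a tablet record.
type tablet struct {
	Alias    string
	Hostname string
	Port     int
}

func hostPort(t tablet) string {
	return fmt.Sprintf("%s:%d", t.Hostname, t.Port)
}

// keepByHostPort models the older logic: a stored tablet is kept only if
// its hostname and port still appear among the freshly read records.
func keepByHostPort(stored, latest []tablet) []tablet {
	seen := make(map[string]bool, len(latest))
	for _, t := range latest {
		seen[hostPort(t)] = true
	}
	var kept []tablet
	for _, t := range stored {
		if seen[hostPort(t)] {
			kept = append(kept, t)
		}
	}
	return kept
}

// keepByAlias models the logic after #11881: a stored tablet is kept as
// long as its alias still appears among the freshly read records.
func keepByAlias(stored, latest []tablet) []tablet {
	seen := make(map[string]bool, len(latest))
	for _, t := range latest {
		seen[t.Alias] = true
	}
	var kept []tablet
	for _, t := range stored {
		if seen[t.Alias] {
			kept = append(kept, t)
		}
	}
	return kept
}

// scenario returns three stored tablets and the records read after
// zone1-101 restarted and had its hostname and port cleared (assumed values).
func scenario() (stored, latest []tablet) {
	stored = []tablet{
		{Alias: "zone1-100", Hostname: "host1", Port: 3306},
		{Alias: "zone1-101", Hostname: "host2", Port: 3306},
		{Alias: "zone1-102", Hostname: "host3", Port: 3306},
	}
	latest = []tablet{
		{Alias: "zone1-100", Hostname: "host1", Port: 3306},
		{Alias: "zone1-101", Hostname: "", Port: 0},
		{Alias: "zone1-102", Hostname: "host3", Port: 3306},
	}
	return stored, latest
}

func main() {
	stored, latest := scenario()
	fmt.Println(len(keepByHostPort(stored, latest))) // 2: zone1-101 is forgotten
	fmt.Println(len(keepByAlias(stored, latest)))    // 3: the shard still has all tablets
}
```

Comparing by alias keeps the restarted tablet in view, so the set of tablets VTOrc counts toward the quorum does not shrink. It also means a stale row with a matching alias is never forgotten.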
This, however, created a different issue: we don't delete the old records corresponding to the same tablet! When a tablet gets rescheduled, we don't delete the old record, since its alias still matches one of those we read from the tablet records. The new record is also inserted, and we end up with 2 records for the same tablet in the vitess_tablet table, one old and one new.
This PR fixes the problem by adding alias as an explicit column and making it a unique key. So now, when we insert the new record into the table, the old one automatically disappears.
I would actually like to go a step further and make alias not just a unique key but also the primary key. We should also rework the InstanceKey field to be the alias instead of identifying the tablets by their hostname and port. However, this isn't in the scope of the bug fix, and I'll make these changes in a separate PR.
Related Issue(s)
Checklist
Deployment Notes