Fix: VTOrc forgetting old instances #12089

Merged
GuptaManan100 merged 3 commits into vitessio:main from vtorc-forget-fix
Jan 18, 2023

Conversation

GuptaManan100
@GuptaManan100 GuptaManan100 commented Jan 13, 2023

Description

This PR fixes the bug described in #12088.
The problem was introduced in #11881, which changed the forgetting logic. Previously, we compared the hostname and port of the tablets already stored against the ones read from the tablet records, and removed the stored ones that were no longer present. A peculiar situation arises when a tablet's hostname and port information is cleared out because it was restarted or rescheduled on a different pod. We don't want to remove that tablet's record entirely, as this can change the quorum that VTOrc sees. #11881 fixed this by changing the forgetting logic to compare tablet aliases instead of the hostname and port.
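
As a rough illustration of the two comparison strategies described above, here is a minimal in-memory sketch. It is not VTOrc's actual code: the `tablet` struct, the function names, and the aliases such as `zone1-101` are all hypothetical and exist only to show why comparing by hostname and port forgets a tablet whose address was cleared, while comparing by alias keeps it.

```go
package main

import "fmt"

// tablet is a simplified, hypothetical stand-in for a tablet record.
type tablet struct {
	Alias    string
	Hostname string
	Port     int
}

func hostPort(t tablet) string { return fmt.Sprintf("%s:%d", t.Hostname, t.Port) }

// forgetByHostPort returns the aliases of stored tablets whose
// hostname:port does not appear in the latest read (the older approach).
func forgetByHostPort(stored, latest []tablet) []string {
	seen := map[string]bool{}
	for _, t := range latest {
		seen[hostPort(t)] = true
	}
	var forgotten []string
	for _, t := range stored {
		if !seen[hostPort(t)] {
			forgotten = append(forgotten, t.Alias)
		}
	}
	return forgotten
}

// forgetByAlias returns the aliases of stored tablets whose alias
// does not appear in the latest read (the approach after #11881).
func forgetByAlias(stored, latest []tablet) []string {
	seen := map[string]bool{}
	for _, t := range latest {
		seen[t.Alias] = true
	}
	var forgotten []string
	for _, t := range stored {
		if !seen[t.Alias] {
			forgotten = append(forgotten, t.Alias)
		}
	}
	return forgotten
}

func main() {
	stored := []tablet{
		{"zone1-100", "host-a", 3306},
		{"zone1-101", "host-b", 3306},
	}
	// zone1-101 restarted and its hostname and port were cleared.
	latest := []tablet{
		{"zone1-100", "host-a", 3306},
		{"zone1-101", "", 0},
	}
	fmt.Println(forgetByHostPort(stored, latest)) // [zone1-101]
	fmt.Println(forgetByAlias(stored, latest))    // []
}
```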

This, however, created a different issue: we no longer delete old records corresponding to the same tablet! When a tablet gets rescheduled, its old record is not deleted because its alias still matches one of the aliases read from the tablet records. The new record is inserted as well, so we end up with two records for the same tablet in the vitess_tablet table, one old and one new.
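
The duplicate can be reproduced with a small in-memory model. This is only a sketch under stated assumptions: the `store` type below is hypothetical and models a table whose rows are identified by hostname and port alone, with forgetting done by alias; it does not mirror the real vitess_tablet schema or VTOrc's queries.

```go
package main

import "fmt"

// tablet is a simplified, hypothetical stand-in for a tablet record.
type tablet struct {
	Alias    string
	Hostname string
	Port     int
}

// store is a hypothetical model of a table whose rows are
// identified only by hostname:port.
type store map[string]tablet

func key(t tablet) string { return fmt.Sprintf("%s:%d", t.Hostname, t.Port) }

// refresh saves every tablet from the latest read, then forgets
// stored rows whose alias no longer appears in that read.
func (s store) refresh(latest []tablet) {
	aliases := map[string]bool{}
	for _, t := range latest {
		s[key(t)] = t
		aliases[t.Alias] = true
	}
	for k, t := range s {
		if !aliases[t.Alias] {
			delete(s, k)
		}
	}
}

// countAlias reports how many rows carry the given alias.
func (s store) countAlias(alias string) int {
	n := 0
	for _, t := range s {
		if t.Alias == alias {
			n++
		}
	}
	return n
}

func main() {
	s := store{}
	s.refresh([]tablet{{"zone1-100", "host-a", 3306}, {"zone1-101", "host-b", 3306}})
	// zone1-101 is rescheduled and comes back on a different port.
	s.refresh([]tablet{{"zone1-100", "host-a", 3306}, {"zone1-101", "host-b", 3307}})
	// The stale row survives because its alias is still present.
	fmt.Println(len(s), s.countAlias("zone1-101")) // 3 2
}
```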

This PR fixes the problem by adding alias as an explicit column and making it a unique key. Now, when we insert the new record into the table, the old record with the same alias is automatically removed.
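
The effect of a unique key on alias can be sketched with the same kind of in-memory model. This is an assumption-laden illustration, not the PR's schema or query: the `store` type, its `byAlias` index, and the `save` method are hypothetical, and they only imitate a table where writing a row that collides on a unique key replaces the existing row.

```go
package main

import "fmt"

// tablet is a simplified, hypothetical stand-in for a tablet record.
type tablet struct {
	Alias    string
	Hostname string
	Port     int
}

// store is a hypothetical model of a table keyed by hostname:port
// that also enforces uniqueness on alias.
type store struct {
	rows    map[string]tablet // primary key: hostname:port
	byAlias map[string]string // unique index: alias -> primary key
}

func newStore() *store {
	return &store{rows: map[string]tablet{}, byAlias: map[string]string{}}
}

func key(t tablet) string { return fmt.Sprintf("%s:%d", t.Hostname, t.Port) }

// save writes t and removes any existing row that collides with it
// on either the primary key or the unique alias index.
func (s *store) save(t tablet) {
	k := key(t)
	// Collision on alias: drop the old row stored under another key.
	if old, ok := s.byAlias[t.Alias]; ok && old != k {
		delete(s.rows, old)
	}
	// Collision on hostname:port: drop the previous alias index entry.
	if prev, ok := s.rows[k]; ok && prev.Alias != t.Alias {
		delete(s.byAlias, prev.Alias)
	}
	s.rows[k] = t
	s.byAlias[t.Alias] = k
}

func main() {
	s := newStore()
	s.save(tablet{"zone1-100", "host-a", 3306})
	s.save(tablet{"zone1-101", "host-b", 3306})
	s.save(tablet{"zone1-102", "host-c", 3306})
	// zone1-101 comes back on a different port; its old row is replaced.
	s.save(tablet{"zone1-101", "host-b", 3307})
	fmt.Println(len(s.rows)) // 3
}
```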

I would actually like to go a step further and make alias not just a unique key but the primary key. We should also rework the InstanceKey field to be the alias instead of identifying tablets by their hostname and port. However, that is outside the scope of this bug fix, and I'll make those changes in a separate PR.

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

@GuptaManan100 GuptaManan100 added Type: Bug Component: VTorc Vitess Orchestrator integration labels Jan 13, 2023
@vitess-bot vitess-bot bot added the NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work label Jan 13, 2023
vitess-bot bot commented Jan 13, 2023

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • If this is a change that users need to know about, please apply the release notes (needs details) label so that merging is blocked unless the summary release notes document is included.
  • If a test is added or modified, there should be documentation on top of the test explaining what the expected behavior is and what the test does.

If a new flag is being introduced:

  • Is it really necessary to add this flag?
  • Flag names should be clear and intuitive (as far as possible)
  • Help text should be descriptive.
  • Flag names should use dashes (-) as word separators rather than underscores (_).

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow should be required, the maintainer team should be notified.

Bug fixes

  • There should be at least one unit or end-to-end test.
  • The Pull Request description should include a link to an issue that describes the bug.

Non-trivial changes

  • There should be some code comments as to why things are implemented the way they are.

New/Existing features

  • Should be documented, either by modifying the existing documentation or creating new documentation.
  • New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • vtctl command output order should be stable and awk-able.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from VTop, if used there.

@vitess-bot vitess-bot bot added the NeedsWebsiteDocsUpdate What it says label Jan 13, 2023
@GuptaManan100 GuptaManan100 removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says labels Jan 13, 2023
Comment on lines +180 to +181
t.Run("change the port and call refreshTabletsInKeyspaceShard again", func(t *testing.T) {
defer func() {
Member Author

Prior to these changes, the newly added test would fail saying that the vitess_tablet table had 4 tablets instead of the expected 3.

Signed-off-by: Manan Gupta <[email protected]>
@GuptaManan100 GuptaManan100 merged commit aa7ab65 into vitessio:main Jan 18, 2023
@GuptaManan100 GuptaManan100 deleted the vtorc-forget-fix branch January 18, 2023 05:37
timvaillancourt pushed a commit to slackhq/vitess that referenced this pull request Sep 11, 2023
* test: add a failing test for the case where the port changes for a tablet

Signed-off-by: Manan Gupta <[email protected]>

* feat: fix the issue by adding alias as a unique field

Signed-off-by: Manan Gupta <[email protected]>

* empty-commit

Signed-off-by: Manan Gupta <[email protected]>

Signed-off-by: Manan Gupta <[email protected]>
timvaillancourt pushed a commit to slackhq/vitess that referenced this pull request Mar 15, 2024
timvaillancourt added a commit to slackhq/vitess that referenced this pull request May 16, 2024
* VTOrc running PRS when database_instance empty bug fix. (vitessio#12019)

* feat: convert join with database_instance to a left join and prevent fixes from running if the information from database_instance is unavailable

Signed-off-by: Manan Gupta <[email protected]>

* test: add tests to verify the fix works

Signed-off-by: Manan Gupta <[email protected]>

Signed-off-by: Manan Gupta <[email protected]>

* Timeout Fixes and VTOrc Improvement (vitessio#11881)

* refactor: move tests out of newfeaturestest so that they run on upgrade-downgrade tests too

Signed-off-by: Manan Gupta <[email protected]>

* feat: add failing ers test for handling multiple vttablet failures with default values of flags

Signed-off-by: Manan Gupta <[email protected]>

* feat: add a new lock-timeout flag and use that instead of remote-operation-timeout

Signed-off-by: Manan Gupta <[email protected]>

* feat: augment DownPrimary test to reproduce the issue of VTOrc not handling multiple failures

Signed-off-by: Manan Gupta <[email protected]>

* feat: remove LockShardTimeout configuration from VTOrc and add parallelism to refresh of tablets

Signed-off-by: Manan Gupta <[email protected]>

* log: add more logging lines around ers in vtorc

Signed-off-by: Manan Gupta <[email protected]>

* test: get the test to work

Signed-off-by: Manan Gupta <[email protected]>

* feat: fix usage of wait for replicas timeout

Signed-off-by: Manan Gupta <[email protected]>

* test: fix flags expected output

Signed-off-by: Manan Gupta <[email protected]>

* test: fix race in test now that the function is called in parallel multiple times

Signed-off-by: Manan Gupta <[email protected]>

* feat: fix default of onCloseTimeout to 1 second

Signed-off-by: Manan Gupta <[email protected]>

* test: add failing unit test to refreshTabletsInKeyspaceShard

Signed-off-by: Manan Gupta <[email protected]>

* feat: fix vtorc to not forget a tablet which has been deleted

Signed-off-by: Manan Gupta <[email protected]>

* feat: fix backward compatibility, add tests and release notes docs

Signed-off-by: Manan Gupta <[email protected]>

* test: fix flags output

Signed-off-by: Manan Gupta <[email protected]>

* test: use disable-replication-manager instead of disable-active-reparents to allow vttablets to setup replication when restarted

Signed-off-by: Manan Gupta <[email protected]>

* test: fix flaky test by not checking for an error

Signed-off-by: Manan Gupta <[email protected]>

* feat: handle the case of empty hostname in tablet initialization

Signed-off-by: Manan Gupta <[email protected]>

* feat: update onclose timeout to 10 seconds

Signed-off-by: Manan Gupta <[email protected]>

* test: fix unit test

Signed-off-by: Manan Gupta <[email protected]>

* feat: address review comments

Signed-off-by: Manan Gupta <[email protected]>

* docs: add comments explaining the test functions

Signed-off-by: Manan Gupta <[email protected]>

* feat: add summary docs for 'lock-shard-timeout' deprecation

Signed-off-by: Manan Gupta <[email protected]>

Signed-off-by: Manan Gupta <[email protected]>

* log: also log error in DiscoverInstance when force discovery is specified (vitessio#11936)

Signed-off-by: Manan Gupta <[email protected]>

Signed-off-by: Manan Gupta <[email protected]>

* VTOrc Code Cleanup - generate_base, replace cluster_name with keyspace and shard. (vitessio#12012)

* feat: refactor generate commands of VTOrc to be in a single file

Signed-off-by: Manan Gupta <[email protected]>

* refactor: cleanup create table formatting

Signed-off-by: Manan Gupta <[email protected]>

* feat: cleanup the usage of IsSQLite and IsMySQL

Signed-off-by: Manan Gupta <[email protected]>

* feat: remove unused minimal instance

Signed-off-by: Manan Gupta <[email protected]>

* feat: remove unused table cluster_domain_name

Signed-off-by: Manan Gupta <[email protected]>

* feat: fix vtorc database to store keyspace and shard instead of cluster

Signed-off-by: Manan Gupta <[email protected]>

* feat: remove unused attributes

Signed-off-by: Manan Gupta <[email protected]>

* feat: remove unused cluster domain

Signed-off-by: Manan Gupta <[email protected]>

* feat: change GetClusterName to GetKeyspaceAndShardName

Signed-off-by: Manan Gupta <[email protected]>

* feat: fix insertion into database_instance

Signed-off-by: Manan Gupta <[email protected]>

* feat: fix SnapshotTopologies

Signed-off-by: Manan Gupta <[email protected]>

* feat: remove inject unseen primary and inject seed

Signed-off-by: Manan Gupta <[email protected]>

* feat: remove ClusterName from Instance

Signed-off-by: Manan Gupta <[email protected]>

* feat: fix Audit operations

Signed-off-by: Manan Gupta <[email protected]>

* feat: add Keyspace and Shard to cluster information to replace ClusterName

Signed-off-by: Manan Gupta <[email protected]>

* feat: fix attempt failure detection registration

Signed-off-by: Manan Gupta <[email protected]>

* feat: fix blocked topology recoveries

Signed-off-by: Manan Gupta <[email protected]>

* feat: fix topology recovery

Signed-off-by: Manan Gupta <[email protected]>

* feat: reading recovery instances

Signed-off-by: Manan Gupta <[email protected]>

* feat: fix get replication and analysis

Signed-off-by: Manan Gupta <[email protected]>

* feat: fix bug in query

Signed-off-by: Manan Gupta <[email protected]>

* test: add tests to check that filtering by keyspace works for APIs

Signed-off-by: Manan Gupta <[email protected]>

* feat: remove remaining usages of ClusterName

Signed-off-by: Manan Gupta <[email protected]>

* refactor: fix comment explaining sleep in the test

Signed-off-by: Manan Gupta <[email protected]>

* feat: add code to prevent filtering just by shard and add tests for it

Signed-off-by: Manan Gupta <[email protected]>

Signed-off-by: Manan Gupta <[email protected]>

* Fix insert query of blocked_recovery table in VTOrc (vitessio#12091)

* feat: add failing test and fix the query of insertion

Signed-off-by: Manan Gupta <[email protected]>

* empty-commit

Signed-off-by: Manan Gupta <[email protected]>

Signed-off-by: Manan Gupta <[email protected]>

* Fix: VTOrc forgetting old instances (vitessio#12089)

* test: add a failing test for the case where the port changes for a tablet

Signed-off-by: Manan Gupta <[email protected]>

* feat: fix the issue by adding alias as a unique field

Signed-off-by: Manan Gupta <[email protected]>

* empty-commit

Signed-off-by: Manan Gupta <[email protected]>

Signed-off-by: Manan Gupta <[email protected]>

* Move vtorc from go-sqlite3 to modernc.org/sqlite (vitessio#12214)

* Move vtorc from go-sqlite3 to modernc.org/sqlite

This moves vtorc from the go-sqlite3 library that uses CGO, to use
modernc.org/sqlite which is a pure Go implementation.

vtorc is the only component we have to build with CGO but it's causing
pain for releases since we need to build it against an old Linux for
linking against glibc.

Using modernc.org/sqlite allows for using Go only again and makes all
Vitess components buildable without CGO.

In
https://datastation.multiprocess.io/blog/2022-05-12-sqlite-in-go-with-and-without-cgo.html
someone ran some basic benchmarks. It shows that the pure Go version can
be twice as slow, but the usage of vtorc is very limited and we operate
on small datasets, so I think the performance impact purely of a
somewhat slower sqlite implementation is negligible.

None of this is in a hot query serving path or anything like that, so I
have little concern performance wise.

Signed-off-by: Dirkjan Bussink <[email protected]>

* Fix error handling in RowToArray

Signed-off-by: Dirkjan Bussink <[email protected]>

---------

Signed-off-by: Dirkjan Bussink <[email protected]>

* see if CI passes on v14.0.5 as previous release

Signed-off-by: Tim Vaillancourt <[email protected]>

* Revert "see if CI passes on v14.0.5 as previous release"

This reverts commit 53a0e0c.

---------

Signed-off-by: Manan Gupta <[email protected]>
Signed-off-by: Dirkjan Bussink <[email protected]>
Signed-off-by: Tim Vaillancourt <[email protected]>
Co-authored-by: Manan Gupta <[email protected]>
Co-authored-by: Dirkjan Bussink <[email protected]>
Labels
Component: VTorc Vitess Orchestrator integration Type: Bug
Development

Successfully merging this pull request may close these issues.

Bug Report: VTOrc doesn't forget old instances
3 participants