Feature Request: VTOrc should change tablet type of tablets that have errant GTIDs on them #13872

Closed
GuptaManan100 opened this issue Aug 29, 2023 · 6 comments · Fixed by #13873
Labels: Component: VTorc (Vitess Orchestrator integration) · Type: Feature

Comments

@GuptaManan100
Member

Feature Description

Description

We should get VTOrc to change the tablet type of tablets that have errant GTIDs on them, converting them to DRAINED. This way we prevent these tablets from getting promoted down the line and causing a load of problems.

Use Case(s)

If a tablet ends up with errant GTIDs (by whatever means) and we don't remove it from the topology, there is a slight chance that it gets promoted. When that happens, it breaks replication on all the other tablets, leading to downtime. Having VTOrc demote a tablet with errant GTIDs would fix this problem.
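
For context, the underlying check in a nutshell: a replica has errant GTIDs when its executed GTID set contains transactions the primary never executed. Below is a minimal sketch of that idea (illustration only, not VTOrc's actual implementation; the DSNs and the primary-only comparison are simplifying assumptions):

```go
// Minimal sketch (illustration only, not VTOrc's implementation): a replica
// has errant GTIDs when its executed GTID set contains transactions that are
// absent from the primary's executed set.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // MySQL driver
)

// errantGTIDs returns the GTIDs executed on the replica but not on the
// primary. A non-empty result indicates errant transactions.
func errantGTIDs(replica, primary *sql.DB) (string, error) {
	var replicaSet, primarySet, errant string
	if err := replica.QueryRow("SELECT @@GLOBAL.gtid_executed").Scan(&replicaSet); err != nil {
		return "", err
	}
	if err := primary.QueryRow("SELECT @@GLOBAL.gtid_executed").Scan(&primarySet); err != nil {
		return "", err
	}
	// GTID_SUBTRACT(a, b) returns the GTIDs present in a but not in b.
	if err := primary.QueryRow("SELECT GTID_SUBTRACT(?, ?)", replicaSet, primarySet).Scan(&errant); err != nil {
		return "", err
	}
	return errant, nil
}

func main() {
	// Hypothetical DSNs, for illustration only.
	replica, err := sql.Open("mysql", "user:pass@tcp(replica-host:3306)/")
	if err != nil {
		log.Fatal(err)
	}
	primary, err := sql.Open("mysql", "user:pass@tcp(primary-host:3306)/")
	if err != nil {
		log.Fatal(err)
	}
	errant, err := errantGTIDs(replica, primary)
	if err != nil {
		log.Fatal(err)
	}
	if errant != "" {
		fmt.Println("errant GTIDs found:", errant)
	}
}
```

Real detection is more involved than a single pairwise subtraction (in practice more servers and edge cases are considered), but the subtraction above is the core idea.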

@shlomi-noach
Contributor

The risk is with exhausting the entire replica fleet, so that you end up with no REPLICA or RDONLY server at all. I think you must never change the type of the last available replica, because:

  • you'd end up having nothing to promote
  • a single replica is promotable even if it has an errant GTID, because no one else is there to complain about it

@GuptaManan100
Member Author

@shlomi-noach I don't know, I have mixed feelings about that too...

  • I agree it wouldn't be ideal to end up getting rid of all the REPLICA and RDONLY tablets. That can cause the primary to be stuck on semi-sync and essentially take down the entire cluster.
  • At the same time, I don't want to keep a REPLICA tablet with errant GTIDs around just because we have no other tablets. If we do end up promoting that REPLICA and the errant GTIDs are old enough that they have been purged, even the previous primary won't be able to replicate and the cluster would essentially be in the same broken state as before (a rough sketch of that check is below this list). Also, the errant GTIDs would show up in the customers' SELECT queries once we promote the REPLICA, and going back would be troublesome.
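
To make that second point concrete, here is a rough sketch of the check behind it (a hypothetical helper, not VTOrc code; the DSN and GTID set are made up). If any of the errant GTIDs are already in gtid_purged on the candidate, no other server can ever fetch those transactions after a promotion:

```go
// Hypothetical helper (illustration only, not VTOrc code): check whether any
// of a candidate's errant GTIDs have already been purged from its binary
// logs. Purged errant transactions can never be fetched by other servers
// after a promotion, which is the failure mode described above.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // MySQL driver
)

// purgedErrant returns the intersection of the errant GTID set and
// @@GLOBAL.gtid_purged on the candidate, i.e. the errant transactions whose
// binary logs are already gone.
func purgedErrant(candidate *sql.DB, errant string) (string, error) {
	var result string
	// a ∩ b == GTID_SUBTRACT(a, GTID_SUBTRACT(a, b))
	err := candidate.QueryRow(
		"SELECT GTID_SUBTRACT(?, GTID_SUBTRACT(?, @@GLOBAL.gtid_purged))",
		errant, errant,
	).Scan(&result)
	return result, err
}

func main() {
	// Hypothetical DSN and errant GTID set, for illustration only.
	db, err := sql.Open("mysql", "user:pass@tcp(replica-host:3306)/")
	if err != nil {
		log.Fatal(err)
	}
	gone, err := purgedErrant(db, "00000000-0000-0000-0000-000000000001:5-7")
	if err != nil {
		log.Fatal(err)
	}
	if gone != "" {
		fmt.Println("errant GTIDs no longer in the binlogs:", gone)
	}
}
```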

Is there a good way to handle these situations?
I am inclined to say that there are no fixed steps that VTOrc can take in these situations, because the remedy is going to be dependent on the situation.
So, what should we get VTOrc to do? @shlomi-noach @deepthi

@shlomi-noach
Contributor

even the previous primary won't be able to replicate and the cluster would essentially be in the same broken state as before.

Unless you take a backup from this server and use it to seed the rest of the tablets.

I think that the suggested approach is super opinionated and that different OSS users will have different opinions. If you can make this configurable, that's good. I'd tell you that in a production environment, I'd prefer having proper alerting on errant GTID, along with tooling to fix the errant GTID, rather than have some automation purge replicas from my cluster to the point of leaving the PRIMARY all by itself. I feel like that's just too risky.

@GuptaManan100
Member Author

Alright, I think that can be done.

I'll put this functionality of changing the tablet type of tablets with errant GTIDs behind a flag. As far as alerting goes, we already have that, so I think just making this feature optional should be a good addition.

@timvaillancourt
Contributor

timvaillancourt commented Sep 5, 2023

Great feature request! I'm expecting mixed feedback based on use cases here, but adding my perspective below.

I'd tell you that in a production environment, I'd prefer having proper alerting on errant GTID, along with tooling to fix the errant GTID, rather than have some automation purge replicas from my cluster to the point of leaving the PRIMARY all by itself. I feel like that's just too risky.

I (personally) agree with this statement 👍

If a tablet ends up with errant GTIDs (by whatever means) and we don't remove it from the topology, there is a slight chance that it gets promoted. When that happens, it breaks replication on all the other tablets, leading to downtime.

Replication being broken is very bad, but having no REPLICAs to read from at all would probably be worse for a lot of production systems I've worked on. For apps that specifically target REPLICAs, no available replicas would result in hard errors, which could have a higher impact than stale results (due to broken replication). I'd also argue that broken replication isn't necessarily "downtime", as queries should still return (correct me if I'm wrong here).

I think you must never change the type of the last available replica

I feel this approach (keep at least N x REPLICAs regardless of errant GTIDs) could be a happy medium between availability and consistency. It feels similar to the "min replicas" feature for replication lag: lagging replicas are ignored, but not to the point that there is no capacity.
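
To illustrate that middle ground, a hypothetical guard (the names and threshold are made up, not VTOrc's actual API): only drain a replica with errant GTIDs if enough healthy replicas would remain afterwards.

```go
// Hypothetical sketch of a "keep at least N replicas" guard; names and the
// threshold are illustrative, not VTOrc's actual API.
package main

import "fmt"

// shouldDrain reports whether a replica with errant GTIDs may be converted to
// DRAINED without dropping the shard below its replica floor.
func shouldDrain(healthyReplicas, minReplicas int) bool {
	// Converting this tablet removes one serving replica, so only proceed if
	// at least minReplicas would still be left afterwards.
	return healthyReplicas-1 >= minReplicas
}

func main() {
	fmt.Println(shouldDrain(3, 2)) // true: two replicas remain after draining
	fmt.Println(shouldDrain(2, 2)) // false: draining would drop below the floor
}
```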

@shlomi-noach
Contributor

shlomi-noach commented Sep 5, 2023

@timvaillancourt: updating that in #13873 there's a new configurable flag, --convert-tablets-with-errant-gtids (default: false). So it's opt-in behavior.
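
For illustration, an opt-in default like that could gate the action roughly as follows (the flag name comes from the comment above; everything else in this sketch is hypothetical, not VTOrc's actual code):

```go
// Sketch of gating the conversion behind an opt-in flag (default false);
// illustrative only, not VTOrc's actual code.
package main

import (
	"flag"
	"fmt"
)

var convertTabletsWithErrantGTIDs = flag.Bool(
	"convert-tablets-with-errant-gtids", false,
	"if set, tablets with errant GTIDs are changed to DRAINED")

func main() {
	flag.Parse()
	hasErrantGTIDs := true // pretend detection has already run
	if *convertTabletsWithErrantGTIDs && hasErrantGTIDs {
		fmt.Println("would change tablet type to DRAINED")
	} else {
		fmt.Println("leaving tablet type unchanged")
	}
}
```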
