-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tablet picker cell alias fallback with local cell preference #11771
Conversation
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
If a new flag is being introduced:
If a workflow is added or modified:
Bug fixes
Non-trivial changes
New/Existing features
Backward compatibility
|
We should add @rohit-nayak-ps @mattlord as code owners for tablet_picker.go and vstream_manager.go. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for working on this! ❤️ The core of it seems good but the user interface is unclear to me.
When you say “a vtgate’s cell alias” what do you mean by that? Do you just mean the vtgate process's cell?
I was imagining, user would specify a vtgate flag of ~ --vstream-prefer-local-cell
and then if the vtgate’s local cell was one of those specified in the vstream gRPC call then we’d prefer any of the tablets in that cell.
So then if you start a vtgate this way: vtgate --tablet_types="in_order:replica,primary" --vstream-prefer-local-cell
we should be able to stream from tablets in the same cell as the vtgate. It seems that instead you were thinking that in the vstream gRPC the user would add a prefix to the cells of localPreference:<local cell, which can be an alias>
but that feels unintuitive to me. Maybe I'm missing some things though.
I’m not sure where Cell alias comes into play, which e.g. could be zone1,zone2,zone3 (a common alias in the past was all cells). Maybe you have N cells per data center, and an alias is for all of those local cells? Or maybe alias is errantly thrown in there?
…icker Signed-off-by: Priya Bibra <[email protected]>
Signed-off-by: Priya Bibra <[email protected]>
Signed-off-by: Priya Bibra <[email protected]>
Signed-off-by: Priya Bibra <[email protected]>
2e9c1c5
to
53ac9fd
Compare
Thanks for the detailed use case and the PR. I wanted to bring up another aspect: the tablet picker is also used by the standard VReplication flows. Prefering the local cell when multiple cells are provided will be useful for these flows as well. The current implementation forces the user to choose between:
The main issue is that the picker doesn't know the default cell of the caller today. One option is to update the tablet picker api, so that we can pass that in and add picker strategy options. The options can be used instead of the "local:" hint (and we will also deprecate the "in_order:" hint for tablet types replacing it with an option). For cell selection the options would be:
I am ok with combining the two defaults and providing the functionality of the ExtendedDefault as the Default. But it will be a breaking change and I will need to validate internally if that choice is acceptable from a backward compatibility viewpoint. Since I am joining this discussion late, I hope I have understood your requirement correctly. I know you have already had a call with @mattlord, @pbibra , but if it will help we can have another one. I was thinking of including the requirements of #11579 as well in the refactor @HenryCaiHaiying. Regarding implementation, since it impacts other vreplication workflows as well, if you prefer, I can work on the tablet picker refactor this week. Your PR can then use that version of the tablet picker: by specifying vtgate's cell as the local cell and any strategy override. The VStream API's VStreamFlags can be used to specify any strategy override. Users with older Debezium will end up using the default behavior in their Vitess cluster. Let us know what you think. |
Hi @rohit-nayak-ps! Thanks for the additional context. Looks like modifying the tablet picker is the better long term change, but I think @HenryCaiHaiying might have some concerns because we'd like this update to be in our v13 release. Modifying tablet picker might make that backport tougher. Let's do a huddle tomorrow and I will start a chat to discuss this in Slack. As for the options, I think having Default and ExtendedDefault be separate is completely fine as long as we can specify these server side in the VStream use case and not have to modify the client side gRPC request. Thank you! |
Started a discussion here: https://vitess.slack.com/archives/C04DCRJS6QZ/p1669761282103889 |
Signed-off-by: Priya Bibra <[email protected]>
Signed-off-by: Priya Bibra <[email protected]>
@pbibra, code changes look good. Sorry, in my previous review, I missed a doc change that needs to be part of this PR too. Similar to this update in an other PR: https://github.com/vitessio/vitess/pull/11874/files#diff-ccb9bf989b7450df22db2bf4bc687668cb4046e1974dd7d5e5e12b2b54a29de3. We need to update |
Signed-off-by: Priya Bibra <[email protected]>
Signed-off-by: pbibra <[email protected]>
Signed-off-by: Priya Bibra <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice work, lgtm!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mostly had some minor comments/nits/suggestions, but I also have some larger concerns about tablet selection and the tablet picker behavior. I'm happy to chat about them via Slack or Zoom this week. Thanks! ❤️
// 2 scenarios: | ||
// | ||
// 1. No cells specified by the client via the gRPC request. | ||
// Tablets from the local cell of the VTGate AND the cell alias that this cell belongs to will be selected by default |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think we're missing a period at the end of this line (we can also split it so that it aligns with the other newlines/line lengths).
// Local cell will take precedence. | ||
// | ||
// 2. Cells are specified by the client via the gRPC request | ||
// These cels will take precendence over the default local cell and its alias |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cels->cells
In [PR 11771](https://github.com/vitessio/vitess/pull/11771) we allow for default cell alias fallback during tablet selection for VStreams when client | ||
does not specify list of cells. In addition, we add the option for local cell preference during tablet selection. | ||
The local cell preference takes precedence over tablet type.See PR description for examples. If a client wants to specify local cell preference in the gRPC request, | ||
they can pass in a new "local:" tag with the rest of the cells under VStreamFlags. e.g. "local:,cella,cellb". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggested tweaks:
In [PR 11771](https://github.com/vitessio/vitess/pull/11771) we modify the default [TabletPicker](https://vitess.io/docs/16.0/reference/vreplication/tablet_selection/)
behavior during tablet selection for [`VStreams`](https://vitess.io/docs/16.0/concepts/vstream/):
- OLD: look for candidate tablets in the local cell
- NEW: look for candidate tablets in the local cell, if none are found, use the local cell's cell alias — if it has one — as a fallback
In addition, we add support for the `local` notation when the client *does* specify a list of cells, e.g.: `--cells="local:zone1a,zone1b,zone1c"
with `vtctldclient` commands and `VStreamFlags.Cells = "local:zone1a,zone1b,zone1c"` in the
[VStreamFlags](https://pkg.go.dev/vitess.io/vitess/go/vt/proto/vtgate#VStreamFlags) with the vtgate VStream RPC.
The local cell will then always be searched first and takes precedence over any others specified.
See [the PR](https://github.com/vitessio/vitess/pull/11771) description for examples.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please note that we have in-progress VStream gRPC docs in: vitessio/website#1216
Preview: https://deploy-preview-1216--vitess.netlify.app/docs/16.0/reference/vreplication/vstream/
|
||
tcases := []testCase{ | ||
{ | ||
"local preference", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO this would be much easier to read if we had the struct field names, e.g.:
name: "local preference",
...
wantTablets: []uint32{102, 103},
Otherwise the reader needs to keep the struct's field indexes in their head.
@@ -161,7 +161,7 @@ func (tw *TopologyWatcher) loadTablets() { | |||
return | |||
default: | |||
} | |||
log.Errorf("cannot get tablets for cell: %v: %v", tw.cell, err) | |||
log.Errorf("cannot get tablets for cell:%v: %v", tw.cell, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO this would be slightly better:
log.Errorf("cannot get tablets for cell %q: %v", tw.cell, err)
for _, cell := range strings.Split(strings.TrimSpace(vs.optCells), ",") { | ||
for i, cell := range strings.Split(strings.TrimSpace(vs.optCells), ",") { | ||
// if the local tag is passed in, we must give local cell priority | ||
// during tablet selection. Append the VTGate's local cell to the list of cells |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More accurately, I think we're prepending the local cell.
log.Info("No cells provided by client, falling back to local cell and alias...\n") | ||
// append the alias this cell belongs to, otherwise appends the vtgate's cell | ||
alias := topo.GetAliasByCell(ctx, vs.ts, vs.vsm.cell) | ||
// an alias was actually found | ||
if alias != vs.vsm.cell { | ||
// send in the vtgate's cell for local cell preference | ||
cells = append(cells, fmt.Sprintf("local:%s", vs.vsm.cell)) | ||
} | ||
cells = append(cells, alias) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, so this new behavior is also specific to vtgate vstreams at this point. I thought we wanted to change the default behavior of the tablet picker itself so that when no cells are specified then we use the cell alias as a fallback when we find no candidate tablets in the local cell. That's what the docs seemed to suggest and I think that we should do that.
Keep in mind that we need to document this new behavior here: https://vitess.io/docs/16.0/reference/vreplication/tablet_selection/
We should also modify it to note that it applies to vtgate vstreams as well.
P.S. I have updated/corrected that page here in a docs PR: https://deploy-preview-1267--vitess.netlify.app/docs/16.0/reference/vreplication/tablet_selection/
{"default-local", "", false, []string{"aa"}}, | ||
{"default-local-cell-alias", "", true, []string{"local:aa", "region1"}}, | ||
{"with-opt-cells", "local:,bb,cc", true, []string{"local:aa", "bb", "cc"}}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here about using the struct field names to make it more readable.
ctx, cancel := context.WithCancel(context.Background()) | ||
defer cancel() | ||
|
||
cell := "aa" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should make the local cell a test case variable too. That would have likely highlighted the (what I think is) broken local cell correction code and the "lost cell" issue too.
cellsAlias := &topodatapb.CellsAlias{ | ||
Cells: []string{"aa", "bb"}, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's value in using different cell alias definitions too. i.e. making this a test case string slice variable rather than a bool with a single alias definition.
cells = append(cells, strings.TrimSpace(cell)) | ||
} | ||
} | ||
|
||
// if no override provided in gRPC request, perform cell alias fallback | ||
if len(cells) == 0 { | ||
log.Info("No cells provided by client, falling back to local cell and alias...\n") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for an extra newline here as log messages always end with one.
closing in favor of solution here: #11999 |
Signed-off-by: Priya Bibra [email protected]
Description
Allow for cell alias fallback during tablet selection for VStreams when client does not specify list of cells. In addition, add the option for local cell preference during tablet selection.
Overall, this PR is trying to solve 2 problems:
Problem 1: We're trying to avoid passing in the list of cells/cell alias that
TabletPicker
can choose from in the gRPC request itself. Instead, we will now select a candidate from a list of tablets in the VTGate's local cell + any cell within a defined cell alias for the local cell, if one exists.e.g. We have a cell alias called
region-1
which includes the following cells{region-1a, region-1b, region-1c}
A VTGate in
region-1b
sets up a VStream. The gRPCVStreamRequest
object does not have anycells
sent from the client, but a cell alias is defined.As a result
TabletPicker.PickForStreaming()
will choose from candidates not just in the VTGate's local cell, but in all cells within theregion-1
alias without the client having to explicitly specify this.Problem 2: Of all the candidates selected by
TabletPicker
, we need a way to allow it to give priority to the local cell. This is where thelocalPreferenceHint
comes in. Two paths in which this can be specified:a. If no cells specified by the client, send in the list of cells to the TabletPicker as
local:region-1b, region-1
. The first item in the list is the local cell we want to prioritize and the second is the cell alias thatregion-1b
belongs to.b. The second way is to allow the client side to also specify this if they choose to pass in the cells within the gRPC request itself rather than use the new flag. They'd send in
local:,region-1
or if no cell alias, then something likelocal:,region-1a, region-1c
within the VStreamRequest object.In both cases,
TabletPicker
can then order candidates like:tablet-1-region-1b
- PRIMARYtablet-2-region-1b
- REPLICAtablet-1-region-1c
- REPLICAtablet-1-region-1a
- REPLICAwhere
tablet-1-region-1b
would be prioritized since it is in the local cell. Thein_order
hint is then applied on top of this, where we order by tablet type within each "group" (local cells first, then all others). For example, if the tablet type ordering isinOrder:REPLICA,PRIMARY
, then we get the following ordering of candidates:tablet-2-region-1b
- REPLICAtablet-1-region-1b
- PRIMARYtablet-1-region-1c
- REPLICAtablet-1-region-1a
- REPLICATesting
We did some testing on the Slack side and looks like cell alias selection and local cell preference is working as expected:
For a vtgate in
us-east-1b
got the following tablet selection logs:Related Issue(s)
More discussion here: https://vitess.slack.com/archives/C0PQY0PTK/p1668557424770139
Checklist
Deployment Notes