implement flow based tablet load balancer #41

Closed
wants to merge 4 commits into from

@demmer (Collaborator) commented Jan 16, 2023

Description

Prototype of a vtgate -> vttablet load balancer that aims to spread the load equally among replicas in a shard, even with mismatched topologies.

Details

The comment on the new TabletBalancer class explains the motivation and the algorithm:

The tabletBalancer probabilistically ranks the list of available tablets in
order of preference to satisfy two high-level goals:

1. Balance the load across the available replicas
2. Prefer a replica in the same cell as the vtgate if possible

In some topologies this is trivial to accomplish by simply preferring tablets in the
local cell, assuming the number of local tablets in each cell is proportional to the
inbound traffic to the vtgates in that cell.

However, for topologies with a relatively small number of tablets in each cell, a simple
affinity algorithm does not effectively balance the load.

As a simple example:

  Given three cells with vtgates and four replicas spread across those cells, where each
  vtgate receives an equal share of the queries: if each vtgate routes only to its local
  cell, the tablets will be unbalanced. The replicas that are alone in their cells (A and B)
  each receive 1/3 of the queries, but the two replicas sharing cell C each receive only 1/6.

  Cell A: 1/3 --> vtgate --> 1/3 => vttablet

  Cell B: 1/3 --> vtgate --> 1/3 => vttablet

  Cell C: 1/3 --> vtgate --> 1/6 => vttablet
                         \-> 1/6 => vttablet
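
The arithmetic in this example can be reproduced with a small sketch. This is not code from the PR: the function name `localOnlyLoad` and its inputs are hypothetical, and it assumes each vtgate cell receives an equal share of queries and splits that share evenly among its local tablets.

```go
package main

import "fmt"

// localOnlyLoad returns, for each tablet, the fraction of all queries it would
// receive if every vtgate routed only to tablets in its own cell.
// Assumptions (not from the PR): vtgateCells has no duplicates, each cell's
// inbound share is 1/len(vtgateCells), and a cell splits its share evenly
// among its local tablets.
func localOnlyLoad(vtgateCells, tabletCells []string) []float64 {
	load := make([]float64, len(tabletCells))
	if len(vtgateCells) == 0 {
		return load
	}
	hasVtgate := make(map[string]bool, len(vtgateCells))
	for _, c := range vtgateCells {
		hasVtgate[c] = true
	}
	tabletsInCell := make(map[string]int)
	for _, c := range tabletCells {
		tabletsInCell[c]++
	}
	inflow := 1.0 / float64(len(vtgateCells))
	for i, c := range tabletCells {
		// Tablets in a cell without a vtgate receive nothing in this model.
		if hasVtgate[c] {
			load[i] = inflow / float64(tabletsInCell[c])
		}
	}
	return load
}

func main() {
	// Three vtgate cells, four replicas: one in A, one in B, two in C.
	load := localOnlyLoad(
		[]string{"A", "B", "C"},
		[]string{"A", "B", "C", "C"},
	)
	for i, l := range load {
		fmt.Printf("tablet %d: %.3f\n", i, l)
	}
}
```

For the topology above, the two tablets in cell C each end up with half the load of the tablets in cells A and B.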

Other topologies that can cause similar pathologies include cases where there may be cells
containing replicas but no local vtgates, and/or cells that have only vtgates but no replicas.

For these topologies, the tabletBalancer can be configured in a mode that proportionally assigns
the output flow to each tablet, regardless of whether or not the topology is balanced. The local
cell is still preferred where possible, but only as long as the global query balance is maintained.

To accomplish this goal, the balancer is optionally configured into balanced mode and is given:

* The list of cells that receive inbound traffic to vtgates
* The local cell where the vtgate exists
* The set of tablets and their cells (learned from discovery)

The model assumes equal probability of a query coming from each cell that has a vtgate, i.e.
traffic is effectively load balanced between the cells with vtgates.

Given that information, the balancer builds a simple model to determine how much query load
would go to each tablet if vtgate only routed to its local cell. Then if any tablets are
unbalanced, it shifts the desired allocation away from the local cell preference in order to
even out the query load.
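
One possible way to compute such a model is sketched below. It is a simplified greedy illustration, not the PR's implementation: `balanceFlows`, its parameters, and the two-pass strategy are assumptions made for this example. Each vtgate cell first keeps as much traffic local as its tablets' fair share allows, then spills any excess to cells that still have room.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// balanceFlows returns alloc[vtgateCell][tabletCell]: the fraction of the
// global query load that vtgates in vtgateCell should send to tablets in
// tabletCell so that every tablet ends up with an equal share.
// Hypothetical sketch: assumes vtgateCells has no duplicates and that each
// vtgate cell receives an equal share of inbound queries.
func balanceFlows(vtgateCells []string, tabletsPerCell map[string]int) map[string]map[string]float64 {
	alloc := make(map[string]map[string]float64)
	total := 0
	for _, n := range tabletsPerCell {
		total += n
	}
	if total == 0 || len(vtgateCells) == 0 {
		return alloc
	}
	inflow := 1.0 / float64(len(vtgateCells))

	// remaining[cell] is the load the cell can still absorb if each of its
	// tablets is to carry exactly 1/total of the global load.
	remaining := make(map[string]float64)
	tabletCells := make([]string, 0, len(tabletsPerCell))
	for c, n := range tabletsPerCell {
		remaining[c] = float64(n) / float64(total)
		tabletCells = append(tabletCells, c)
	}
	// Sort so that every vtgate computes the same answer.
	sort.Strings(tabletCells)
	gates := append([]string(nil), vtgateCells...)
	sort.Strings(gates)

	// Pass 1: prefer the local cell, up to its fair share.
	excess := make(map[string]float64)
	for _, g := range gates {
		alloc[g] = make(map[string]float64)
		local := math.Min(inflow, remaining[g])
		if local > 0 {
			alloc[g][g] = local
		}
		remaining[g] -= local
		excess[g] = inflow - local
	}
	// Pass 2: shift leftover inflow to cells that still have capacity.
	const eps = 1e-12
	for _, g := range gates {
		for _, c := range tabletCells {
			shift := math.Min(excess[g], remaining[c])
			if shift <= eps {
				continue
			}
			alloc[g][c] += shift
			excess[g] -= shift
			remaining[c] -= shift
		}
	}
	return alloc
}

func main() {
	alloc := balanceFlows(
		[]string{"A", "B", "C"},
		map[string]int{"A": 1, "B": 1, "C": 2},
	)
	for _, g := range []string{"A", "B", "C"} {
		fmt.Println(g, alloc[g])
	}
}
```

In the three-cell example, cells A and B each keep 1/4 of the global load local and send 1/12 to cell C, while cell C keeps its full 1/3 local, so each of the four tablets carries 1/4.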

Based on this global model, the vtgate then probabilistically picks a destination for each
query by ordering the available tablets accordingly.
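
A weighted random ordering of this kind could look roughly like the sketch below. The type `weightedTablet`, the function `weightedOrder`, and the tablet aliases are hypothetical names for illustration, and the PR may order tablets differently. The random source is passed in as a function so the behavior can be made deterministic in tests.

```go
package main

import (
	"fmt"
	"math/rand"
)

// weightedTablet pairs a tablet alias with the share of this vtgate's
// traffic that the model wants to send to it (hypothetical type).
type weightedTablet struct {
	Alias  string
	Weight float64
}

// weightedOrder returns the tablet aliases in a random order. At each
// position, one of the remaining tablets is picked with probability
// proportional to its weight. next must return values in [0, 1).
// Zero-weight tablets are normally placed after all weighted ones.
func weightedOrder(next func() float64, tablets []weightedTablet) []string {
	// Copy so the caller's slice is not modified.
	pool := append([]weightedTablet(nil), tablets...)
	order := make([]string, 0, len(pool))
	for len(pool) > 0 {
		sum := 0.0
		for _, t := range pool {
			sum += t.Weight
		}
		// Fallback to the last entry if rounding (or an all-zero pool)
		// leaves x non-negative after the scan.
		pick := len(pool) - 1
		x := next() * sum
		for i, t := range pool {
			x -= t.Weight
			if x < 0 {
				pick = i
				break
			}
		}
		order = append(order, pool[pick].Alias)
		pool = append(pool[:pick], pool[pick+1:]...)
	}
	return order
}

func main() {
	// Weights as seen by a vtgate in cell A for the three-cell example:
	// 3/4 to the local tablet, 1/8 to each tablet in cell C.
	tablets := []weightedTablet{
		{"cell_a-100", 0.75},
		{"cell_c-300", 0.125},
		{"cell_c-301", 0.125},
	}
	fmt.Println(weightedOrder(rand.Float64, tablets))
}
```

The first entry in the returned order is the preferred destination, and the rest serve as fallbacks if that tablet cannot take the query.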

Assuming each vtgate is configured with and discovers the same information about the topology,
each should come to the same conclusion about the global flows, and together they should
converge on the desired balanced query load.

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

@github-actions

This PR is being marked as stale because it has been open for 30 days with no activity. To rectify, you may do any of the following:

  • Push additional commits to the associated branch.
  • Remove the stale label.
  • Add a comment indicating why it is not stale.

If no action is taken within 7 days, this PR will be closed.

@github-actions github-actions bot added the Stale label Mar 12, 2023
@github-actions

This PR was closed because it has been stale for 7 days with no activity.

@github-actions github-actions bot closed this Mar 20, 2023
@demmer demmer mentioned this pull request Apr 19, 2024