VTTablet _vt: mechanism to reach desired _vt schema needs refactoring #10133

rohit-nayak-ps · 2022-04-24T17:50:24Z

Overview of the Issue

TL;DR

Reaching the desired _vt schema is currently fragile and we need a systemic fix, either by making better use of one of the existing methods or by coming up with a new technique.

Motivation

There are two main reasons for this issue:

Rationalization of the process of creating/updating the _vt schema, since it is currently difficult to reason about and causes intermittent bugs/outages during version upgrades.
It is very inefficient. Sidecar schema DDLs are executed in several code paths. Usually they are no-ops since we have already reached the desired schema. However, running these queries does take up a non-trivial amount of time. Since they are not in critical paths today, this does not affect the user. However, if we want to run them all at tablet init, it takes several seconds.

History

Several Vitess modules use the _vt sidecar database. The schema of this database can keep changing as we add more modules or enhance the functionality of a module: new tables can be created or columns created/altered. Existing installations which already contain an older version of the _vt database need to be upgraded automatically to reach the latest schema.

The first module to use _vt was VReplication. We defined an initial schema for the required tables as a list of CREATE TABLE statements. ALTER or CREATE DDLs were appended to it as the schema was upgraded. There was a single entry point, where the vreplication engine was initialized, at which we could run all the statements in order to reach the latest schema.

With time, more entry points into vreplication were created. As a result it was possible, on existing installations, to end up with queries being executed that expected the new schema before the code path which updated the schema was traversed. On some paths the error itself would trigger another code path and "self heal", but more usually we would have irrecoverable errors requiring a manual schema update.

The WithDDL module was added by Online DDL to help fix this. WithDDL is initialized with the same list of DDLs described above. Any query that targeted the _vt database was required to use the WithDDL.Exec() function to benefit from the automatic schema update: if there was an error while executing the query, the entire list of DDLs was first run, and the query retried. So if the error was due to the schema not being up to date, we would automagically apply the required DDLs.
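The retry-on-error idea described above can be sketched roughly as follows. This is a minimal illustration, not the actual WithDDL code: `retryWithDDL`, `execFunc`, `fakeDB` and the `new_col` column are hypothetical names, and the sketch assumes that errors from individual DDLs can simply be ignored.

```go
package main

import (
	"errors"
	"fmt"
)

// execFunc stands in for whatever runs a SQL statement (hypothetical).
type execFunc func(query string) error

// retryWithDDL runs query. If it fails, every DDL is applied in order and
// the query is retried exactly once. DDL errors are ignored here for
// simplicity; that is an assumption of this sketch.
func retryWithDDL(exec execFunc, ddls []string, query string) error {
	if err := exec(query); err == nil {
		return nil
	}
	for _, ddl := range ddls {
		_ = exec(ddl)
	}
	return exec(query)
}

// fakeDB is a toy stand-in for the sidecar database: it only tracks
// whether one (hypothetical) column exists and how often the DDL ran.
type fakeDB struct {
	hasNewCol bool
	ddlRuns   int
}

func (db *fakeDB) exec(query string) error {
	switch query {
	case "ALTER TABLE _vt.vreplication ADD COLUMN new_col INT":
		db.ddlRuns++
		db.hasNewCol = true
		return nil
	case "SELECT new_col FROM _vt.vreplication":
		if !db.hasNewCol {
			return errors.New("unknown column 'new_col'")
		}
		return nil
	}
	return fmt.Errorf("unexpected query: %s", query)
}

func main() {
	db := &fakeDB{}
	ddls := []string{"ALTER TABLE _vt.vreplication ADD COLUMN new_col INT"}
	err := retryWithDDL(db.exec, ddls, "SELECT new_col FROM _vt.vreplication")
	fmt.Println("error:", err, "ddl runs:", db.ddlRuns)
}
```

The sketch also shows the weakness discussed in this issue: the schema is only brought up to date for queries that actually go through the wrapper.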

Instead of refactoring all of VReplication to use WithDDL, a handful of queries in the entry points into VReplication were modified to use WithDDL. As more Vitess functionality started using _vt, more entry points were created. The use of WithDDL was not always enforced, either due to oversight or because contributors were not aware of it.

To preempt such incidents we started adding ad hoc defensive functionality in key code paths. These would access the newest column/table, or trigger an "impossible query", as a proactive method of triggering the required DDLs. At this point we think that we have covered the key entry points into Online DDL and VReplication, but it is obviously a fragile design.

A recent addition, the vtgate schema tracker, used a different method to create its _vt schema. It defined its own list of DDLs and then tried to execute all of them until successful, as part of the initial health check timer ticks. The approach seems intuitively correct; however, it has also been reported to occasionally fail in production clusters and we have not been able to determine why.
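As a rough illustration of the "retry on each tick until successful" approach, here is a minimal sketch. It is not the schema tracker's code: `tickInitializer`, `flakyDB` and the `_vt.example` table are hypothetical, and a plain loop stands in for the health check timer.

```go
package main

import (
	"errors"
	"fmt"
)

// tickInitializer applies a list of DDLs on each tick until one full pass
// succeeds; after that, later ticks are no-ops.
type tickInitializer struct {
	ddls []string
	done bool
}

// onTick is meant to be called once per health check tick.
func (t *tickInitializer) onTick(exec func(ddl string) error) error {
	if t.done {
		return nil
	}
	for _, ddl := range t.ddls {
		if err := exec(ddl); err != nil {
			return err // done stays false, so the next tick retries
		}
	}
	t.done = true
	return nil
}

// flakyDB simulates a database that rejects the first few statements,
// for example because it is not writable yet.
type flakyDB struct {
	failuresLeft int
	calls        int
}

func (db *flakyDB) exec(ddl string) error {
	db.calls++
	if db.failuresLeft > 0 {
		db.failuresLeft--
		return errors.New("database not writable yet")
	}
	return nil
}

func main() {
	db := &flakyDB{failuresLeft: 2}
	ti := &tickInitializer{ddls: []string{"CREATE TABLE IF NOT EXISTS _vt.example (id INT)"}}
	for tick := 1; tick <= 4; tick++ {
		err := ti.onTick(db.exec)
		fmt.Printf("tick %d: done=%v err=%v\n", tick, ti.done, err)
	}
}
```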

Errant GTIDs

To minimize errant GTIDs we want to leverage the super-read-only feature of MySQL (see #10363). However, that is currently difficult because the _vt related DDLs are strewn across the code and can be run at any time in the lifecycle of the vttablet, depending on when certain modules are invoked. For example, vreplication might never get invoked until years after the cluster is created and the process of running a workflow will create/update the vreplication related _vt tables ...

Possible Approaches

This issue has been created to take a fresh look at our current approach and see if we can come up with a cleaner and simpler design.

A summary of the current techniques, and the possibility of extending them:

WithDDL Everywhere: we could go through the code to ensure all vreplication queries use WithDDL. Enforcing/validating this is difficult: maybe a new linter can be written.
WithDDL in entry points: Use the "impossible query" (SELECT _vt_no_such_column__init_schema FROM _vt.vreplication) at all entry points to force WithDDL to run. Keeping track of existing and new entry points is not easy. This is also inefficient: all DDLs will be attempted each time an entry point is traversed.
In the first tick of the health streamer, when the primary starts serving, we can run all the DDLs. However, recently there were a couple of reports where the schema tracker tables (currently using this technique) were not initialized. Also, if semi-sync is enabled and there is no quorum yet, this will hang until other replicas come up.
An orthogonal idea: we can use schemadiff to generate the required ddls rather than defining each ddl. Each module only defines its latest schema and whatever mechanism we decide uses schemadiff to generate the required alters.
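To make the declarative idea in the last item concrete, here is a toy sketch of computing DDLs from a desired definition. It does not use the real schemadiff package or its API: `tableSchema`, `diffTable` and the `_vt.example` table are hypothetical, and the sketch only handles missing tables plus added or changed columns (it never drops anything).

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// tableSchema maps a column name to its definition (toy model).
type tableSchema map[string]string

// diffTable returns the statements needed to bring current up to desired.
// A nil current means the table does not exist yet.
func diffTable(table string, current, desired tableSchema) []string {
	names := make([]string, 0, len(desired))
	for name := range desired {
		names = append(names, name)
	}
	sort.Strings(names) // deterministic output

	if current == nil {
		defs := make([]string, 0, len(names))
		for _, name := range names {
			defs = append(defs, name+" "+desired[name])
		}
		return []string{fmt.Sprintf("CREATE TABLE %s (%s)", table, strings.Join(defs, ", "))}
	}

	var stmts []string
	for _, name := range names {
		cur, ok := current[name]
		switch {
		case !ok:
			stmts = append(stmts, fmt.Sprintf("ALTER TABLE %s ADD COLUMN %s %s", table, name, desired[name]))
		case cur != desired[name]:
			stmts = append(stmts, fmt.Sprintf("ALTER TABLE %s MODIFY COLUMN %s %s", table, name, desired[name]))
		}
	}
	return stmts
}

func main() {
	desired := tableSchema{"id": "INT", "state": "VARBINARY(100)"}
	current := tableSchema{"id": "INT"}
	for _, stmt := range diffTable("_vt.example", current, desired) {
		fmt.Println(stmt)
	}
}
```

With this shape, each module only states its latest schema; when the current schema already matches, the diff is empty and nothing is executed.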

Proposed Solution

A new sidecardb package (working title) will contain both the desired schema, in a declarative form, and the code to reach that desired schema.
We perform the schema update on the first tick of the health streamer. This can be changed to a different point in the tablet init flow as part of the change to use super-read-only on replicas, as referred to in "Task: Add full super_read_only support in Vitess" (#10363).
WithDDL is removed entirely, which ensures that no references to it remain.

Implementation Notes

Each table will be in its own .sql file organized in directories, one directory per module. We will use go:embed to aggregate and order these.
Add artificial comments to the incremental DDLs so that the mock db frameworks used in tests can ignore them.
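The one-file-per-table layout could be loaded along these lines. This is only a sketch: in real code the files would come from an `embed.FS` populated by a `//go:embed` directive, whereas here an in-memory `fstest.MapFS` stands in for it so the example runs without files on disk. The directory names, the placeholder SQL, `tableDef`, `loadSchema` and `exampleFS` are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"io/fs"
	"path"
	"strings"
	"testing/fstest"
)

// tableDef is one table definition read from a .sql file.
type tableDef struct {
	Module string // directory name, one per module
	Table  string // file name without the .sql suffix
	SQL    string
}

// loadSchema walks fsys and collects every .sql file. fs.WalkDir visits
// entries in lexical order, which gives a stable ordering.
func loadSchema(fsys fs.FS) ([]tableDef, error) {
	var defs []tableDef
	err := fs.WalkDir(fsys, ".", func(p string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() || path.Ext(p) != ".sql" {
			return nil
		}
		data, err := fs.ReadFile(fsys, p)
		if err != nil {
			return err
		}
		defs = append(defs, tableDef{
			Module: path.Dir(p),
			Table:  strings.TrimSuffix(path.Base(p), ".sql"),
			SQL:    string(data),
		})
		return nil
	})
	return defs, err
}

// exampleFS is an in-memory stand-in for an embedded schema directory.
func exampleFS() fs.FS {
	return fstest.MapFS{
		"vreplication/vreplication.sql": {Data: []byte("CREATE TABLE IF NOT EXISTS _vt.vreplication (id INT)")},
		"vreplication/copy_state.sql":   {Data: []byte("CREATE TABLE IF NOT EXISTS _vt.copy_state (id INT)")},
		"onlineddl/schema_migrations.sql": {Data: []byte("CREATE TABLE IF NOT EXISTS _vt.schema_migrations (id INT)")},
	}
}

func main() {
	defs, err := loadSchema(exampleFS())
	if err != nil {
		panic(err)
	}
	for _, d := range defs {
		fmt.Printf("%s/%s\n", d.Module, d.Table)
	}
}
```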

rohit-nayak-ps · 2023-04-25T15:59:58Z

Fixed via #11520

rohit-nayak-ps added the Type: Bug, Needs Triage, Component: VReplication and Component: Query Serving labels and removed the Needs Triage label Apr 24, 2022

rohit-nayak-ps mentioned this issue Sep 15, 2022

POC: _vt schema initialization on tablet init using declarative schemas #11235

Closed

3 tasks

rohit-nayak-ps mentioned this issue Oct 17, 2022

vttablet sidecar schema:use schemadiff to reach desired schema on tablet init replacing the withDDL-based approach #11520

Merged

31 tasks

frouioui added this to v16.0.0 Nov 18, 2022

frouioui moved this to Backlog in v16.0.0 Nov 18, 2022

frouioui added and removed the Early in Cycle label Nov 18, 2022

frouioui removed this from v16.0.0 Nov 18, 2022

deepthi added this to v16.0.0 Dec 1, 2022

deepthi moved this to Backlog in v16.0.0 Dec 1, 2022

deepthi moved this from Backlog to In Progress in v16.0.0 Dec 1, 2022

deepthi assigned rsajwani and rohit-nayak-ps Dec 1, 2022

rohit-nayak-ps mentioned this issue Feb 1, 2023

Tablet Init refactor followup #12207

Closed

5 tasks

frouioui removed this from v16.0.0 Feb 8, 2023

frouioui added this to v17.0.0 Feb 8, 2023

frouioui moved this to In Progress in v17.0.0 Feb 8, 2023

rohit-nayak-ps mentioned this issue Feb 15, 2023

SidecarDB Init: don't fail on schema init errors #12328

Merged

3 tasks

rohit-nayak-ps closed this as completed Apr 25, 2023

github-project-automation bot moved this from In Progress to Done in v17.0.0 Apr 25, 2023
