There is a race condition in VDiff when a workflow has multiple tables. Consider a workflow with two tables.
The diff of the first table is deemed done, for example because too many rows have errors. The corresponding shard streamers then wait for cancellation of the context passed into vdiff. That context is cancelled either when action_timeout (1 hour) expires or when the controller's context is cancelled, which happens once the entire vdiff (for all tables) is done.
The second table's diff starts. This creates a new channel, source.result = make(chan *sqltypes.Result, 1), for the shard streamers to communicate their rows on.
The second table's diff ends. The corresponding shard streamers for the second table are also waiting on the same conditions as the first table.
Since all tables have completed vdiff, the controller's context is done, and both sets of shard streamers try to close(source.result), which points to the same channel because of the race.
This results in a vttablet panic for the second streamer.
The race which I encountered happened under conditions of load where there was a load simulator running DMLs at ~1K QPS.
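To make the failure mode concrete, here is a minimal standalone sketch of the sequence above. It is not the VDiff code: the `source`, `result`, and `tableStream` types and all function names are hypothetical stand-ins, and the per-table channel guarded by `sync.Once` is just one possible way to avoid the double close, not necessarily the fix that should land.

```go
package main

import (
	"fmt"
	"sync"
)

// result is a hypothetical placeholder for *sqltypes.Result.
type result struct{}

// source is a hypothetical stand-in for the object whose result
// channel is replaced each time a new table's diff starts.
type source struct {
	result chan *result
}

// closeTwicePanics replays the sequence from the issue in a single
// goroutine and reports whether the second close panicked.
func closeTwicePanics() (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	src := &source{result: make(chan *result, 1)} // table 1's channel
	closeFirst := func() { close(src.result) }    // reads src.result when called
	src.result = make(chan *result, 1)            // table 2 replaces the channel
	closeSecond := func() { close(src.result) }
	closeFirst()  // closes table 2's channel, not table 1's
	closeSecond() // close of closed channel: panics
	return false
}

// tableStream is a hypothetical per-table holder: each table's
// streamers keep their own channel and close it at most once.
type tableStream struct {
	result    chan *result
	closeOnce sync.Once
}

func newTableStream() *tableStream {
	return &tableStream{result: make(chan *result, 1)}
}

// close is safe to call any number of times.
func (t *tableStream) close() {
	t.closeOnce.Do(func() { close(t.result) })
}

// closeGuardedPanics runs the same shutdown with per-table channels
// and reports whether anything panicked.
func closeGuardedPanics() (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	first, second := newTableStream(), newTableStream()
	first.close()
	second.close()
	second.close() // repeated close is a no-op
	return false
}

func main() {
	fmt.Println("shared channel panics:", closeTwicePanics())
	fmt.Println("per-table channel panics:", closeGuardedPanics())
}
```

The point of the sketch is that the first table's streamers look up the channel at close time rather than holding the one they were started with, so once the second table has replaced it, both closes hit the same channel.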
rohit-nayak-ps changed the title from "VDiff: panic for vdiffs with multiple tables under heavy load" to "VDiff: vttablet panics for vdiffs with multiple tables under heavy load" on Oct 24, 2023
Reproduction Steps
Happened while working on #14345
Binary Version