-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
VStreamer: fix deadlock when there are a lot of vschema changes at the same time as binlog events #11325
VStreamer: fix deadlock when there are a lot of vschema changes at the same time as binlog events #11325
Conversation
…stream load Signed-off-by: Rohit Nayak <[email protected]>
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Bug fixes
Non-trivial changes
New/Existing features
Backward compatibility
|
The VStreamer engine is somewhat unusual in two ways: 1. It is open and running on replica tablets rather than only running on primary tablets. 2. It has no controllers so the main engine mutex is widely shared. Because of this, when a tablet has open vstreams (direct binary log streams) performing work and a state transition starts, it can deadlock with the tabletmanager's state lock when checking if the engine is open or not. Signed-off-by: Matt Lord <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! I had minor comments/suggestions/questions, so will unblock you and let you make the final call on those things.
|
||
flags := &vtgatepb.VStreamFlags{} | ||
|
||
ctx2, cancel := context.WithTimeout(ctx, 2*time.Minute) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This timeout is equal to the extendedTimeout var, right? Might as well use that IMO so that we can easily change timeouts in various places.
"Args": ["vitess.io/vitess/go/test/endtoend/vreplication", "-run", "TestVSchemaChangesUnderLoad"], | ||
"Command": [], | ||
"Manual": false, | ||
"Shard": "vreplication_cellalias", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably rename the shard, e.g. vreplication_cellalias_vschema or something.
Signed-off-by: Rohit Nayak <[email protected]>
VStreamer: fix deadlock when there are a lot of vschema changes at the same time as binlog events (vitessio#11325) * Don't block on vschema channel in case of heavy vschema changes and vstream load Signed-off-by: Rohit Nayak <[email protected]> * Prevent VStreamer engine deadlocks during state transitions The VStreamer engine is somewhat unusual in two ways: 1. It is open and running on replica tablets rather than only running on primary tablets. 2. It has no controllers so the main engine mutex is widely shared. Because of this, when a tablet has open vstreams (direct binary log streams) performing work and a state transition starts, it can deadlock with the tabletmanager's state lock when checking if the engine is open or not. Signed-off-by: Matt Lord <[email protected]> * Address review comments Signed-off-by: Rohit Nayak <[email protected]> Signed-off-by: Rohit Nayak <[email protected]> Signed-off-by: Matt Lord <[email protected]> Co-authored-by: Matt Lord <[email protected]> Signed-off-by: Vilius Okockis <[email protected]>
Description
This fixes the bug reported in #11169
Pre-conditions for bug:
Bug cause:
vstreamer.SetVSchema()
the new vschema is written to a single message buffered channel for each active vstreamer. This is read by the vstreamer event processor, holding the vstreamer engine mutex.case
statements: one reads from the vschema channel and other reads the binlog events. The binlog events are streamed to the receiver. If the receiver is not reading them then the send will block. This means that the othercase
which polls the vschema change channel will not get called. This results in the vstreamer engine lock being held until all events are streamedThis PR drains the vschema channel so that only the latest vschema update is sent. So it no longer holds the vstreamer lock for an extended time allowing the PRS to proceed.
This PR also incorporates the changes made in #11268 which also solves the same issue in a different way: the vschema update no longer holds a lock (since it uses
atomic.int32
to signal whether the vstreamer engine is open, rather than the mutex). #11268 also adds checks for cancelled contexts.Related Issue(s)
fixes #11169
Checklist