Prevent VStreamer engine deadlocks during state transitions #11268

Closed
wants to merge 1 commit into from

Conversation

mattlord
Contributor

@mattlord mattlord commented Sep 20, 2022

Description

The VStreamer engine is somewhat unusual in two ways:

  1. It is open and running on replica tablets rather than only running on primary tablets.
  2. It has no controllers -- but instead has a list of uvstreamers -- so the main engine mutex is widely shared.

Because of this, when a tablet has open vstreams (direct binary log streams) performing work that requires RPCs (such as handling VSchema updates) and a state transition starts, the tablet can deadlock across the VStreamerEngine mutex, the UVStreamer mutex, and the TabletManager mutex when checking whether the engine is open as part of the TabletManager's ChangeType RPC call. More specifically (still trying to figure this out), the deadlock seems to be the following, as seen using the repro test here:

  1. TabletManager’s (RPC) mutex is held while performing the ChangeType RPC call. It blocks on the VStreamerEngine mutex when opening (or closing) the engine.
  2. VStreamer is streaming and holding the engine mutex, but then needs to handle the ApplyVSchema RPC calls caused by VSchema changes. While it is broadcasting these changes, the UVStreamer lock is held. It then blocks on the TabletManager’s (RPC) mutex??? Another key factor is that we can block on the vschema channel while handling the vschema changes. Another workaround was to increase the message buffering for that channel well beyond 1.
  3. We have a deadlock until we cancel the VSchema RPC call???

The blocking factors involved are:

  1. TabletManager's mutex
  2. VStreamerEngine's mutex
  3. Each UVStreamer's mutex
  4. The VStreamerEngine's vschema channel
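
One general way to take the "is the engine open?" check out of a cycle like this is to answer it without acquiring the engine mutex at all. The following is a sketch of that technique only; the `engine` type and its methods are hypothetical, and nothing here is confirmed to be what this PR changes. It requires Go 1.19 or later for `atomic.Bool`.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// engine is a hypothetical stand-in, not the real VStreamer engine.
type engine struct {
	mu     sync.Mutex  // still guards the rest of the engine state
	isOpen atomic.Bool // readable without holding mu
}

// Open marks the engine open while holding the mutex.
func (e *engine) Open() {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.isOpen.Store(true)
}

// Close marks the engine closed while holding the mutex.
func (e *engine) Close() {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.isOpen.Store(false)
}

// IsOpen never takes mu, so it cannot block behind a busy streamer.
func (e *engine) IsOpen() bool {
	return e.isOpen.Load()
}

func main() {
	e := &engine{}
	e.Open()

	e.mu.Lock() // simulate a streamer holding the engine mutex
	fmt.Println("open while mutex is held:", e.IsOpen())
	e.mu.Unlock()

	e.Close()
	fmt.Println("open after Close:", e.IsOpen())
}
```

The trade-off is that the flag can change immediately after it is read, so callers that need the state to stay fixed would still have to coordinate through the mutex.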

Related Issue(s)

Checklist

  • "Backport me!" label has been added if this change should be backported
  • Tests were added or are not required
  • Documentation was added or is not required

@vitess-bot
Contributor

vitess-bot bot commented Sep 20, 2022

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • If this is a change that users need to know about, please apply the release notes (needs details) label so that merging is blocked unless the summary release notes document is included.
  • If a new flag is being introduced, review whether it is really needed. The flag names should be clear and intuitive (as far as possible), and the flag's help should be descriptive. Additionally, flag names should use dashes (-) as word separators rather than underscores (_).
  • If a workflow is added or modified, each item in Jobs should be named in order to mark it as required. If the workflow should be required, the GitHub Admin should be notified.

Bug fixes

  • There should be at least one unit or end-to-end test.
  • The Pull Request description should either include a link to an issue that describes the bug OR an actual description of the bug and how to reproduce, along with a description of the fix.

Non-trivial changes

  • There should be some code comments as to why things are implemented the way they are.

New/Existing features

  • Should be documented, either by modifying the existing documentation or creating new documentation.
  • New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • vtctl command output order should be stable and awk-able.

The VStreamer engine is somewhat unusual in two ways:
  1. It is open and running on replica tablets rather than only
     running on primary tablets.
  2. It has no controllers so the main engine mutex is widely shared.

Because of this, when a tablet has open vstreams (direct binary log
streams) performing work and a state transition starts, it can
deadlock with the tabletmanager's state lock when checking if
the engine is open or not.

Signed-off-by: Matt Lord <[email protected]>
Development

Successfully merging this pull request may close these issues.

BUG: a vstream client can block a tablet ChangeType from replica => primary