Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MDEV-35355 : Galera test failure on galera_sr.mysql-wsrep-features#165 #3619

Open
wants to merge 1 commit into
base: 10.5
Choose a base branch
from

Conversation

janlindstrom
Copy link
Contributor

  • The Jira issue number for this PR is: MDEV-35355

Description

Problem was that in DeadlockChecker::trx_rollback() we hold lock_sys before we call wsrep_handle_SR_rollback() where THD::LOCK_thd_data (and some cases THD::LOCK_thd_kill) are acquired. This is against current mutex ordering rules.

However, acquiring THD::LOCK_thd_data is not necessary because we always are in victim_thd context, either client session is rolling back or rollbacker thread should be in control. Therefore, we should always use wsrep_thd_self_abort() and then no additional mutexes are required.

Fixed by removing locking of THD::LOCK_thd_data and using only wsrep_thd_self_abort(). In debug builds added assertions to verify that we are always in victim_thd context.

This fix is for MariaDB 10.5 and we already have a test case that sporadically fail in Jenkins before this fix.

Release Notes

TODO: What should the release notes say about this change?
Include any changed system variables, status variables or behaviour. Optionally list any https://mariadb.com/kb/ pages that need changing.

How can this PR be tested?

TODO: modify the automated test suite to verify that the PR causes MariaDB to behave as intended.
Consult the documentation on "Writing good test cases".

If the changes are not amenable to automated testing, please explain why not and carefully describe how to test manually.

Basing the PR against the correct MariaDB version

  • This is a new feature or a refactoring, and the PR is based against the main branch.
  • [ x] This is a bug fix, and the PR is based against the earliest maintained branch in which the bug can be reproduced.

PR quality check

  • [x ] I checked the CODING_STANDARDS.md file and my PR conforms to this where appropriate.
  • [ x] For any trivial modifications to the PR, I am ok with the reviewer making the changes themselves.

Problem was that in DeadlockChecker::trx_rollback() we hold lock_sys
before we call wsrep_handle_SR_rollback() where THD::LOCK_thd_data
(and some cases THD::LOCK_thd_kill) are acquired. This is against
current mutex ordering rules.

However, acquiring THD::LOCK_thd_data is not necessary because we
always are in victim_thd context, either client session is rolling
back or rollbacker thread should be in control. Therefore, we should
always use wsrep_thd_self_abort() and then no additional mutexes are
required.

Fixed by removing locking of THD::LOCK_thd_data and using only
wsrep_thd_self_abort(). In debug builds added assertions to
verify that we are always in victim_thd context.

This fix is for MariaDB 10.5 and we already have a test case
that sporadically fail in Jenkins before this fix.
@janlindstrom janlindstrom added the Codership Codership Galera label Nov 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Codership Codership Galera
Development

Successfully merging this pull request may close these issues.

2 participants