VReplication: retry in WaitForPos when read of pos is killed off by deadlock detector #10621

Merged
merged 8 commits into from
Jul 2, 2022

@mattlord mattlord commented Jun 30, 2022

Description

For reasons we have not pinned down, we've seen the deadlock detector fire more frequently when using MySQL 8.0 in our CI tests. This has, in particular, hit our VReplication-related tests, where there is very high contention on a small number of _vt.vreplication records. The record(s) in the table are read by some connections (by id, which is the Primary Key) while other connections constantly update them to record progress: pos, rows_copied, heartbeat, etc.

This has made the VReplication-related tests that use MySQL 8.0 flaky, in particular:

  • onlineddl_vrepl_stress_mysql80
  • vreplication_across_db_versions

In this PR we retry the read of the _vt.vreplication.pos field if we get a deadlock detected error when waiting for a tablet to reach a replication position, until we hit the context timeout.

Related Issue(s)

Fixes: #10590
Related-to: #10620

Checklist

  • "Backport me!" label has been added if this change should be backported
  • Tests are not required
  • Documentation is not required

vitess-bot bot commented Jun 30, 2022

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • If this is a change that users need to know about, please apply the release notes (needs details) label so that merging is blocked unless the summary release notes document is included.
  • If a new flag is being introduced, review whether it is really needed. The flag names should be clear and intuitive (as far as possible), and the flag's help should be descriptive.
  • If a workflow is added or modified, each item in Jobs should be named in order to mark it as required. If the workflow should be required, the GitHub Admin should be notified.

Bug fixes

  • There should be at least one unit or end-to-end test.
  • The Pull Request description should either include a link to an issue that describes the bug OR an actual description of the bug and how to reproduce, along with a description of the fix.

Non-trivial changes

  • There should be some code comments as to why things are implemented the way they are.

New/Existing features

  • Should be documented, either by modifying the existing documentation or creating new documentation.
  • New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • vtctl command output order should be stable and awk-able.

@shlomi-noach shlomi-noach left a comment

Looks good! Thank you for investigating this!

One thing: since calls to dbClient.ExecuteFetch can now go into a loop, we must check ctx.Done() or ctx.Err() before calling qr, err := dbClient.ExecuteFetch(binlogplayer.ReadVReplicationStatus(uint32(id)), 10). That requires some delicate refactoring, because we then cannot populate the error message with data from qr.

@mattlord

Good point! If we continually hit the deadlock error then it's possible that we won't get to the select where we read from the context's done channel... I'll think about this.

@mattlord mattlord requested a review from shlomi-noach June 30, 2022 16:35
@mattlord
@shlomi-noach done here: cb01ba8

Thanks again!

@mattlord

Well, the DCO check flakiness has forced me to run the CI tests quite a few times. The good news is that this has demonstrated that the tests using MySQL 8.0 are now very stable! 🙂 They would often need to be run 2 or 3 times to pass; now they're passing on the first try. 🥳

@mattlord mattlord merged commit d037f24 into vitessio:main Jul 2, 2022
@mattlord mattlord deleted the mysql8_ci_flakes branch July 2, 2022 20:00
Development

Successfully merging this pull request may close these issues.

Bug Report: Investigate MySQL 8.0 CI flakiness
2 participants