Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fallback to poller replication lag if heartbeat lag fails #207

Conversation

ejortegau
Copy link
Collaborator

Description

This PR should allow to fallback to poller replication lag tracker in case using the heartbeat lag tracker fails. It is meant as a temporary change to allow us to move to heartbeat lag. The reason this is needed is because moving a shard from poller to heartbeat lag tracking can lead to replicas taken out of service. For example:

  1. A replica is restarted with heartbeat enabled.
  2. It either fails to read from the heartbeat table because it has not yet been created in the source (because the source has not yet been restarted with the heartbeat tracker enabled), or finds the table but its contents are wrong (e.g. the table was manually created but is empty).
  3. The stateManager gets that error from the replication tracker and thinks the replica is unhealthy, taking it out of service.

With the changes in this PR, if there are issues getting the lag from the heartbeat, replication tracker uses poller lag instead, which is what we are currently using.

Once we are fully migrated to using heartbeats, these changes can be reverted.

@ejortegau ejortegau marked this pull request as ready for review February 8, 2024 09:20
@ejortegau ejortegau requested a review from a team as a code owner February 8, 2024 09:20
// rt.mode == tabletenv.Poller or fallback after heartbeat error
mysqlLag, mysqlErr = rt.poller.Status()
if fallbackToPoller && mysqlErr != nil {
return 0, errFallback
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ejortegau what if we just return rt.poller.Status() here? This would give you a more useful error although it wouldn't be clear there was a "fallback"

Wrapping the mysqlErr with errFallback might work too

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to explicitly return a different error when fallback fails. I will do some wrapping, though

Signed-off-by: Eduardo J. Ortega U <[email protected]>
Copy link
Member

@timvaillancourt timvaillancourt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggested one typo-fix in the error, but LGTM 👍

Co-authored-by: Tim Vaillancourt <[email protected]>
Signed-off-by: Eduardo J. Ortega U. <[email protected]>
if heartbeatLag, heartbeatErr = rt.hr.Status(); heartbeatErr == nil {
return heartbeatLag, heartbeatErr
}
fallbackToPoller = true
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this boolean? If either of the case statements are met, we would return out anyways

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then the condition below on line 158 can just be

if mysqlErr != nil {...}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so. The intention is to raise different errors in case rt.poller.Status() failed depending on whether fallback was used or not.

@ejortegau ejortegau requested a review from pbibra March 14, 2024 08:01
@ejortegau ejortegau merged commit af819e5 into slack-vitess-r14.0.5 Mar 27, 2024
244 checks passed
@ejortegau ejortegau deleted the ejortegau/simultaneous_heartbeat_poller_replication_tracker branch March 27, 2024 11:40
timvaillancourt added a commit that referenced this pull request May 27, 2024
* Fallback to poller replication lag if heartbeat lag fails

Signed-off-by: Eduardo J. Ortega U <[email protected]>

* Try to make CI pipeline happy

Signed-off-by: Eduardo J. Ortega U <[email protected]>

* Address PR comments

Signed-off-by: Eduardo J. Ortega U <[email protected]>

* Fix typo

Co-authored-by: Tim Vaillancourt <[email protected]>
Signed-off-by: Eduardo J. Ortega U. <[email protected]>

---------

Signed-off-by: Eduardo J. Ortega U <[email protected]>
Signed-off-by: Eduardo J. Ortega U. <[email protected]>
Co-authored-by: Tim Vaillancourt <[email protected]>
timvaillancourt added a commit that referenced this pull request May 28, 2024
* Fallback to poller replication lag if heartbeat lag fails



* Try to make CI pipeline happy



* Address PR comments



* Fix typo




---------

Signed-off-by: Eduardo J. Ortega U <[email protected]>
Signed-off-by: Eduardo J. Ortega U. <[email protected]>
Co-authored-by: Eduardo J. Ortega U <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants