Bug Report: Vitess local examples high replica heartbeat lag #14978

Closed
maxenglander opened this issue Jan 18, 2024 · 3 comments · Fixed by #14980

@maxenglander
Collaborator

maxenglander commented Jan 18, 2024

Overview of the Issue

The @replica and @rdonly tablets started by examples/local/101_initial_cluster.sh report heartbeat lag shortly after startup, and the reported lag keeps growing.

I was able to "fix" this by removing --heartbeat_enable from examples/common/scripts/vttablet-up.sh, but I don't know if that's a real fix.

I'm not entirely sure what the goal for heartbeating is in the local examples, but looking through the code, I think what happens is:

  1. On the primary there is an initial heartbeat when the heartbeat writer is opened.
    } else {
        // A one-time kick off of heartbeats upon Open()
        go w.RequestHeartbeats()
    }
  2. The primary enables writes, and then disables them after the on-demand duration elapses.
    w.enableWrites(true)
    w.concurrentHeartbeatRequests++
    time.AfterFunc(w.onDemandDuration, func() {
        w.onDemandMu.Lock()
        defer w.onDemandMu.Unlock()
        w.concurrentHeartbeatRequests--
        if w.concurrentHeartbeatRequests == 0 {
            // means there are currently no more clients interested in heartbeats
            w.enableWrites(false)
        }
        w.allowNextHeartbeatRequest()
    })
  3. After that, there is nothing to request on-demand heartbeats, so no further heartbeats are written.
  4. On the other tablets, --heartbeat_enable enables heartbeat read ticks, which keep reading the same, increasingly stale timestamp once writes stop in step 3.
  5. The /debug/status page consults that lag time, so those tablets show lag that grows without bound.

The throttler is the only thing in the code base I can see that requests on-demand heartbeats, and as far as I can see it's not enabled in the local examples.

Reproduction Steps

  1. Check out main
  2. . ./env and make build
  3. cd examples/local, . ../common/env.sh and ./101_initial_cluster.sh
  4. Open the VTTablet /debug/status for the @replica and @rdonly tablets
  5. Observe that the replica lag keeps increasing

Binary Version

./bin/vttablet --version
vttablet version Version: 19.0.0-SNAPSHOT (Git revision 1b328bffb853ed08da621a7d144a01e06a6cf8d3 branch 'main') built on Wed Jan 17 20:03:05 EST 2024 by [email protected] using go1.21.0 darwin/arm64

Operating System and Environment details

* Mac OS Sonoma 14.2.1
* Darwin Kernel Version 23.2.0: Wed Nov 15 21:53:34 PST 2023; root:xnu-10002.61.3~2/RELEASE_ARM64_T8103 arm64
* arm64

Log Fragments

No response

@maxenglander added the Type: Bug, Needs Triage, Component: Examples, and Component: Throttler labels and removed the Needs Triage label Jan 18, 2024
@mattlord added this to the v19.0.0 milestone Jan 18, 2024
@shlomi-noach
Contributor

Followup in #15099

@shlomi-noach
Contributor

Followup in #15204

@shlomi-noach
Contributor

Please see this followup issue: #15303
